# Curriculum learning

> Source: https://aiwiki.ai/wiki/curriculum_learning
> Updated: 2026-06-23
> Categories: Deep Learning, Machine Learning, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

Curriculum learning is a [training](/wiki/training) strategy for [machine learning](/wiki/machine_learning) models in which training examples are presented in a meaningful, easy-to-hard order rather than at random, mirroring how human education introduces simple concepts before complex ones. The idea was formalized by Yoshua Bengio, Jerome Louradour, Ronan Collobert, and Jason Weston at the 2009 International Conference on Machine Learning (ICML), whose paper opens with the observation that "humans and animals learn much better when the examples are not randomly presented but organized in a meaningful order which illustrates gradually more concepts, and gradually more complex ones." [1] Presenting data this way has been shown to speed up convergence and improve generalization on many tasks, and the technique has since been applied to [deep learning](/wiki/deep_learning), [reinforcement learning](/wiki/reinforcement_learning), [computer vision](/wiki/computer_vision), [natural language processing](/wiki/natural_language_processing), and [large language model](/wiki/large_language_model) pretraining. [1][12]

Bengio and colleagues advanced two central hypotheses: that curriculum learning accelerates convergence, and that for the non-convex objectives typical of deep learning it can guide optimization toward better local minima. [1] In their framing, a curriculum is a sequence of training criteria that gradually increases the weight on more difficult examples until the target distribution is reached.

## Historical Background

### When did curriculum learning originate?

The intellectual roots of curriculum learning extend back to cognitive science and developmental psychology. In 1993, Jeffrey Elman published "Learning and Development in Neural Networks: The Importance of Starting Small" in *Cognition*, in which he trained [recurrent neural networks](/wiki/recurrent_neural_network) to process complex sentences involving relative clauses and number agreement. [2] Elman found that networks with full adult-level capacity could not learn the grammar at all; training succeeded only when networks began with limited working memory that gradually "matured" to the adult state. [2] As Elman put it, "successful learning may depend on starting small." [2] His key insight was that developmental restrictions on resources may serve as a necessary prerequisite for mastering certain complex domains: by starting with simpler input, the network developed internal representations of basic grammatical structure that constrained the solution space for subsequent, more complex learning.

This idea also connects to the concept of shaping in behavioral psychology, introduced by B.F. Skinner, where animals learn complex behaviors through a sequence of progressively more challenging tasks with appropriate rewards at each stage.

### How did Bengio et al. (2009) formalize curriculum learning?

The term "curriculum learning" was formally introduced by [Yoshua Bengio](/wiki/yoshua_bengio), Jerome Louradour, Ronan Collobert, and Jason Weston in their 2009 paper presented at the 26th International Conference on Machine Learning (ICML), where it appeared on pages 41-48 of the proceedings. [1] The paper drew a direct analogy to human education: just as students learn arithmetic before calculus, [neural networks](/wiki/neural_network) might benefit from structured exposure to training data ordered by difficulty.

Bengio et al. advanced two central hypotheses. First, curriculum learning accelerates the convergence of the training process. Second, for non-convex optimization criteria (which characterize most deep learning problems), curriculum learning can guide the model toward better local minima. [1] The authors connected curriculum learning to continuation methods, a family of global optimization strategies that solve a sequence of smoothed approximations to a non-convex problem, gradually increasing the problem's complexity. [1]
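
The paper's definition of a curriculum can be written compactly. In the notation below, $P(z)$ is the target training distribution, $W_\lambda(z) \in [0, 1]$ is the weight applied to example $z$ at curriculum step $\lambda \in [0, 1]$, $Q_\lambda$ is the resulting reweighted distribution, and $H$ denotes entropy:

$$
Q_\lambda(z) \propto W_\lambda(z)\,P(z), \qquad \int Q_\lambda(z)\,dz = 1, \qquad Q_1(z) = P(z)
$$

The sequence $Q_\lambda$ is called a curriculum when, for every $\epsilon > 0$, both of the following hold:

$$
H(Q_\lambda) < H(Q_{\lambda+\epsilon}), \qquad W_{\lambda+\epsilon}(z) \ge W_\lambda(z) \quad \forall z
$$

Read informally, the diversity of the training distribution grows over time, and no example ever loses weight as training moves toward the target distribution.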

#### Experimental Results

The 2009 paper reported experiments on two tasks: [1]

- **Shape recognition**: A three-hidden-layer neural network was trained via [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) on shape classification datasets (BasicShapes and GeomShapes). Training with a curriculum that introduced geometrically simple shapes before complex ones led to improved classification accuracy and faster convergence compared to random ordering. [1]
- **Language modeling**: A ranking-based [language model](/wiki/language_model) was trained on Wikipedia text. The curriculum filtered training examples by vocabulary frequency: the model first trained only on sentences whose words fell within the 5,000 most frequent words (reducing the training set from 631 million to 270 million examples), then expanded to the 10,000 most frequent words (370 million examples), and finally to the full vocabulary. This staged approach improved the quality of the learned representations. [1]

Across both tasks, curriculum-trained models achieved better generalization and converged faster than models trained on randomly shuffled data. [1]
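
The vocabulary-based staging used in the language modeling experiment can be illustrated with a small sketch. This is not the authors' code: the function names `top_k_vocab` and `vocabulary_stages` are hypothetical, the corpus is assumed to be a list of tokenized examples, and `None` is used here as a marker for the final full-vocabulary stage.

```python
from collections import Counter


def top_k_vocab(corpus, k):
    """Return the set of the k most frequent tokens in a tokenized corpus."""
    counts = Counter(token for example in corpus for token in example)
    return {token for token, _ in counts.most_common(k)}


def vocabulary_stages(corpus, vocab_sizes):
    """Yield (k, subset) pairs, one per curriculum stage.

    Each subset keeps only the examples whose tokens all fall inside the
    top-k vocabulary. A size of None stands for the full, unfiltered corpus.
    """
    for k in vocab_sizes:
        if k is None:
            yield k, list(corpus)
            continue
        vocab = top_k_vocab(corpus, k)
        yield k, [ex for ex in corpus if all(tok in vocab for tok in ex)]


corpus = [
    ["the", "cat"],
    ["the", "dog"],
    ["the", "cat", "sat"],
    ["a", "rare", "word"],
]
for k, subset in vocabulary_stages(corpus, [2, None]):
    print(k, len(subset))
```

Training would then run on each stage's subset in turn, so the model only meets rare words once it has seen the frequent ones.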

## Core Concepts and Framework

### What are the two components of a curriculum?

Modern curriculum learning methods can be decomposed into two core components, as described in the survey by Wang, Chen, and Zhu (2021), which frames curriculum design as a "Difficulty Measurer + Training Scheduler" pipeline, and in the survey by Soviany et al. (2022): [12][14]

| Component | Role | Examples |
|---|---|---|
| **Difficulty Measurer** | Assigns a difficulty score to each training sample | Sentence length, loss value, confidence score from a pretrained model, distance to decision boundary, noise level |
| **Training Scheduler** | Determines when harder samples are introduced during training | Fixed schedule (step functions), continuous pacing functions, exponential pacing, linear pacing |

The difficulty measurer answers the question "which examples are easy and which are hard?" while the training scheduler answers "when should the model see harder examples, and how many more?" [12] Together, these two components define a complete curriculum.
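
As a rough illustration of how the two components fit together, the sketch below pairs a length-based difficulty measurer with a linear training scheduler. All names (`difficulty_by_length`, `linear_scheduler`, `curriculum_subset`) and the default `start_fraction` are assumptions made for this example rather than part of either survey.

```python
def difficulty_by_length(example):
    """Difficulty measurer: more whitespace-separated tokens means harder."""
    return len(example.split())


def linear_scheduler(step, total_steps, start_fraction=0.2):
    """Training scheduler: fraction of the easiest data available at `step`.

    Grows linearly from `start_fraction` at step 0 to 1.0 at `total_steps`.
    """
    progress = min(1.0, step / total_steps)
    return min(1.0, start_fraction + (1.0 - start_fraction) * progress)


def curriculum_subset(examples, step, total_steps,
                      measurer=difficulty_by_length,
                      scheduler=linear_scheduler):
    """Return the examples the model is allowed to sample from at `step`."""
    ranked = sorted(examples, key=measurer)          # easy -> hard
    n_available = max(1, round(scheduler(step, total_steps) * len(ranked)))
    return ranked[:n_available]


examples = ["a b c d e", "a", "a b c", "a b", "a b c d"]
print(curriculum_subset(examples, step=0, total_steps=10))
print(curriculum_subset(examples, step=5, total_steps=10))
```

Because the measurer and scheduler are passed in as plain functions, either one can be swapped (for instance, a loss-based measurer or an exponential scheduler) without touching the other.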

### Predefined vs. Automatic Curricula

Curriculum learning methods broadly fall into two categories:

- **Predefined curriculum learning**: The difficulty measure and training schedule are designed manually using domain-specific heuristics. For example, in [neural machine translation](/wiki/machine_translation), sentence length is a natural proxy for difficulty. In image classification, clarity or the absence of occlusion can indicate easier examples. These approaches are simple to implement but require expert knowledge.
- **Automatic curriculum learning**: The curriculum is learned or adapted during training based on feedback from the model itself. This category includes self-paced learning, teacher-student methods, and reinforcement learning-based curriculum design. Automatic methods remove the need for hand-crafted heuristics but introduce additional computational overhead.

### Taxonomy of Curriculum Types

Soviany et al. (2022) proposed a four-level taxonomy of curriculum learning approaches: [14]

| Curriculum Type | Description | Example |
|---|---|---|
| **Data-level** | Training examples are ordered from easy to hard | Presenting short sentences before long sentences in NMT |
| **Model-level** | The modeling capacity of the network is gradually increased | Starting with a shallow network and progressively adding layers |
| **Task-level** | The complexity of the learning task itself increases during training | Training on simple subtasks before the full task in RL |
| **Objective-level** | The model optimizes toward an increasingly complex objective | Beginning with a simple loss function and transitioning to a harder one |

## Self-Paced Learning

### What is self-paced learning (Kumar, Packer, and Koller, 2010)?

Self-paced learning (SPL) was introduced by M. Pawan Kumar, Benjamin Packer, and Daphne Koller in their 2010 [NeurIPS](/wiki/neurips) paper "Self-Paced Learning for Latent Variable Models," published on pages 1189-1197 of the proceedings. [3] While Bengio's original curriculum learning relies on a predefined, fixed ordering of examples, SPL lets the model itself decide which examples are easy or hard at each stage of training. [3]

The SPL algorithm works iteratively: at each step, it simultaneously selects a subset of easy samples (those with low training loss) and updates the model parameters. A weight parameter controls how many samples are selected, and this weight is annealed over time so that progressively more difficult samples are included until the entire training set is being used. [3]
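
The alternation between sample selection and parameter updates can be sketched on a toy problem. The example below is an illustrative simplification, not the latent-variable setup of the paper: the "model" is a single scalar fitted with squared loss, `threshold` plays the role of the annealed weight parameter (a sample counts as easy when its loss is below it), and the median initialization and `growth` factor are assumptions.

```python
def self_paced_mean(values, threshold, growth=2.0, rounds=10):
    """Fit a scalar to `values` while admitting samples easy-first.

    Returns the final estimate and the number of samples selected per round.
    """
    estimate = sorted(values)[len(values) // 2]   # assumed starting point
    selected_counts = []
    for _ in range(rounds):
        # Step 1: with the model fixed, select samples with low current loss.
        selected = [v for v in values if (v - estimate) ** 2 < threshold]
        # Step 2: with the selection fixed, refit the model on those samples.
        if selected:
            estimate = sum(selected) / len(selected)
        selected_counts.append(len(selected))
        # Anneal: raise the threshold so harder samples enter later rounds.
        threshold *= growth
    return estimate, selected_counts


values = [1.0, 1.1, 0.9, 1.2, 0.8, 5.0]   # 5.0 acts as a hard outlier
estimate, counts = self_paced_mean(values, threshold=0.05)
print(counts)
```

In this toy run the outlier is admitted only in the last round, after the estimate has already settled on the bulk of the data.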

Kumar et al. demonstrated that SPL outperformed existing methods on four tasks: object localization, noun phrase coreference resolution, motif finding, and handwritten digit recognition. [3] The strength of SPL lies in its adaptiveness: rather than relying on a fixed external notion of difficulty, the model's own learning state determines which examples to prioritize.

### Self-Paced Curriculum Learning (Jiang et al., 2015)

Lu Jiang, Deyu Meng, Qian Zhao, Shiguang Shan, and Alexander Hauptmann unified curriculum learning and self-paced learning in their 2015 AAAI paper "Self-Paced Curriculum Learning" (SPCL), which appeared on pages 2694-2700. [4] They observed that CL and SPL each have complementary strengths: CL incorporates prior knowledge about sample difficulty (from an instructor), while SPL adapts to the model's current learning state (the student's perspective). [4]

SPCL combines both signals into a single optimization framework. In the analogy to human education, SPCL corresponds to an "instructor-student collaborative" learning mode, as opposed to "instructor-driven" (pure CL) or "student-driven" (pure SPL). [4] The unified approach demonstrated improved performance over either method alone on tasks including matrix factorization and multimedia event detection. [4]

## Automated Curriculum Design

### Graves et al. (2017): Multi-Armed Bandits

Alex Graves, Marc G. Bellemare, Jacob Menick, Remi Munos, and Koray Kavukcuoglu presented "Automated Curriculum Learning for Neural Networks" at ICML 2017, where it appeared as PMLR 70:1311-1320. [5] Their approach frames curriculum selection as a [multi-armed bandit](/wiki/multi_armed_bandit) problem. A nonstationary bandit algorithm selects which tasks or data subsets to present to the network at each training step, using a reward signal based on how much the network learns from each sample. [5]

The authors considered multiple signals of learning progress, including the rate of increase in prediction accuracy and the rate of increase in network complexity. [5] Experiments on [LSTM](/wiki/long_short-term_memory_lstm) networks across three curricula demonstrated that automated curriculum selection could, in some cases, halve the training time needed to reach a satisfactory performance level. [5]
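
A bandit-driven task selector along these lines might look as follows. This sketch uses a plain Exp3-style update, which is a simplification and not necessarily the exact nonstationary variant used in the paper; the class name `Exp3TaskSelector`, the exploration rate `gamma`, and the assumption that learning-progress rewards have been rescaled to [0, 1] are all choices made for this example.

```python
import math
import random


class Exp3TaskSelector:
    """Adversarial bandit that picks which task to train on next."""

    def __init__(self, n_tasks, gamma=0.1, seed=0):
        self.n_tasks = n_tasks
        self.gamma = gamma                      # exploration rate
        self.log_weights = [0.0] * n_tasks      # kept in log space for stability
        self.rng = random.Random(seed)

    def probabilities(self):
        """Mix the softmax over weights with uniform exploration."""
        top = max(self.log_weights)
        weights = [math.exp(lw - top) for lw in self.log_weights]
        total = sum(weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.n_tasks
                for w in weights]

    def choose(self):
        probs = self.probabilities()
        return self.rng.choices(range(self.n_tasks), weights=probs)[0]

    def update(self, task, reward):
        """`reward` is assumed to lie in [0, 1], e.g. rescaled learning progress."""
        prob = self.probabilities()[task]
        # Importance-weighted reward keeps the estimate unbiased.
        self.log_weights[task] += self.gamma * (reward / prob) / self.n_tasks


selector = Exp3TaskSelector(n_tasks=3)
for _ in range(300):
    task = selector.choose()
    # Toy signal: only task 1 currently yields learning progress.
    selector.update(task, reward=1.0 if task == 1 else 0.0)
print(selector.probabilities())
```

In a real training loop the reward would come from a measured progress signal, such as the drop in loss on a sample before and after the update.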

### Teacher-Student Curriculum Learning (Matiisen et al., 2017)

Tambet Matiisen, Avital Oliver, Taco Cohen, and John Schulman proposed Teacher-Student Curriculum Learning (TSCL), where a separate "Teacher" selects training tasks for a "Student" network. [6] The Teacher monitors the Student's learning progress and preferentially assigns tasks where the Student is improving most quickly (where the slope of the learning curve is highest). To counter the problem of catastrophic forgetting, the Teacher also revisits tasks where the Student's performance has declined. [6]
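
The Teacher's selection rule can be sketched in a few lines. The version below is a deliberately minimal assumption-laden example: the slope is estimated from only the two most recent scores per task, exploration is epsilon-greedy, and the class name `SlopeTeacher` is hypothetical. Using the absolute value of the slope is what lets the Teacher return to tasks whose scores are falling.

```python
import random


class SlopeTeacher:
    """Pick the task whose score is changing fastest, in either direction."""

    def __init__(self, n_tasks, epsilon=0.1, seed=0):
        self.last_score = [None] * n_tasks
        self.slope = [0.0] * n_tasks
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def choose_task(self):
        # Try every task once before trusting any slope estimate.
        untried = [t for t, s in enumerate(self.last_score) if s is None]
        if untried:
            return untried[0]
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.slope))
        # Large positive slope: fast learning. Large negative slope: forgetting.
        return max(range(len(self.slope)), key=lambda t: abs(self.slope[t]))

    def observe(self, task, score):
        """Record the Student's latest score on `task` and update its slope."""
        previous = self.last_score[task]
        if previous is not None:
            self.slope[task] = score - previous
        self.last_score[task] = score


teacher = SlopeTeacher(n_tasks=3, epsilon=0.0)
for task, score in [(0, 0.50), (0, 0.52), (1, 0.10), (1, 0.40), (2, 0.90), (2, 0.40)]:
    teacher.observe(task, score)
print(teacher.choose_task())
```

Here task 2 is chosen because its score dropped sharply, which is the forgetting case described above.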

TSCL matched or surpassed hand-crafted curricula on two tasks: addition of decimal numbers with LSTMs and navigation in Minecraft. [6] The work first circulated as a preprint in 2017 and was published in IEEE Transactions on Neural Networks and Learning Systems in 2020. [6]

### RL-Based Curriculum Generation

Several methods use [reinforcement learning](/wiki/reinforcement_learning) to train a curriculum policy. In these approaches, the teacher is itself an RL agent whose action space consists of selecting tasks or difficulty levels for the student, and whose reward is derived from the student's learning progress. This creates a bilevel optimization problem where the outer loop trains the teacher and the inner loop trains the student. Such methods have been applied to robotics, game playing, and dialogue systems.

## Applications by Domain

### How is curriculum learning used in computer vision?

Curriculum learning has been widely applied in [computer vision](/wiki/computer_vision) tasks. Hacohen and Weinshall (2019) conducted a thorough study of curriculum learning for [convolutional neural network](/wiki/convolutional_neural_network) image classification in their ICML paper "On The Power of Curriculum Learning in Training Deep Networks," published on pages 2535-2544. [9] They proposed two methods for scoring example difficulty: transfer scoring (using confidence scores from a pretrained teacher network) and self-taught scoring (using the network's own confidence from an initial training pass). [9] Combined with an exponential pacing function that gradually expands the training set, their method consistently accelerated learning and improved final accuracy on standard benchmarks. [9]
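
An exponential pacing function of the kind described here can be sketched as a step-wise schedule in which the available training set grows by a constant factor every fixed number of steps. The parameter names and default values (`start_fraction`, `growth`, `step_length`) are assumptions for illustration and are not taken from the paper.

```python
def exponential_pacing(step, n_examples, start_fraction=0.1,
                       growth=2.0, step_length=100):
    """Number of easiest examples available for sampling at `step`.

    The available fraction starts at `start_fraction` and is multiplied by
    `growth` once every `step_length` steps, capped at the full dataset.
    """
    fraction = start_fraction
    for _ in range(step // step_length):
        if fraction >= 1.0:
            break                      # already using the whole dataset
        fraction *= growth
    fraction = min(1.0, fraction)
    return max(1, round(fraction * n_examples))


for step in (0, 100, 200, 300, 400):
    print(step, exponential_pacing(step, n_examples=1000))
```

With examples sorted from easy to hard by a scoring function, each minibatch at a given step would be drawn uniformly from the first `exponential_pacing(step, n)` examples.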

In medical imaging, curriculum learning has been applied to address challenges with limited labeled data and noisy annotations. The CURVETE framework (Curriculum Learning and Progressive Self-supervised Training) introduced a progressive curriculum based on sample decomposition granularity for medical image classification tasks, including brain tumor detection, knee X-ray analysis, and mammography screening.

Curriculum learning has also been employed for [object detection](/wiki/object_detection) and [image segmentation](/wiki/image_segmentation), where training can begin with images containing clearly visible, well-separated objects before introducing crowded scenes with heavy occlusion.

### Why is curriculum learning a natural fit for machine translation?

Curriculum learning has proven particularly natural for [neural machine translation](/wiki/machine_translation) (NMT), where sentence length and vocabulary rarity provide intuitive difficulty measures.

**Kocmi and Bojar (2017)** conducted the first systematic study of curriculum learning for NMT. [7] They explored two types of data ordering for a Czech-English translation system: organizing minibatches so that each batch contains sentences similar in some linguistic aspect, and gradually introducing more complex sentence types as training progresses. [7]

**Platanios et al. (2019)** introduced competence-based curriculum learning for NMT in their NAACL paper (pages 1162-1172). [8] Their framework defines a model competence function that increases over training time, and at each step only trains on examples whose estimated difficulty falls within the model's current competence level. The difficulty of each sentence pair is estimated based on features such as length and word rarity. This approach achieved up to a 70% reduction in training time and improvements of up to 2.2 [BLEU](/wiki/bleu) points on standard translation benchmarks, for both [recurrent neural network](/wiki/recurrent_neural_network) and [Transformer](/wiki/transformer) architectures. [8]
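
The filtering rule can be sketched with a square-root competence schedule, one of the functional forms proposed for this framework. In the example below, difficulties are assumed to be already normalized to [0, 1] (for instance as a cumulative distribution over the training set), `initial` is the assumed starting competence, and the function names are hypothetical.

```python
import math


def sqrt_competence(step, total_steps, initial=0.01):
    """c(t) = min(1, sqrt(t * (1 - c0^2) / T + c0^2)).

    Competence rises quickly at first and reaches 1.0 at `total_steps`.
    """
    value = step * (1.0 - initial ** 2) / total_steps + initial ** 2
    return min(1.0, math.sqrt(value))


def eligible_examples(scored_examples, step, total_steps, initial=0.01):
    """Keep examples whose difficulty is within the current competence."""
    competence = sqrt_competence(step, total_steps, initial)
    return [ex for ex, difficulty in scored_examples if difficulty <= competence]


scored = [("short pair", 0.10), ("medium pair", 0.50), ("long rare pair", 0.95)]
for step in (0, 250, 1000):
    print(step, eligible_examples(scored, step, total_steps=1000, initial=0.2))
```

The square-root shape spends less time on the easiest examples than a linear schedule would, since each newly admitted example is a smaller share of an already growing pool.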

### Does curriculum learning help large language model pretraining?

Curriculum learning has attracted significant interest for pretraining large language models, though results have been mixed and context-dependent.

**Sequence length warmup (Li et al., 2022)**: Conglong Li and colleagues at Microsoft, working within the [DeepSpeed](/wiki/deepspeed) framework, proposed using sequence length as a curriculum dimension for [GPT](/wiki/gpt) model pretraining. [15] Their approach starts training with shorter sequences and progressively increases sequence length. This enabled stable training with 8x larger batch sizes and 4x larger learning rates for [GPT-2](/wiki/gpt-2) models (117M and 1.5B parameters), whereas the baseline approach diverged. [15] To match baseline evaluation results, the method reduced the required training tokens and wall clock time by up to 2.2x and 3.7x, respectively. [15] For a 125M-parameter GPT-3-style model, the same technique enabled an 8x larger batch size and a 40x larger learning rate while retaining 99% of the zero-shot accuracy on 11 tasks using 10x less data and 17x less time. [15]
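
A sequence length schedule of this kind can be sketched as a linear ramp. This is a generic illustration rather than the DeepSpeed implementation or its configuration format: the function names, the rounding to a multiple of 8, and the default lengths are all assumptions.

```python
def sequence_length_at(step, warmup_steps, min_len=64, max_len=1024, multiple=8):
    """Sequence length to train with at `step`.

    Ramps linearly from `min_len` to `max_len` over `warmup_steps`, rounded
    down to a multiple of `multiple`, then stays at `max_len`.
    """
    if step >= warmup_steps:
        return max_len
    length = min_len + (max_len - min_len) * step / warmup_steps
    length = int(length) // multiple * multiple
    return max(min_len, min(max_len, length))


def truncate_batch(batch, length):
    """Cut every token sequence in the batch to the current length."""
    return [sequence[:length] for sequence in batch]


batch = [list(range(1024)), list(range(1024))]
for step in (0, 500, 1000):
    length = sequence_length_at(step, warmup_steps=1000)
    print(step, length, len(truncate_batch(batch, length)[0]))
```

Truncation is the simplest way to apply the schedule; it discards the tail of each sequence during warmup, which is a trade-off a real pipeline might handle differently.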

**Multi-stage pretraining**: Several recent LLMs, including [OLMo](/wiki/olmo) 2, [Phi-4](/wiki/phi), and others, have adopted a two-phase curriculum strategy. [17] The first phase trains on a data mixture dominated by large-scale web data. The second phase (sometimes called "mid-training") shifts to a mixture that primarily consists of high-quality, curated, and synthetic data; OLMo 2, for example, anneals on the smaller, higher-quality "Dolmino Mix 1124" set in its second stage. [17] This approach treats data quality as the curriculum dimension.

**Mixed findings at scale**: Research presented at NeurIPS 2024 found that curriculum learning has limited impact during post-training stages such as [supervised fine-tuning](/wiki/fine_tuning) (SFT) and reinforcement learning from human feedback ([RLHF](/wiki/rlhf)), where standard random sampling often performs competitively. However, combining curriculum learning with appropriate [learning rate](/wiki/learning_rate) schedules and weight averaging can produce synergistic benefits during pretraining, improving downstream benchmark accuracy by over 1.5% through data reordering alone. [18]

### How does curriculum learning apply to reinforcement learning?

Curriculum learning is especially valuable in reinforcement learning, where agents often face sparse reward signals and must learn complex behaviors that are nearly impossible to acquire from scratch.

Narvekar et al. (2020) published a comprehensive framework and survey in the Journal of Machine Learning Research (volume 21, article 181, pages 1-50), classifying curriculum approaches for RL domains. [11] In RL, a curriculum can operate at multiple levels: ordering individual experience samples, sequencing subtasks, or progressively modifying the environment. [11]

**Automatic Domain Randomization (ADR)**: [OpenAI](/wiki/openai) applied a form of automatic curriculum learning in their work on solving a Rubik's Cube with a robotic hand (2019). [10] ADR starts with a single, non-randomized simulation environment and "automatically generates a distribution over randomized environments of ever-increasing difficulty" (friction, object size, lighting, and other physical parameters) as the agent's performance improves. [10] Each time the agent surpasses a performance threshold, the environment becomes more challenging. The resulting policy transferred successfully to a physical robot hand, solving a half-scrambled cube (about 15 face rotations) 60% of the time and a maximally difficult full scramble (26 face rotations) 20% of the time. [10]
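
The core feedback loop, in which a randomization range widens whenever performance clears a threshold, can be sketched as below. This is a heavily simplified illustration and not OpenAI's implementation: it tracks a single parameter, only ever widens the range, and the class name `AdrParameter`, the `threshold`, and the friction values are assumptions.

```python
import random


class AdrParameter:
    """One environment parameter whose sampling range grows over training."""

    def __init__(self, nominal, step, low_limit, high_limit):
        self.low = nominal          # start with no randomization at all
        self.high = nominal
        self.step = step
        self.low_limit = low_limit
        self.high_limit = high_limit

    def sample(self, rng):
        """Draw a value for the next simulated episode."""
        return rng.uniform(self.low, self.high)

    def widen(self):
        """Expand the range by one step on each side, within hard limits."""
        self.low = max(self.low_limit, self.low - self.step)
        self.high = min(self.high_limit, self.high + self.step)


def adr_update(param, success_rate, threshold=0.8):
    """Widen the range if recent performance meets the threshold."""
    if success_rate >= threshold:
        param.widen()
        return True
    return False


rng = random.Random(0)
friction = AdrParameter(nominal=1.0, step=0.1, low_limit=0.5, high_limit=1.5)
for success_rate in (0.5, 0.9, 0.95):
    adr_update(friction, success_rate)
    print(round(friction.low, 2), round(friction.high, 2), friction.sample(rng))
```

Since difficulty is tied to the agent's measured success, the environment distribution only becomes harder once the current one has been handled.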

**Reverse Curriculum Learning**: Xi et al. (2024) introduced reverse curriculum reinforcement learning (the R3 method) for training LLMs on reasoning tasks at ICML 2024. [16] Rather than ordering whole problems from easy to hard, the method builds the curriculum inside each worked demonstration: the agent first starts near the goal state (the end of the demonstration), where only a short continuation is needed, and the start state is then slid backward toward the beginning. This generates a sequence of increasingly challenging starting states in which the agent always has a feasible path toward the solution. [16] Using Llama2-7B, R3 surpassed an RL baseline on eight reasoning tasks by 4.1 points on average. [16]
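
The backward-sliding start state can be sketched as a list of demonstration prefixes that shrink from stage to stage. The sketch covers only the start-state generation, not the reinforcement learning loop, and the function name and the even spacing of stages are assumptions for this example.

```python
def reverse_curriculum_starts(demonstration, n_stages):
    """Return the demonstration prefix handed to the agent at each stage.

    Early stages give the agent almost the whole demonstration, so only the
    last step remains to be solved. Later stages give less and less, until
    the agent must solve the task from the very beginning.
    """
    n_steps = len(demonstration)
    starts = []
    for stage in range(1, n_stages + 1):
        n_given = n_steps - round(stage * n_steps / n_stages)
        starts.append(demonstration[:n_given])
    return starts


demo = ["step 1", "step 2", "step 3", "step 4"]
for stage, prefix in enumerate(reverse_curriculum_starts(demo, n_stages=4), start=1):
    print(stage, prefix)
```

The final stage hands over an empty prefix, which corresponds to the original task with no help from the demonstration.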

## Comparison of Approaches

The following table summarizes major curriculum learning approaches across domains:

| Approach | Year | Authors | Domain | Difficulty Measure | Key Contribution |
|---|---|---|---|---|---|
| Starting Small | 1993 | Elman | Language (RNN) | Working memory capacity | Showed that limited initial capacity aids grammar learning |
| Curriculum Learning | 2009 | Bengio, Louradour, Collobert, Weston | Vision, NLP | Predefined (shape complexity, vocabulary frequency) | Formalized CL; linked to continuation methods |
| Self-Paced Learning | 2010 | Kumar, Packer, Koller | Latent variable models | Training loss (model-determined) | Let the model select its own easy examples |
| Self-Paced Curriculum Learning | 2015 | Jiang, Meng, Zhao, Shan, Hauptmann | Multimedia, matrix factorization | Combined prior knowledge and model loss | Unified CL and SPL into one framework |
| Automated CL (Bandit) | 2017 | Graves, Bellemare, Menick, Munos, Kavukcuoglu | Sequence learning (LSTM) | Learning progress signals | Multi-armed bandit for curriculum selection |
| Teacher-Student CL | 2017 | Matiisen, Oliver, Cohen, Schulman | RL (Minecraft, arithmetic) | Student learning progress | Teacher network selects tasks for student |
| CL for NMT | 2017 | Kocmi, Bojar | Machine translation | Sentence length, vocabulary rarity | First CL study for NMT |
| Power of CL | 2019 | Hacohen, Weinshall | Image classification (CNN) | Transfer scoring, self-taught scoring | Scoring and pacing functions for vision |
| Competence-based CL | 2019 | Platanios, Stretcu, Neubig, Poczos, Mitchell | Machine translation | Competence function (length, rarity) | 70% training time reduction, +2.2 BLEU |
| ADR (OpenAI) | 2019 | OpenAI | Robotics (RL) | Environment randomization complexity | Sim-to-real transfer for dexterous manipulation |
| CL for RL (Survey) | 2020 | Narvekar, Peng, Leonetti, Sinapov, Taylor, Stone | Reinforcement learning | Various (task-level, sample-level) | Comprehensive CL framework for RL |
| Sequence Length Warmup | 2022 | Li et al. (Microsoft/DeepSpeed) | LLM pretraining (GPT) | Sequence length | Up to 3.7x wall clock speedup for GPT-2 |
| Reverse Curriculum RL | 2024 | Xi et al. | LLM reasoning | Distance from goal state | Backward curriculum for reasoning tasks |

## When Does Curriculum Learning Help?

The effectiveness of curriculum learning is not universal. Wu, Dyer, and Neyshabur (2021) published "When Do Curricula Work?" at ICLR 2021 (as an oral presentation), providing one of the most rigorous analyses of when curriculum learning provides genuine benefits. [13] Their headline finding was that curriculum, but not anti-curriculum, can improve performance under a limited training-time budget or in the presence of noisy data, and that much of the benefit traces to the dynamic training set size rather than the ordering itself. [13]

### Conditions Where CL Helps

- **Limited training budget**: When the model cannot train for as many epochs as it needs, curriculum learning front-loads learning on the most informative (easy) examples, extracting more value per update. Under constrained training time, curriculum-trained models consistently outperformed randomly trained models. [13]
- **Noisy training data**: When training labels contain errors or the data is otherwise corrupted, curriculum learning provides a natural form of denoising. Easy examples are more likely to be correctly labeled, so training on them first builds a stronger foundation before the model encounters noisy samples. Neither anti-curriculum nor random ordering provided similar benefits under noise. [13]
- **Non-convex optimization landscapes**: As Bengio et al. (2009) originally hypothesized, starting with simple examples produces smoother loss surfaces early in training, guiding optimization toward better regions of the parameter space. [1]
- **Sparse rewards in RL**: When the reward signal is rare or delayed, structured task sequencing is often necessary for the agent to learn at all. [11]

### Conditions Where CL May Not Help

- **Standard benchmarks with sufficient training time**: Wu et al. found that on typical benchmark datasets (CIFAR-10, CIFAR-100) under standard training conditions, curricula provided only marginal benefits. Randomly ordered samples performed comparably to curriculum-ordered ones, suggesting that the primary benefit in some cases comes from the dynamic training set size rather than the ordering itself. [13]
- **[Post-training](/wiki/post-training) (SFT and RLHF)**: Recent work has found that curriculum learning has limited impact during supervised fine-tuning and reinforcement learning from human feedback for large language models, where the optimal schedule varies across datasets. [18]
- **Anti-curriculum sometimes works**: In certain domains, training on the hardest examples first can be beneficial. The ACCAN method for speech recognition, for example, trains on examples with the lowest signal-to-noise ratio first and achieves competitive results.

### The Role of Pacing

An important nuance from the research is that the pacing function (how quickly the training set expands) may matter more than the ordering itself in some settings. Wu et al. observed that a "random curriculum" (expanding the training set size over time but in random order) sometimes performed as well as a true difficulty-ordered curriculum, implying that the gradual increase in data diversity contributes independently to the training benefit. [13]
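
To make the distinction between ordering and pacing concrete, the sketch below builds three pools (curriculum, anti-curriculum, and random) and applies the same pacing rule to each. The function names, the linear pacing rule, and the `start_fraction` default are assumptions for illustration and do not reproduce the experimental setup of the paper.

```python
import random


def build_pool(examples, scores, mode, seed=0):
    """Order examples for training under one of three ordering modes."""
    indices = list(range(len(examples)))
    if mode == "curriculum":
        indices.sort(key=lambda i: scores[i])                 # easy -> hard
    elif mode == "anti":
        indices.sort(key=lambda i: scores[i], reverse=True)   # hard -> easy
    elif mode == "random":
        random.Random(seed).shuffle(indices)                  # ignores scores
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [examples[i] for i in indices]


def paced_subset(pool, step, total_steps, start_fraction=0.25):
    """Same pacing for every mode: the first part of the pool, growing over time."""
    fraction = min(1.0, start_fraction + (1.0 - start_fraction) * step / total_steps)
    return pool[:max(1, round(fraction * len(pool)))]


examples = ["a", "b", "c", "d"]
scores = [3, 1, 4, 2]
for mode in ("curriculum", "anti", "random"):
    pool = build_pool(examples, scores, mode)
    print(mode, paced_subset(pool, step=0, total_steps=10), pool)
```

All three modes see the same number of examples at every step, so any performance gap between them isolates the effect of ordering, while a gap between any of them and standard training reflects the effect of the growing training set.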

## Defining Difficulty

A central challenge in curriculum learning is measuring example difficulty. Various approaches have been proposed:

| Difficulty Measure | Description | Applicable Domain |
|---|---|---|
| **Sentence length** | Longer sentences are harder | NMT, language modeling |
| **Vocabulary rarity** | Sentences with rare words are harder | NLP tasks |
| **Training loss** | High-loss examples are harder (self-paced) | Any supervised task |
| **Transfer scoring** | Low confidence from a pretrained model indicates difficulty | Image classification |
| **Self-taught scoring** | Low confidence from initial training pass | Image classification |
| **Edit distance** | Distance from prototypical examples | Text classification |
| **Image resolution / noise** | Blurry or noisy images are harder | Computer vision |
| **Number of objects** | Scenes with more objects are harder | Object detection |
| **Task complexity** | More steps or subgoals required | Reinforcement learning |
| **Domain randomization** | More environmental variation is harder | Sim-to-real robotics |
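
Two of the text-based measures in the table can be computed directly from corpus statistics. The sketch below is one possible formulation under stated assumptions: rarity is taken as the negative log-probability of a sentence under a unigram model estimated from the same corpus, and raw scores are mapped to [0, 1] by rank so that different measures become comparable. All function names are hypothetical.

```python
import math
from bisect import bisect_right
from collections import Counter


def unigram_probabilities(corpus):
    """Relative frequency of each token in a tokenized corpus."""
    counts = Counter(token for sentence in corpus for token in sentence)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


def sentence_length(sentence):
    """Longer sentences are treated as harder."""
    return len(sentence)


def word_rarity(sentence, probabilities):
    """Negative unigram log-probability: rare or numerous words raise the score."""
    return -sum(math.log(probabilities[token]) for token in sentence)


def to_cdf(scores):
    """Map raw scores to (0, 1]: the share of examples scoring at or below each."""
    ordered = sorted(scores)
    return [bisect_right(ordered, s) / len(scores) for s in scores]


corpus = [["the", "cat"], ["the", "the", "dog"], ["zebra"]]
probs = unigram_probabilities(corpus)
rarity = [word_rarity(s, probs) for s in corpus]
print([sentence_length(s) for s in corpus])
print(to_cdf(rarity))
```

Rank-based normalization makes the scores easy to compare against a competence or pacing value, at the cost of discarding how far apart the raw difficulties are.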

## Relationship to Other Techniques

Curriculum learning intersects with several related training strategies:

- **[Transfer learning](/wiki/transfer_learning)**: Both curriculum learning and transfer learning leverage easier or related tasks to improve learning on a target task. In curriculum learning, the easy and hard tasks share the same objective; in transfer learning, the source and target tasks may differ.
- **[Data augmentation](/wiki/data_augmentation)**: Progressive data augmentation can be viewed as a form of curriculum learning, where augmentation intensity increases over training. For example, gradually increasing the severity of random crops, color jitter, or noise injection creates an implicit difficulty curriculum.
- **Hard example mining**: While curriculum learning starts with easy examples, hard example mining (used in methods like Online Hard Example Mining for object detection) focuses training on the hardest examples. These approaches are complementary: a curriculum can start easy and shift toward hard example emphasis later in training.
- **[Knowledge distillation](/wiki/knowledge_distillation)**: A teacher model can provide soft labels that make difficult examples easier for a student model, effectively creating a curriculum through softened training targets.
- **Continuation methods**: As noted by Bengio et al. (2009), curriculum learning can be viewed as a form of continuation method where the training objective is gradually transformed from a smooth, easy-to-optimize function to the true (potentially non-convex) objective. [1]

## Practical Considerations

For practitioners considering curriculum learning, several practical points emerge from the literature:

1. **Start simple**: If domain-specific difficulty measures are available (sentence length in NLP, image clarity in vision), a predefined curriculum is straightforward to implement and often effective.
2. **Loss-based difficulty is a strong baseline**: Using the model's own training loss as a difficulty score, recomputed as training proceeds (as in self-paced learning), works across domains without requiring domain expertise. [3]
3. **Pacing matters**: Exponential pacing functions (gradually expanding the effective training set) tend to outperform linear or step-function schedules. [9]
4. **Interaction with learning rate**: Recent research has shown that curriculum learning interacts with learning rate schedules. Combining moderate learning rate decay with data curriculum ordering and weight averaging can produce improvements that exceed either technique alone. [18]
5. **Evaluate carefully**: Because curriculum learning changes the training distribution over time, standard training curves may be misleading. Always evaluate on a held-out validation set with the full data distribution. [13]

## References

1. Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). "Curriculum learning." *Proceedings of the 26th International Conference on Machine Learning (ICML)*, 41-48. https://dl.acm.org/doi/10.1145/1553374.1553380
2. Elman, J. L. (1993). "Learning and development in neural networks: the importance of starting small." *Cognition*, 48(1), 71-99. https://doi.org/10.1016/0010-0277(93)90058-4
3. Kumar, M. P., Packer, B., & Koller, D. (2010). "Self-paced learning for latent variable models." *Advances in Neural Information Processing Systems (NeurIPS)*, 1189-1197. https://proceedings.neurips.cc/paper/2010/hash/e57c6b956a6521b28495f2886ca0977a-Abstract.html
4. Jiang, L., Meng, D., Zhao, Q., Shan, S., & Hauptmann, A. (2015). "Self-paced curriculum learning." *Proceedings of the AAAI Conference on Artificial Intelligence*, 29(1), 2694-2700. https://ojs.aaai.org/index.php/AAAI/article/view/9608
5. Graves, A., Bellemare, M. G., Menick, J., Munos, R., & Kavukcuoglu, K. (2017). "Automated curriculum learning for neural networks." *Proceedings of the 34th International Conference on Machine Learning (ICML)*, PMLR 70:1311-1320. https://proceedings.mlr.press/v70/graves17a.html
6. Matiisen, T., Oliver, A., Cohen, T., & Schulman, J. (2020). "Teacher-student curriculum learning." *IEEE Transactions on Neural Networks and Learning Systems*, 31(9), 3732-3740. https://ieeexplore.ieee.org/document/8827566
7. Kocmi, T. & Bojar, O. (2017). "Curriculum learning and minibatch bucketing in neural machine translation." *Proceedings of RANLP 2017*. https://aclanthology.org/R17-1050/
8. Platanios, E. A., Stretcu, O., Neubig, G., Poczos, B., & Mitchell, T. M. (2019). "Competence-based curriculum learning for neural machine translation." *Proceedings of NAACL-HLT 2019*, 1162-1172. https://aclanthology.org/N19-1119/
9. Hacohen, G. & Weinshall, D. (2019). "On the power of curriculum learning in training deep networks." *Proceedings of the 36th International Conference on Machine Learning (ICML)*, 2535-2544. https://proceedings.mlr.press/v97/hacohen19a.html
10. OpenAI, Akkaya, I., et al. (2019). "Solving Rubik's Cube with a robot hand." *arXiv preprint arXiv:1910.07113*. https://arxiv.org/abs/1910.07113
11. Narvekar, S., Peng, B., Leonetti, M., Sinapov, J., Taylor, M. E., & Stone, P. (2020). "Curriculum learning for reinforcement learning domains: a framework and survey." *Journal of Machine Learning Research*, 21(181), 1-50. https://jmlr.org/papers/v21/20-212.html
12. Wang, X., Chen, Y., & Zhu, W. (2021). "A survey on curriculum learning." *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 44(9), 4555-4576. https://arxiv.org/abs/2010.13166
13. Wu, X., Dyer, E., & Neyshabur, B. (2021). "When do curricula work?" *Proceedings of the International Conference on Learning Representations (ICLR)*. https://arxiv.org/abs/2012.03107
14. Soviany, P., Ionescu, R. T., Rota, P., & Sebe, N. (2022). "Curriculum learning: a survey." *International Journal of Computer Vision*, 130, 1526-1565. https://doi.org/10.1007/s11263-022-01611-x
15. Li, C., Zhang, M., & He, Y. (2022). "Curriculum learning: a regularization method for efficient and stable billion-scale GPT model pre-training." *Advances in Neural Information Processing Systems (NeurIPS)*. arXiv:2108.06084. https://arxiv.org/abs/2108.06084
16. Xi, Z., et al. (2024). "Training large language models for reasoning through reverse curriculum reinforcement learning." *Proceedings of the 41st International Conference on Machine Learning (ICML)*. arXiv:2402.05808. https://arxiv.org/abs/2402.05808
17. OLMo Team, et al. (2024). "2 OLMo 2 Furious." *arXiv preprint arXiv:2501.00656*. https://arxiv.org/abs/2501.00656
18. Kim, et al. (2024). "Curriculum learning, learning-rate schedules, and weight averaging in language model training." *NeurIPS 2024*.