Staged training (also called multi-stage training or phased training) is a strategy in machine learning where a model is trained in two or more distinct sequential phases, each with its own objective, data mix, optimizer settings, or loss function. Outputs of an earlier stage (typically learned weights) become the starting point for the next, so the model gradually acquires capabilities rather than learning everything at once. Staged training underlies most modern large language model recipes, where a single model passes through pretraining, continued pretraining, supervised fine-tuning, and preference alignment before deployment.
The approach is also widely used outside language modeling: two-stage object detectors split detection into proposal generation and classification, retrieval systems split candidate generation from reranking, and diffusion model distillation trains a slow teacher and then a fast student. The common thread is that breaking training into phases lets practitioners pick the data, hyperparameters, and loss best suited to each phase rather than averaging them across one monolithic run.
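To make the definition concrete, the sketch below shows the control flow that most staged recipes share: each stage carries its own data, learning rate, step budget, and loss, and each stage starts from the weights the previous stage produced. It is a minimal PyTorch sketch with a toy model and placeholder data, not any particular published recipe.

```python
# Minimal sketch of a staged training loop (PyTorch). Model, data, and losses are placeholders.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(16, 4)  # stand-in for a real network

# Each stage bundles its own data, learning rate, step budget, and loss.
stages = [
    {"name": "pretrain", "lr": 3e-4, "steps": 200, "loss_fn": nn.CrossEntropyLoss(),
     "data": TensorDataset(torch.randn(512, 16), torch.randint(0, 4, (512,)))},
    {"name": "finetune", "lr": 1e-5, "steps": 50, "loss_fn": nn.CrossEntropyLoss(),
     "data": TensorDataset(torch.randn(128, 16), torch.randint(0, 4, (128,)))},
]

for stage in stages:
    # A fresh optimizer per stage: later stages typically run at much lower learning rates.
    optimizer = torch.optim.AdamW(model.parameters(), lr=stage["lr"])
    loader = DataLoader(stage["data"], batch_size=32, shuffle=True)
    step = 0
    while step < stage["steps"]:
        for x, y in loader:
            loss = stage["loss_fn"](model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= stage["steps"]:
                break
    # The end-of-stage checkpoint becomes the next stage's starting point.
    torch.save(model.state_dict(), f"{stage['name']}.pt")
```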
Staged training overlaps with several adjacent ideas, but the differences matter when reading papers.
| concept | what it does | how it differs from staged training |
|---|---|---|
| Curriculum learning | Presents examples in easy-to-hard order during a single training run | A subset of staged training when the curriculum is implemented as discrete phases; otherwise it varies sample weights continuously |
| Transfer learning | Reuses a model trained on one task as the starting point for another | Always staged in spirit (pretrain then transfer), but the term is broader and includes reuse of frozen features |
| Multi-task learning | Trains on multiple objectives simultaneously | Objectives are mixed in every batch rather than executed in sequence |
| Continual learning | Trains on a stream of tasks while avoiding catastrophic forgetting | Concerned with retaining old skills as new ones arrive, not with reaching a final fixed model |
| Fine-tuning | Updates a pretrained model on a downstream task | Usually names a single later stage of a longer workflow; staged training is the umbrella term for the whole sequence |
In practice the distinction between curriculum learning and staged training is fuzzy. Bengio and colleagues (2009) framed curriculum learning broadly as any training strategy that orders examples from easy to hard, which can be implemented either with continuous weighting or with sharp stage boundaries.
Staged training in deep networks predates the modern era. In 2006 Hinton, Osindero, and Teh introduced the deep belief network and a fast greedy algorithm that learned the network one layer at a time, treating each layer as a restricted Boltzmann machine and using contrastive divergence to fit it before stacking the next. A separate fine-tuning pass then adjusted all the weights together using a contrastive version of the wake-sleep algorithm. Bengio and collaborators extended the idea in 2007 with greedy layerwise training of generic autoencoders, showing that unsupervised pretraining of each layer made it much easier to train deep networks that backpropagation alone could not optimize.
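The sketch below illustrates the greedy layerwise idea in the autoencoder style of Bengio and colleagues, with toy data and fully connected layers. It is a simplified illustration (the original work also included a supervised fine-tuning pass and other details), not the exact published setup.

```python
# Sketch of greedy layerwise (autoencoder-style) pretraining: each layer is trained
# to reconstruct its own input, then frozen, and its codes feed the next layer.
import torch
from torch import nn

x = torch.randn(1024, 64)      # toy unlabeled data
sizes = [64, 32, 16]           # widths of the deep network's layers
encoders = []

inputs = x
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    enc = nn.Linear(d_in, d_out)
    dec = nn.Linear(d_out, d_in)   # throwaway decoder used only for this layer's pretraining
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(200):           # train this layer as a one-layer autoencoder
        recon = dec(torch.tanh(enc(inputs)))
        loss = nn.functional.mse_loss(recon, inputs)
        opt.zero_grad(); loss.backward(); opt.step()
    encoders.append(enc)
    with torch.no_grad():          # this layer's codes become the next layer's input
        inputs = torch.tanh(enc(inputs))

# Stack the pretrained layers; a final fine-tuning stage would update them jointly.
deep_net = nn.Sequential(*[nn.Sequential(e, nn.Tanh()) for e in encoders])
```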
This layerwise pretraining era is often credited with reviving interest in deep neural networks. It was largely superseded around 2012 by better activation functions (ReLU), better weight initialization (Xavier and He), and the practical demonstration that large convolutional networks could be trained end-to-end on ImageNet. Layerwise pretraining mostly faded from vision, but the broader idea of breaking training into phases survived and re-emerged forcefully with the rise of pretraining and fine-tuning workflows in NLP after 2018.
In 2009 Bengio, Louradour, Collobert, and Weston published Curriculum Learning at ICML, formalizing the intuition that humans and animals learn better when examples are organized from simple to complex. They framed the approach as a continuation method for non-convex optimization, arguing that curricula speed convergence and improve the quality of local minima. Self-paced learning (Kumar, Packer, and Koller, 2010) made the curriculum dynamic by letting the model itself decide which examples it was ready for, ranking them by current loss rather than by an external difficulty measure.
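A rough sketch of the self-paced idea follows, on a toy regression problem: each round trains only on examples whose current loss falls below a threshold, and the threshold grows so that harder examples are admitted over time. The alternating minimization and exact schedule of Kumar and colleagues are simplified away.

```python
# Sketch of self-paced learning: train on "easy" examples first, where
# "easy" means the example's current loss is below a growing threshold.
import torch
from torch import nn

x = torch.randn(2000, 8)
y = x @ torch.randn(8, 1) + 0.1 * torch.randn(2000, 1)

model = nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
threshold = 0.5                                          # loss threshold; grows each round

for _ in range(10):
    with torch.no_grad():
        per_example = ((model(x) - y) ** 2).squeeze(1)   # current loss of every example
    easy = per_example < threshold                       # the model decides what it is ready for
    if easy.sum() == 0:                                  # nothing easy enough yet: relax and retry
        threshold *= 1.5
        continue
    for _ in range(100):
        loss = nn.functional.mse_loss(model(x[easy]), y[easy])
        opt.zero_grad(); loss.backward(); opt.step()
    threshold *= 1.5                                     # admit harder examples next round
```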
The dominant modern usage of the term comes from LLM training pipelines, which are typically organized as a strict sequence of stages. Each stage produces a checkpoint that becomes the initialization for the next.
| stage | objective | typical data | reference |
|---|---|---|---|
| Pretraining | Next-token prediction on raw text | Web crawl, books, code, math (trillions of tokens) | Brown et al. 2020 (GPT-3), Touvron et al. 2023 (Llama) |
| Continued pretraining | Same objective on a narrower corpus | Domain text such as code, biomedical, legal, or non-English | Gururangan et al. 2020 (DAPT/TAPT), CodeLlama |
| Long-context extension | Next-token prediction with longer sequences | Long documents, books, code repositories | Llama 3.1 (8K to 128K), DeepSeek V3 (4K to 32K to 128K) |
| Annealing | Final pretraining steps with high-quality data and a decaying learning rate | Curated educational text, code, math | Llama 3 (final 40M tokens), FineWeb-Edu |
| Supervised fine-tuning (SFT) | Imitate curated instruction-response pairs | 10K to 1M examples written or filtered by humans | Wei et al. 2021 (FLAN), Ouyang et al. 2022 (InstructGPT) |
| Preference alignment | Match human preference rankings | Pairwise preference data, possibly AI-generated | Christiano et al. 2017, Ouyang et al. 2022, Rafailov et al. 2023 (DPO) |
| Tool use and agentic fine-tuning | Learn to call tools, follow long traces, debug | Synthetic agent trajectories, real tool logs | Toolformer, Gorilla, more recent agent-focused tunes |
Not every model passes through every stage, and the boundaries are sometimes blurred. A modern recipe usually contains at minimum pretraining, SFT, and a preference-alignment phase.
Pretraining itself is increasingly broken into substages with different data mixes. The Llama 3 405B training paper describes three pretraining stages explicitly: an initial stage on roughly 15.6 trillion tokens with an 8K context window, a long-context stage that gradually extends the window to 128K over six steps using about 800B tokens, and an annealing stage on the final 40M tokens during which the learning rate is linearly decayed to zero and the data mix is shifted toward very high-quality sources. Annealing on small amounts of high-quality code and math data improved Llama 3 8B's performance on the GSM8K and MATH validation sets by 24.0% and 6.4% respectively.
DeepSeek V3 follows a similar staged pretraining recipe with a two-stage context-length extension from 4K to 32K to 128K. Microsoft's Phi-3 and Phi-4 push the use of curated and synthetic data further: Phi-4 includes an explicit "midtraining" stage that extends the context window from 4K to 16K and changes the data mixture to emphasize synthetic reasoning data. Across all of these recipes the pattern is the same. Generic web data trains the bulk of the parameters, then progressively cleaner and more targeted data shapes the final checkpoint.
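Mechanically, an annealing stage amounts to swapping in a curated data loader and decaying the learning rate to zero over a fixed budget. The sketch below shows that pattern under stated assumptions: a Hugging-Face-style causal LM whose forward pass returns a `.loss`, and a placeholder `curated_loader`; the numbers are illustrative, not Llama 3's.

```python
# Sketch of an annealing stage: curated data plus a learning rate decayed linearly to zero.
from itertools import cycle
import torch

def anneal(model, curated_loader, base_lr=3e-5, anneal_steps=1000):
    opt = torch.optim.AdamW(model.parameters(), lr=base_lr)
    # LR multiplier goes from 1.0 at the first annealing step to 0.0 at the last.
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lambda step: max(0.0, 1.0 - step / anneal_steps))
    batches = cycle(curated_loader)          # high-quality data replaces the generic web mix
    for _ in range(anneal_steps):
        loss = model(**next(batches)).loss   # assumes an HF-style causal LM returning .loss
        opt.zero_grad(); loss.backward(); opt.step(); sched.step()
```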
Xie and colleagues (2023) introduced DoReMi (Domain Reweighting with Minimax Optimization), which uses a small proxy model to find a better mixture of domain weights before training the full-sized model. The method improved few-shot accuracy on The Pile by 6.5 percentage points and reached the baseline accuracy in 2.6 times fewer training steps. Methods like DoReMi sit between staged training and curriculum learning: the proxy run is itself a stage that produces inputs for the main run.
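The update at the heart of DoReMi is an exponentiated-gradient step on the domain weights, driven by how much worse the small proxy model does than a reference model on each domain (the excess loss). The sketch below shows only that step, assuming per-domain losses are already computed; the smoothing term and the surrounding proxy-training loop are omitted.

```python
# Sketch of a DoReMi-style domain reweighting step (exponentiated gradient on domain weights).
import torch

def update_domain_weights(weights, proxy_loss, reference_loss, eta=1.0):
    # Excess loss: how far the proxy model still lags the reference on each domain.
    excess = torch.clamp(proxy_loss - reference_loss, min=0.0)
    # Upweight domains with large excess loss, then renormalize to a distribution.
    new_weights = weights * torch.exp(eta * excess)
    return new_weights / new_weights.sum()

weights = torch.full((5,), 1 / 5)    # start from a uniform mix over five domains
proxy_loss = torch.tensor([2.1, 1.4, 3.0, 1.1, 2.5])       # illustrative numbers
reference_loss = torch.tensor([1.8, 1.5, 2.0, 1.0, 2.4])
weights = update_domain_weights(weights, proxy_loss, reference_loss)
```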
Gururangan and colleagues showed in Don't Stop Pretraining (ACL 2020) that running additional pretraining on domain text (DAPT) and on task-specific unlabeled text (TAPT) reliably improves downstream performance. A sequential DAPT then TAPT regimen tended to give the strongest results. Code Llama applied the same principle at scale, taking a Llama 2 base model and continuing pretraining on a mixture of source code, natural language about code, and technical documentation. This produced a code-specialized base model that could then be fine-tuned for instruction following or for code infilling.
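In practice, continued pretraining is often run with off-the-shelf tooling. The sketch below shows a causal-LM variant of the idea with the Hugging Face Trainer; the checkpoint name, dataset, and hyperparameters are placeholders, not the settings used by DAPT (which fine-tuned RoBERTa with masked-language modeling) or Code Llama.

```python
# Sketch of continued pretraining on a domain corpus with the Hugging Face Trainer.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

def continue_pretraining(domain_dataset):          # pre-tokenized domain text (placeholder)
    tokenizer = AutoTokenizer.from_pretrained("base-model-checkpoint")      # placeholder name
    model = AutoModelForCausalLM.from_pretrained("base-model-checkpoint")   # start from the generic base

    args = TrainingArguments(
        output_dir="dapt-checkpoint",
        learning_rate=5e-5,                        # lower than the original pretraining peak
        warmup_steps=500,
        per_device_train_batch_size=8,
        max_steps=20_000,
        save_steps=5_000,
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=domain_dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()                                # result: a domain-specialized base model
    return model
```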
Supervised fine-tuning teaches the model to imitate curated instruction-response pairs. Datasets such as Alpaca (52K self-instruct examples generated from GPT-3.5), Vicuna (collected from real ShareGPT conversations), and OpenAssistant (crowd-sourced human dialogues) all aim to nudge a base model from raw next-token prediction toward useful, conversational behavior. SFT is typically the cheapest stage in terms of compute but one of the most sensitive to data quality; small high-quality datasets (LIMA used only 1,000 examples) often outperform much larger noisy ones.
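A defining mechanical detail of SFT is that the next-token loss is usually computed only on the response tokens, not the prompt. The sketch below shows that masking, assuming tokenized prompt and response ID lists; the -100 label is the ignore-index convention used by PyTorch's cross-entropy and the Hugging Face trainers.

```python
# Sketch of SFT example construction: supervise only the response tokens.
import torch

IGNORE_INDEX = -100   # positions with this label contribute no loss

def build_sft_example(prompt_ids, response_ids, eos_id):
    input_ids = prompt_ids + response_ids + [eos_id]
    # Prompt positions are masked out; only the response (and EOS) carry loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids + [eos_id]
    return torch.tensor(input_ids), torch.tensor(labels)

# Illustrative token IDs; any tokenizer's output would be used in practice.
input_ids, labels = build_sft_example([101, 2023, 2003], [2057, 2064], eos_id=102)
# During training the causal LM shifts labels and computes
# cross_entropy(logits, labels, ignore_index=-100), so prompt tokens add no gradient.
```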
The alignment stage shapes outputs to match human or AI preferences. Three families dominate:
| method | year | how it works |
|---|---|---|
| RLHF | Christiano 2017, Ouyang 2022 | Train a reward model on pairwise human preferences, then optimize the policy with PPO against that reward |
| DPO | Rafailov 2023 | Skip the explicit reward model and optimize a closed-form objective directly on preference pairs |
| Constitutional AI and RLAIF | Bai 2022 | Use an AI critic guided by a written constitution to generate preferences, then train as in RLHF |
Ouyang and colleagues' InstructGPT paper (2022) made the three-step recipe canonical: an SFT stage on labeler demonstrations, a reward model trained on labeler rankings, and a PPO stage that optimizes the policy against that reward. Notably the 1.3B-parameter InstructGPT model produced outputs that human raters preferred to those of the 175B GPT-3, despite having about 100 times fewer parameters. Rafailov and colleagues introduced DPO in 2023, showing that the standard RLHF objective can be solved with a simple classification loss directly on preference pairs, eliminating the explicit reward model and the on-policy sampling that PPO requires. DPO matches or exceeds PPO-based RLHF on summarization, sentiment control, and single-turn dialogue while being substantially simpler to implement. Bai and colleagues' Constitutional AI work uses a written set of principles and an AI critic to generate harmlessness preferences, reducing reliance on human labelers for that dimension.
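The DPO objective is compact enough to write out. The sketch below assumes the summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model have already been computed for each preference pair.

```python
# Sketch of the DPO loss over a batch of preference pairs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit reward of each response: beta * (log pi_theta - log pi_ref).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Classification loss: push the chosen response's implicit reward above the rejected one's.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Illustrative log-probabilities for a single pair.
loss = dpo_loss(torch.tensor([-20.0]), torch.tensor([-24.0]),
                torch.tensor([-21.0]), torch.tensor([-23.0]))
```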
More recent recipes add a further stage that teaches the model to use tools, call APIs, and operate in agentic loops. This involves training on long trajectories that include tool calls, observations, and reasoning steps. Models such as Toolformer, Gorilla, and the various function-calling tunes of Claude and GPT-4 variants are products of this stage.
Extending the context window after the bulk of pretraining is now standard. Training from scratch on very long sequences is wasteful because attention scales quadratically and most documents are short. Instead, models are trained on shorter contexts and then continued on longer ones. The Llama 3.1 paper extended the context from 8K to 128K in six progressively longer stages using about 800 billion tokens. Methods such as YaRN (Peng et al., 2023) and LongRoPE (Ding et al., 2024) modify the rotary positional embedding so that the longer-context fine-tuning stage converges in far fewer tokens than naive extension would require. YaRN reports needing roughly 10 times fewer tokens and 2.5 times fewer training steps than earlier RoPE extension methods.
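As the simplest illustration of where the context-extension change lives, the sketch below implements plain linear position interpolation rather than YaRN's frequency-dependent scheme: the positions fed to the rotary embedding are compressed by the ratio of the old to the new context length before the long-context fine-tuning stage runs.

```python
# Sketch of linear position interpolation for RoPE context extension
# (the simplest variant, not YaRN or LongRoPE).
import torch

def rope_angles(seq_len, dim, base=10000.0, scale=1.0):
    # Standard RoPE frequencies, one per pair of embedding dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # Compress positions so the extended window maps back into the trained range.
    positions = torch.arange(seq_len).float() * scale
    return torch.outer(positions, inv_freq)          # (seq_len, dim/2) rotation angles

old_ctx, new_ctx = 8192, 131072
angles = rope_angles(new_ctx, dim=128, scale=old_ctx / new_ctx)
# The long-context fine-tuning stage then trains on long sequences using these scaled angles.
```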
Staged training is not unique to LLMs. Several other areas of machine learning use distinct sequential phases.
| domain | stages |
|---|---|
| Object detection | Faster R-CNN trains a Region Proposal Network (RPN) and a Fast R-CNN detection head, originally with a four-step scheme that alternates updates between the two |
| Information retrieval | Two-stage pipelines first generate a large set of candidates with a fast retriever (e.g. BM25 or a bi-encoder) then rerank them with a heavier cross-encoder |
| Diffusion model distillation | Train a teacher diffusion model in the usual way, then distill its many-step sampling process into a fast student via progressive distillation (Salimans and Ho, 2022) or consistency distillation (Song et al., 2023) |
| Self-supervised vision | Pretrain a backbone with a self-supervised objective such as MAE, DINO, or SimCLR, then fine-tune on a labeled task |
| Speech recognition | Self-supervised acoustic pretraining (e.g. wav2vec 2.0), then supervised fine-tuning on transcribed speech, often with language-model rescoring at decode time |
| Reinforcement learning | Behavior cloning on demonstrations, then RL fine-tuning with a learned or environmental reward |
Faster R-CNN's two-stage detector remains a useful reference: a Region Proposal Network produces about 300 candidate boxes per image, and a separate head then classifies and regresses each proposal. The two networks share convolutional features but are trained with different losses, and the original paper used a four-step alternating optimization that toggled between updating the RPN and the detection head.
Progressive distillation in diffusion models is another clear case. The teacher is a slow many-step sampler; the student is trained to match two consecutive teacher steps in a single step, halving the required steps. Repeating this log2(N) times collapses an N-step sampler down to one or two steps. Consistency distillation later achieved a one-step FID of 2.82 on CIFAR-10, outperforming progressive distillation, by directly training the student to be self-consistent across the diffusion trajectory.
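The sketch below shows one progressive-distillation training target in the style of Salimans and Ho, assuming x-prediction models and an alpha/sigma noise schedule passed in as functions; timestep sampling, loss weighting, and the outer halving loop are omitted.

```python
# Sketch of a progressive-distillation target: the student's one-step prediction
# is regressed onto the result of two consecutive teacher DDIM steps.
import torch

def ddim_step(z_t, x_pred, alpha_t, sigma_t, alpha_s, sigma_s):
    # Deterministic DDIM step from time t to time s, given a prediction of x.
    eps_pred = (z_t - alpha_t * x_pred) / sigma_t
    return alpha_s * x_pred + sigma_s * eps_pred

def student_target(teacher, z_t, t, t_mid, t_end, schedule):
    # Two consecutive teacher steps: t -> t_mid -> t_end.
    a_t, s_t = schedule(t); a_m, s_m = schedule(t_mid); a_e, s_e = schedule(t_end)
    z_mid = ddim_step(z_t, teacher(z_t, t), a_t, s_t, a_m, s_m)
    z_end = ddim_step(z_mid, teacher(z_mid, t_mid), a_m, s_m, a_e, s_e)
    # The x-target that makes a single student DDIM step from z_t land exactly at z_end.
    return (z_end - (s_e / s_t) * z_t) / (a_e - (s_e / s_t) * a_t)

# The student is then trained with, e.g., ||student(z_t, t) - student_target(...)||^2.
```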
Why staged training works as well as it does has no single proven explanation, but several mechanisms plausibly contribute.
First, the curriculum effect. Easier early objectives put parameters in a region of weight space from which harder objectives are easier to optimize. Bengio and colleagues argued in their 2009 paper that this is essentially a continuation method: solve a smoothed problem first, then anneal toward the harder one.
Second, data quality matters more in late stages. Web-scale pretraining tolerates noisy, repetitive, or low-quality text because the model needs broad coverage. Late-stage data, especially during annealing or fine-tuning, has an outsized effect on the final checkpoint and is therefore worth curating heavily. The annealing results from Llama 3 (24% gain on GSM8K from a few tens of millions of high-quality tokens) illustrate this dramatically.
Third, different phases benefit from different hyperparameters. Pretraining typically uses a high learning rate and a long warmup; SFT runs at a much lower learning rate for far fewer steps; preference alignment uses smaller batches and tight KL constraints to keep the policy close to the reference model. Folding all of this into one optimization run would force compromises.
Fourth, sequencing objectives avoids the compromises of a single mixed objective. RLHF rewards, next-token cross-entropy, and instruction-imitation losses are not on the same scale and have different gradient-noise properties, so training them in sequence rather than as a weighted sum lets each be optimized fully before the next begins.
A few practical points come up repeatedly in published recipes and in implementation discussions.
Save a checkpoint at the end of every stage. Later stages often regress on some metrics relative to the previous stage (alignment tax is the canonical example: SFT and RLHF can degrade base-model knowledge and calibration), so the previous checkpoint is the natural baseline.
Lower the learning rate for later stages. Pretraining typically uses peak rates around 1e-4 to 3e-4 for large models; SFT runs at 1e-5 to 1e-6; preference alignment runs lower still. Re-using the pretraining schedule for fine-tuning is a common bug that destroys learned representations.
Validate at the end of each stage on the metrics that matter for that stage. Pretraining is usually checked with held-out perplexity and a small set of zero-shot benchmarks; SFT with instruction-following evaluations like AlpacaEval or MT-Bench; alignment with preference-rate evaluations and safety probes.
Watch for stage-boundary loss spikes. The first few hundred steps of a new stage often show sharp loss jumps as the model adjusts to a new data distribution or objective. Short warmups and slightly lower initial learning rates help.
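A minimal way to implement this mitigation is a short linear warmup at the start of the new stage on top of its (lower) learning rate; the values below are illustrative.

```python
# Sketch of a short warmup at a stage boundary to soften the distribution shift.
import torch
from torch import nn

model = nn.Linear(16, 16)                                     # placeholder for the previous stage's checkpoint
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)    # new stage runs at a lower peak LR

warmup_steps = 300
sched = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))  # LR multiplier ramps 0 -> 1

# Inside the new stage's loop, call optimizer.step() then sched.step() each iteration,
# so the first few hundred steps run at a fraction of the stage's target learning rate.
```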
Staged training has real downsides. The most obvious is engineering complexity: each stage needs its own data pipeline, hyperparameter sweep, evaluation harness, and checkpoint management. A four-stage recipe can have ten times the moving parts of a single-stage one.
The hyperparameter space also explodes. Learning rate, batch size, warmup, weight decay, optimizer state, and data mix are usually tuned per stage. Decisions interact across stages (a longer SFT stage may demand a tighter KL during alignment, for example), and ablating these interactions on full-scale runs is expensive.
Catastrophic forgetting is another risk. Each stage can erode capabilities learned in earlier stages, which is why preference alignment typically anchors against a reference model with a KL penalty and why long-context extension is often kept short. The literature on continual learning addresses this directly, but it remains an active concern in production LLM training.
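The anchoring itself usually takes the form of a per-token KL penalty between the policy being trained and the frozen previous-stage checkpoint. A sketch of that term follows, assuming per-token log-probabilities of the sampled tokens are available from both models; this is the common sampled-token approximation of the KL, not an exact computation.

```python
# Sketch of KL anchoring against a frozen reference model during alignment.
import torch

def kl_penalty(policy_logprobs, ref_logprobs, beta=0.05):
    # Per-token log-ratio, summed over the sequence and averaged over the batch;
    # it grows as the policy drifts away from the reference checkpoint.
    kl = policy_logprobs - ref_logprobs
    return beta * kl.sum(dim=-1).mean()

# In an RLHF-style objective this penalty is subtracted from the reward (or added to the loss),
# keeping the aligned model close to the earlier-stage checkpoint.
```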
Finally, stage boundaries can introduce loss spikes and instability. Switching the data distribution or objective abruptly forces the optimizer to adapt, and large transformer models occasionally diverge during these transitions. Practitioners mitigate this with short warmups, gradient clipping, and conservative learning-rate schedules across boundaries.