Staged training (also called multi-stage training or phased training) is a strategy in machine learning where a model is trained in two or more distinct sequential phases, each with its own objective, data mix, optimizer settings, or loss function. Outputs of an earlier stage (typically learned weights) become the starting point for the next, so the model gradually acquires capabilities rather than learning everything at once. Staged training underlies most modern large language model recipes, where a single model passes through pretraining, continued pretraining, supervised fine-tuning, and preference alignment before deployment.
The approach is also widely used outside language modeling: two-stage object detectors split detection into proposal generation and classification, retrieval systems split candidate generation from reranking, and diffusion model distillation trains a slow teacher and then a fast student. The common thread is that breaking training into phases lets practitioners pick the data, hyperparameters, and loss best suited to each phase rather than averaging them across one monolithic run.
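To make the definition concrete, the sketch below shows the control flow that most staged recipes share: each stage carries its own data, learning rate, step budget, and loss, and each stage starts from the weights the previous stage produced. It is a minimal PyTorch sketch with a toy model and placeholder data, not any particular published recipe.

```python
# Minimal sketch of a staged training loop (PyTorch). Model, data, and losses are placeholders.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(16, 4)  # stand-in for a real network

# Each stage bundles its own data, learning rate, step budget, and loss.
stages = [
    {"name": "pretrain", "lr": 3e-4, "steps": 200, "loss_fn": nn.CrossEntropyLoss(),
     "data": TensorDataset(torch.randn(512, 16), torch.randint(0, 4, (512,)))},
    {"name": "finetune", "lr": 1e-5, "steps": 50, "loss_fn": nn.CrossEntropyLoss(),
     "data": TensorDataset(torch.randn(128, 16), torch.randint(0, 4, (128,)))},
]

for stage in stages:
    # A fresh optimizer per stage: later stages typically run at much lower learning rates.
    optimizer = torch.optim.AdamW(model.parameters(), lr=stage["lr"])
    loader = DataLoader(stage["data"], batch_size=32, shuffle=True)
    step = 0
    while step < stage["steps"]:
        for x, y in loader:
            loss = stage["loss_fn"](model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= stage["steps"]:
                break
    # The end-of-stage checkpoint becomes the next stage's starting point.
    torch.save(model.state_dict(), f"{stage['name']}.pt")
```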
Staged training overlaps with several adjacent ideas, but the differences matter when reading papers.
| concept | what it does | how it differs from staged training |
|---|---|---|
| Curriculum learning | Presents examples in easy-to-hard order during a single training run | A subset of staged training when the curriculum is implemented as discrete phases; otherwise it varies sample weights continuously |
| Transfer learning | Reuses a model trained on one task as the starting point for another | Always staged in spirit (pretrain then transfer), but the term is broader and includes reuse of frozen features |
| Multi-task learning | Trains on multiple objectives simultaneously | Objectives are mixed in every batch rather than executed in sequence |
| Continual learning | Trains on a stream of tasks while avoiding catastrophic forgetting | Concerned with retaining old skills as new ones arrive, not with reaching a final fixed model |
| Fine-tuning | Updates a pretrained model on a downstream task | Usually names a single later stage of a longer workflow; staged training is the umbrella term for the whole sequence |
In practice the distinction between curriculum learning and staged training is fuzzy. Bengio and colleagues (2009) framed curriculum learning broadly as any training strategy that orders examples from easy to hard, which can be implemented either with continuous weighting or with sharp stage boundaries.
Staged training in deep networks predates the modern era. In 2006 Hinton, Osindero, and Teh introduced the deep belief network and a fast greedy algorithm that learned the network one layer at a time, treating each layer as a restricted Boltzmann machine and using contrastive divergence to fit it before stacking the next. A separate fine-tuning pass then adjusted all the weights together using a contrastive version of the wake-sleep algorithm. Bengio and collaborators extended the idea in 2007 with greedy layerwise training of generic autoencoders, showing that unsupervised pretraining of each layer made it much easier to train deep networks that backpropagation alone could not optimize.
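The sketch below illustrates the greedy layerwise idea in the autoencoder style of Bengio and colleagues, with toy data and fully connected layers. It is a simplified illustration (the original work also included a supervised fine-tuning pass and other details), not the exact published setup.

```python
# Sketch of greedy layerwise (autoencoder-style) pretraining: each layer is trained
# to reconstruct its own input, then frozen, and its codes feed the next layer.
import torch
from torch import nn

x = torch.randn(1024, 64)      # toy unlabeled data
sizes = [64, 32, 16]           # widths of the deep network's layers
encoders = []

inputs = x
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    enc = nn.Linear(d_in, d_out)
    dec = nn.Linear(d_out, d_in)   # throwaway decoder used only for this layer's pretraining
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(200):           # train this layer as a one-layer autoencoder
        recon = dec(torch.tanh(enc(inputs)))
        loss = nn.functional.mse_loss(recon, inputs)
        opt.zero_grad(); loss.backward(); opt.step()
    encoders.append(enc)
    with torch.no_grad():          # this layer's codes become the next layer's input
        inputs = torch.tanh(enc(inputs))

# Stack the pretrained layers; a final fine-tuning stage would update them jointly.
deep_net = nn.Sequential(*[nn.Sequential(e, nn.Tanh()) for e in encoders])
```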
This layerwise pretraining era is often credited with reviving interest in deep neural networks. It was largely superseded around 2012 by better activation functions (ReLU), better weight initialization (Xavier and He), and the practical demonstration that large convolutional networks could be trained end-to-end on ImageNet. Layerwise pretraining mostly faded from vision, but the broader idea of breaking training into phases survived and re-emerged forcefully with the rise of pretraining and fine-tuning workflows in NLP after 2018.
In 2009 Bengio, Louradour, Collobert, and Weston published Curriculum Learning at ICML, formalizing the intuition that humans and animals learn better when examples are organized from simple to complex. They framed the approach as a continuation method for non-convex optimization, arguing that curricula speed convergence and improve the quality of local minima. Self-paced learning (Kumar, Packer, and Koller, 2010) made the curriculum dynamic by letting the model itself decide which examples it was ready for, ranking them by current loss rather than by an external difficulty measure.
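A rough sketch of the self-paced idea follows, on a toy regression problem: each round trains only on examples whose current loss falls below a threshold, and the threshold grows so that harder examples are admitted over time. The alternating minimization and exact schedule of Kumar and colleagues are simplified away.

```python
# Sketch of self-paced learning: train on "easy" examples first, where
# "easy" means the example's current loss is below a growing threshold.
import torch
from torch import nn

x = torch.randn(2000, 8)
y = x @ torch.randn(8, 1) + 0.1 * torch.randn(2000, 1)

model = nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
threshold = 0.5                                          # loss threshold; grows each round

for _ in range(10):
    with torch.no_grad():
        per_example = ((model(x) - y) ** 2).squeeze(1)   # current loss of every example
    easy = per_example < threshold                       # the model decides what it is ready for
    if easy.sum() == 0:                                  # nothing easy enough yet: relax and retry
        threshold *= 1.5
        continue
    for _ in range(100):
        loss = nn.functional.mse_loss(model(x[easy]), y[easy])
        opt.zero_grad(); loss.backward(); opt.step()
    threshold *= 1.5                                     # admit harder examples next round
```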
The dominant modern usage of the term comes from LLM training pipelines, which are typically organized as a strict sequence of stages. Each stage produces a checkpoint that becomes the initialization for the next.
| stage | objective | typical data | reference |
|---|---|---|---|
| Pretraining | Next-token prediction on raw text | Web crawl, books, code, math (trillions of tokens) | Brown et al. 2020 (GPT-3), Touvron et al. 2023 (Llama) |
| Continued pretraining | Same objective on a narrower corpus | Domain text such as code, biomedical, legal, or non-English | Gururangan et al. 2020 (DAPT/TAPT), CodeLlama |
| Long-context extension | Next-token prediction with longer sequences | Long documents, books, code repositories | Llama 3.1 (8K to 128K), DeepSeek V3 (4K to 32K to 128K) |
| Annealing | Final pretraining steps with high-quality data and a decaying learning rate | Curated educational text, code, math | Llama 3 (final 40M tokens), FineWeb-Edu |
| Supervised fine-tuning (SFT) | Imitate curated instruction-response pairs | 10K to 1M examples written or filtered by humans | Wei et al. 2021 (FLAN), Ouyang et al. 2022 (InstructGPT) |
| Preference alignment | Match human preference rankings | Pairwise preference data, possibly AI-generated | Christiano et al. 2017, Ouyang et al. 2022, Rafailov et al. 2023 (DPO) |
| Tool use and agentic fine-tuning | Learn to call tools, follow long traces, debug | Synthetic agent trajectories, real tool logs | Toolformer, Gorilla, more recent agent-focused tunes |
Not every model passes through every stage, and the boundaries are sometimes blurred. A modern recipe usually contains at minimum pretraining, SFT, and a preference-alignment phase.
Pretraining itself is increasingly broken into substages with different data mixes. The Llama 3 405B training paper describes three pretraining stages explicitly: an initial stage on roughly 15.6 trillion tokens with an 8K context window, a long-context stage that gradually extends the window to 128K over six steps using about 800B tokens, and an annealing stage on the final 40M tokens during which the learning rate is linearly decayed to zero and the data mix is shifted toward very high-quality sources. Annealing on small amounts of high-quality code and math data improved Llama 3 8B's performance on the GSM8K and MATH validation sets by 24.0% and 6.4% respectively.
DeepSeek V3 follows a similar staged pretraining recipe with a two-stage context-length extension from 4K to 32K to 128K. Microsoft's Phi-3 and Phi-4 push the use of curated and synthetic data further: Phi-4 includes an explicit "midtraining" stage that extends the context window from 4K to 16K and changes the data mixture to emphasize synthetic reasoning data. Across all of these recipes the pattern is the same. Generic web data trains the bulk of the parameters, then progressively cleaner and more targeted data shapes the final checkpoint.
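Mechanically, an annealing stage amounts to swapping in a curated data loader and decaying the learning rate to zero over a fixed budget. The sketch below shows that pattern under stated assumptions: a Hugging-Face-style causal LM whose forward pass returns a `.loss`, and a placeholder `curated_loader`; the numbers are illustrative, not Llama 3's.

```python
# Sketch of an annealing stage: curated data plus a learning rate decayed linearly to zero.
from itertools import cycle
import torch

def anneal(model, curated_loader, base_lr=3e-5, anneal_steps=1000):
    opt = torch.optim.AdamW(model.parameters(), lr=base_lr)
    # LR multiplier goes from 1.0 at the first annealing step to 0.0 at the last.
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lambda step: max(0.0, 1.0 - step / anneal_steps))
    batches = cycle(curated_loader)          # high-quality data replaces the generic web mix
    for _ in range(anneal_steps):
        loss = model(**next(batches)).loss   # assumes an HF-style causal LM returning .loss
        opt.zero_grad(); loss.backward(); opt.step(); sched.step()
```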
Xie and colleagues (2023) introduced DoReMi (Domain Reweighting with Minimax Optimization), which uses a small proxy model to find a better mixture of domain weights before training the full-sized model. The method improved few-shot accuracy on The Pile by 6.5 percentage points and reached the baseline accuracy in 2.6 times fewer training steps. Methods like DoReMi sit between staged training and curriculum learning: the proxy run is itself a stage that produces inputs for the main run.
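The update at the heart of DoReMi is an exponentiated-gradient step on the domain weights, driven by how much worse the small proxy model does than a reference model on each domain (the excess loss). The sketch below shows only that step, assuming per-domain losses are already computed; the smoothing term and the surrounding proxy-training loop are omitted.

```python
# Sketch of a DoReMi-style domain reweighting step (exponentiated gradient on domain weights).
import torch

def update_domain_weights(weights, proxy_loss, reference_loss, eta=1.0):
    # Excess loss: how far the proxy model still lags the reference on each domain.
    excess = torch.clamp(proxy_loss - reference_loss, min=0.0)
    # Upweight domains with large excess loss, then renormalize to a distribution.
    new_weights = weights * torch.exp(eta * excess)
    return new_weights / new_weights.sum()

weights = torch.full((5,), 1 / 5)    # start from a uniform mix over five domains
proxy_loss = torch.tensor([2.1, 1.4, 3.0, 1.1, 2.5])       # illustrative numbers
reference_loss = torch.tensor([1.8, 1.5, 2.0, 1.0, 2.4])
weights = update_domain_weights(weights, proxy_loss, reference_loss)
```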
Gururangan and colleagues showed in Don't Stop Pretraining (ACL 2020) that running additional pretraining on domain text (DAPT) and on task-specific unlabeled text (TAPT) reliably improves downstream performance. A sequential DAPT then TAPT regimen tended to give the strongest results. Code Llama applied the same principle at scale, taking a Llama 2 base model and continuing pretraining on a mixture of source code, natural language about code, and technical documentation. This produced a code-specialized base model that could then be fine-tuned for instruction following or for code infilling.
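In practice, continued pretraining is often run with off-the-shelf tooling. The sketch below shows a causal-LM variant of the idea with the Hugging Face Trainer; the checkpoint name, dataset, and hyperparameters are placeholders, not the settings used by DAPT (which fine-tuned RoBERTa with masked-language modeling) or Code Llama.

```python
# Sketch of continued pretraining on a domain corpus with the Hugging Face Trainer.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

def continue_pretraining(domain_dataset):          # pre-tokenized domain text (placeholder)
    tokenizer = AutoTokenizer.from_pretrained("base-model-checkpoint")      # placeholder name
    model = AutoModelForCausalLM.from_pretrained("base-model-checkpoint")   # start from the generic base

    args = TrainingArguments(
        output_dir="dapt-checkpoint",
        learning_rate=5e-5,                        # lower than the original pretraining peak
        warmup_steps=500,
        per_device_train_batch_size=8,
        max_steps=20_000,
        save_steps=5_000,
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=domain_dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()                                # result: a domain-specialized base model
    return model
```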
Supervised fine-tuning teaches the model to imitate curated instruction-response pairs. Datasets such as Alpaca (52K self-instruct examples generated from GPT-3.5), Vicuna (collected from real ShareGPT conversations), and OpenAssistant (crowd-sourced human dialogues) all aim to nudge a base model from raw next-token prediction toward useful, conversational behavior. SFT is typically the cheapest stage in terms of compute but one of the most sensitive to data quality; small high-quality datasets (LIMA used only 1,000 examples) often outperform much larger noisy ones.
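A defining mechanical detail of SFT is that the next-token loss is usually computed only on the response tokens, not the prompt. The sketch below shows that masking, assuming tokenized prompt and response ID lists; the -100 label is the ignore-index convention used by PyTorch's cross-entropy and the Hugging Face trainers.

```python
# Sketch of SFT example construction: supervise only the response tokens.
import torch

IGNORE_INDEX = -100   # positions with this label contribute no loss

def build_sft_example(prompt_ids, response_ids, eos_id):
    input_ids = prompt_ids + response_ids + [eos_id]
    # Prompt positions are masked out; only the response (and EOS) carry loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids + [eos_id]
    return torch.tensor(input_ids), torch.tensor(labels)

# Illustrative token IDs; any tokenizer's output would be used in practice.
input_ids, labels = build_sft_example([101, 2023, 2003], [2057, 2064], eos_id=102)
# During training the causal LM shifts labels and computes
# cross_entropy(logits, labels, ignore_index=-100), so prompt tokens add no gradient.
```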
The alignment stage shapes outputs to match human or AI preferences. Three families dominate:
| method | year | how it works |
|---|---|---|
| RLHF | Christiano 2017, Ouyang 2022 | Train a reward model on pairwise human preferences, then optimize the policy with PPO against that reward |
| DPO | Rafailov 2023 | Skip the explicit reward model and optimize a closed-form objective directly on preference pairs |
| Constitutional AI and RLAIF | Bai 2022 | Use an AI critic guided by a written constitution to generate preferences, then train as in RLHF |
Ouyang and colleagues' InstructGPT paper (2022) made the three-step recipe canonical: an SFT stage on labeler demonstrations, a reward model trained on labeler rankings, and a PPO stage that optimizes the policy against that reward. Notably the 1.3B-parameter InstructGPT model produced outputs that human raters preferred to those of the 175B GPT-3, despite having about 100 times fewer parameters. Rafailov and colleagues introduced DPO in 2023, showing that the standard RLHF objective can be solved with a simple classification loss directly on preference pairs, eliminating the explicit reward model and the on-policy sampling that PPO requires. DPO matches or exceeds PPO-based RLHF on summarization, sentiment control, and single-turn dialogue while being substantially simpler to implement. Bai and colleagues' Constitutional AI work uses a written set of principles and an AI critic to generate harmlessness preferences, reducing reliance on human labelers for that dimension.
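The DPO objective is compact enough to write out. The sketch below assumes the summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model have already been computed for each preference pair.

```python
# Sketch of the DPO loss over a batch of preference pairs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit reward of each response: beta * (log pi_theta - log pi_ref).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Classification loss: push the chosen response's implicit reward above the rejected one's.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Illustrative log-probabilities for a single pair.
loss = dpo_loss(torch.tensor([-20.0]), torch.tensor([-24.0]),
                torch.tensor([-21.0]), torch.tensor([-23.0]))
```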
More recent recipes add a further stage that teaches the model to use tools, call APIs, and operate in agentic loops. This involves training on long trajectories that include tool calls, observations, and reasoning steps. Models such as Toolformer, Gorilla, and the various function-calling tunes of Claude and GPT-4 variants are products of this stage.
Extending the context window after the bulk of pretraining is now standard. Training from scratch on very long sequences is wasteful because attention scales quadratically and most documents are short. Instead, models are trained on shorter contexts and then continued on longer ones. The Llama 3.1 paper extended the context from 8K to 128K in six progressively longer stages using about 800 billion tokens. Methods such as YaRN (Peng et al., 2023) and LongRoPE (Ding et al., 2024) modify the rotary positional embedding so that the longer-context fine-tuning stage converges in far fewer tokens than naive extension would require. YaRN reports needing roughly 10 times fewer tokens and 2.5 times fewer training steps than earlier RoPE extension methods.
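As the simplest illustration of where the context-extension change lives, the sketch below implements plain linear position interpolation rather than YaRN's frequency-dependent scheme: the positions fed to the rotary embedding are compressed by the ratio of the old to the new context length before the long-context fine-tuning stage runs.

```python
# Sketch of linear position interpolation for RoPE context extension
# (the simplest variant, not YaRN or LongRoPE).
import torch

def rope_angles(seq_len, dim, base=10000.0, scale=1.0):
    # Standard RoPE frequencies, one per pair of embedding dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # Compress positions so the extended window maps back into the trained range.
    positions = torch.arange(seq_len).float() * scale
    return torch.outer(positions, inv_freq)          # (seq_len, dim/2) rotation angles

old_ctx, new_ctx = 8192, 131072
angles = rope_angles(new_ctx, dim=128, scale=old_ctx / new_ctx)
# The long-context fine-tuning stage then trains on long sequences using these scaled angles.
```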
Staged training is not unique to LLMs. Several other areas of machine learning use distinct sequential phases.
| domain | stages |
|---|---|
| Object detection | Faster R-CNN trains a Region Proposal Network (RPN) and a Fast R-CNN detection head, originally with a four-step scheme that alternates updates between the two |
| Information retrieval | Two-stage pipelines first generate a large set of candidates with a fast retriever (e.g. BM25 or a bi-encoder) then rerank them with a heavier cross-encoder |
| Diffusion model distillation | Train a teacher diffusion model in the usual way, then distill its many-step sampling process into a fast student via progressive distillation (Salimans and Ho, 2022) or consistency distillation (Song et al., 2023) |
| Self-supervised vision | Pretrain a backbone with a self-supervised objective such as MAE, DINO, or SimCLR, then fine-tune on a labeled task |
| Speech recognition | Self-supervised acoustic pretraining (e.g. wav2vec 2.0), then supervised fine-tuning on transcribed speech, often with language-model rescoring at decode time |
| Reinforcement learning | Behavior cloning on demonstrations, then RL fine-tuning with a learned or environmental reward |
Faster R-CNN's two-stage detector remains a useful reference: a Region Proposal Network produces about 300 candidate boxes per image, and a separate head then classifies and regresses each proposal. The two networks share convolutional features but are trained with different losses, and the original paper used a four-step alternating optimization that toggled between updating the RPN and the detection head.
Progressive distillation in diffusion models is another clear case. The teacher is a slow many-step sampler; the student is trained to match two consecutive teacher steps in a single step, halving the required steps. Repeating this log2(N) times collapses an N-step sampler down to one or two steps. Consistency distillation later achieved a one-step FID of 2.82 on CIFAR-10, outperforming progressive distillation, by directly training the student to be self-consistent across the diffusion trajectory.
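The sketch below shows one progressive-distillation training target in the style of Salimans and Ho, assuming x-prediction models and an alpha/sigma noise schedule passed in as functions; timestep sampling, loss weighting, and the outer halving loop are omitted.

```python
# Sketch of a progressive-distillation target: the student's one-step prediction
# is regressed onto the result of two consecutive teacher DDIM steps.
import torch

def ddim_step(z_t, x_pred, alpha_t, sigma_t, alpha_s, sigma_s):
    # Deterministic DDIM step from time t to time s, given a prediction of x.
    eps_pred = (z_t - alpha_t * x_pred) / sigma_t
    return alpha_s * x_pred + sigma_s * eps_pred

def student_target(teacher, z_t, t, t_mid, t_end, schedule):
    # Two consecutive teacher steps: t -> t_mid -> t_end.
    a_t, s_t = schedule(t); a_m, s_m = schedule(t_mid); a_e, s_e = schedule(t_end)
    z_mid = ddim_step(z_t, teacher(z_t, t), a_t, s_t, a_m, s_m)
    z_end = ddim_step(z_mid, teacher(z_mid, t_mid), a_m, s_m, a_e, s_e)
    # The x-target that makes a single student DDIM step from z_t land exactly at z_end.
    return (z_end - (s_e / s_t) * z_t) / (a_e - (s_e / s_t) * a_t)

# The student is then trained with, e.g., ||student(z_t, t) - student_target(...)||^2.
```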
Why staged training works as well as it does has no single proven explanation, but several mechanisms plausibly contribute.
First, the curriculum effect. Easier early objectives put parameters in a region of weight space from which harder objectives are easier to optimize. Bengio and colleagues argued in their 2009 paper that this is essentially a continuation method: solve a smoothed problem first, then anneal toward the harder one.
Second, data quality matters more in late stages. Web-scale pretraining tolerates noisy, repetitive, or low-quality text because the model needs broad coverage. Late-stage data, especially during annealing or fine-tuning, has an outsized effect on the final checkpoint and is therefore worth curating heavily. The annealing results from Llama 3 (24% gain on GSM8K from a few tens of millions of high-quality tokens) illustrate this dramatically.
Third, different phases benefit from different hyperparameters. Pretraining typically uses a high learning rate and a long warmup; SFT runs at a much lower learning rate for far fewer steps; preference alignment uses smaller batches and tight KL constraints to keep the policy close to the reference model. Folding all of this into one optimization run would force compromises.
Fourth, sequencing objectives avoids the compromises of a single mixed objective. RLHF rewards, next-token cross-entropy, and instruction-imitation losses are not on the same scale and have different gradient-noise properties, so training them in sequence rather than as a weighted sum lets each be optimized fully before the next begins.
A few practical points come up repeatedly in published recipes and in implementation discussions.
Save a checkpoint at the end of every stage. Later stages often regress on some metrics relative to the previous stage (alignment tax is the canonical example: SFT and RLHF can degrade base-model knowledge and calibration), so the previous checkpoint is the natural baseline.
Lower the learning rate for later stages. Pretraining typically uses peak rates around 1e-4 to 3e-4 for large models; SFT runs at 1e-5 to 1e-6; preference alignment runs lower still. Re-using the pretraining schedule for fine-tuning is a common bug that destroys learned representations.
Validate at the end of each stage on the metrics that matter for that stage. Pretraining is usually checked with held-out perplexity and a small set of zero-shot benchmarks; SFT with instruction-following evaluations like AlpacaEval or MT-Bench; alignment with preference-rate evaluations and safety probes.
Watch for stage-boundary loss spikes. The first few hundred steps of a new stage often show sharp loss jumps as the model adjusts to a new data distribution or objective. Short warmups and slightly lower initial learning rates help.
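A minimal way to implement this mitigation is a short linear warmup at the start of the new stage on top of its (lower) learning rate; the values below are illustrative.

```python
# Sketch of a short warmup at a stage boundary to soften the distribution shift.
import torch
from torch import nn

model = nn.Linear(16, 16)                                     # placeholder for the previous stage's checkpoint
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)    # new stage runs at a lower peak LR

warmup_steps = 300
sched = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))  # LR multiplier ramps 0 -> 1

# Inside the new stage's loop, call optimizer.step() then sched.step() each iteration,
# so the first few hundred steps run at a fraction of the stage's target learning rate.
```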
Staged training has real downsides. The most obvious is engineering complexity: each stage needs its own data pipeline, hyperparameter sweep, evaluation harness, and checkpoint management. A four-stage recipe can have ten times the moving parts of a single-stage one.
The hyperparameter space also explodes. Learning rate, batch size, warmup, weight decay, optimizer state, and data mix are usually tuned per stage. Decisions interact across stages (a longer SFT stage may demand a tighter KL during alignment, for example), and ablating these interactions on full-scale runs is expensive.
Catastrophic forgetting is another risk. Each stage can erode capabilities learned in earlier stages, which is why preference alignment typically anchors against a reference model with a KL penalty and why long-context extension is often kept short. The literature on continual learning addresses this directly, but it remains an active concern in production LLM training.
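The anchoring itself usually takes the form of a per-token KL penalty between the policy being trained and the frozen previous-stage checkpoint. A sketch of that term follows, assuming per-token log-probabilities of the sampled tokens are available from both models; this is the common sampled-token approximation of the KL, not an exact computation.

```python
# Sketch of KL anchoring against a frozen reference model during alignment.
import torch

def kl_penalty(policy_logprobs, ref_logprobs, beta=0.05):
    # Per-token log-ratio, summed over the sequence and averaged over the batch;
    # it grows as the policy drifts away from the reference checkpoint.
    kl = policy_logprobs - ref_logprobs
    return beta * kl.sum(dim=-1).mean()

# In an RLHF-style objective this penalty is subtracted from the reward (or added to the loss),
# keeping the aligned model close to the earlier-stage checkpoint.
```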
Finally, stage boundaries can introduce loss spikes and instability. Switching the data distribution or objective abruptly forces the optimizer to adapt, and large transformer models occasionally diverge during these transitions. Practitioners mitigate this with short warmups, gradient clipping, and conservative learning-rate schedules across boundaries.