Instruction Tuning
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v6 · 4,174 words
Instruction tuning is the post-pretraining training stage in which a large language model (LLM) is taught to follow natural-language instructions by fine-tuning it on a curated collection of (instruction, response) pairs.[1] The technique transforms a raw, next-token prediction model (which by default behaves as a pattern-completing text continuation engine) into a model that interprets a user's request as a task to be executed and produces a helpful answer.
The term and the modern practice were popularized by Wei et al.'s 2021 paper "Finetuned Language Models Are Zero-Shot Learners," which introduced FLAN and showed that fine-tuning a 137B-parameter LaMDA model on more than 60 NLP tasks reformulated as natural-language instructions made it outperform zero-shot GPT-3 on 20 of 25 evaluation benchmarks.[2] Concurrent work on T0 by Sanh et al. at the BigScience Workshop reached the same conclusion using a different model family.[3] In March 2022, OpenAI's InstructGPT combined instruction-style supervised fine-tuning (SFT) with reinforcement learning from human feedback (RLHF), establishing the SFT-then-preference-learning recipe that became the canonical post-training template for ChatGPT, Claude, Llama Chat, Gemma, and most other contemporary chat assistants.[4]
Instruction tuning is the broad concept of teaching a model to follow instructions; SFT is the mechanism (next-token cross-entropy on the response portion of an instruction-formatted prompt); RLHF and DPO are subsequent preference-learning stages that are normally layered on top. Together they form what is now usually called the post-training pipeline. By 2026, instruction tuning is a routine, well-understood step in virtually every production-grade chat model, with mature open-source datasets, recipes (e.g. Llama 3.1, Tülu 3), evaluation harnesses (MT-Bench, AlpacaEval, IFEval, Arena-Hard), and parameter-efficient variants such as LoRA and QLoRA.
Imagine a parrot that has read every book in the library. It knows tons of words and facts, but if you ask it "Please summarize this story," it might just keep talking about random things instead of answering. The parrot knows the information; it just does not understand what you want it to do.
Instruction tuning is showing the parrot lots of examples: "When someone says 'summarize this,' here is what a good summary looks like. When someone says 'translate this,' here is what a translation looks like." After enough examples, the parrot learns the pattern of being asked to do something and then doing it. Now when you ask it to write a poem about the ocean, something it never practiced, it understands that you want a poem and writes one, because it has learned the general idea of "follow the instruction the human gave."
A pretrained LLM such as GPT-3 is trained on a self-supervised next-token prediction objective over a large corpus of internet text.[5] After pretraining, the model is an extraordinarily powerful text continuation engine, but it does not natively distinguish "the user wants me to translate this sentence" from "this is the start of a Reddit thread that happens to begin with the word 'Translate'." To extract useful work, GPT-3 era practitioners had to engineer prompts that demonstrated the task. Few-shot prompting, placing a handful of input/output exemplars in the context window before the actual query, became the dominant interface to base models.[5] Zero-shot prompting, where only a task description is given, often worked far worse, since the base model lacked any explicit signal that natural-language descriptions of tasks should be acted upon rather than continued.
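The difference between the two prompting styles can be made concrete with a small sketch. The `Input:` / `Output:` layout and the function names below are illustrative assumptions, not a format prescribed by any particular model.

```python
def zero_shot_prompt(task_description: str, query: str) -> str:
    """Only a natural-language task description precedes the query."""
    return f"{task_description}\n\nInput: {query}\nOutput:"


def few_shot_prompt(exemplars: list[tuple[str, str]], query: str) -> str:
    """Input/output exemplars demonstrate the task before the real query."""
    shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in exemplars)
    return f"{shots}\n\nInput: {query}\nOutput:"


exemplars = [("cheese", "fromage"), ("house", "maison")]
print(few_shot_prompt(exemplars, "cat"))
print("---")
print(zero_shot_prompt("Translate the English word to French.", "cat"))
```

A base model tends to continue the few-shot prompt correctly because the pattern is visible in the text itself; the zero-shot prompt relies on the model treating the description as something to act on.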
Three pressures converged in 2021:
Instruction tuning addressed all three by training models on a meta-task: "given any natural-language description of a task, produce the requested output." The resulting models generalize this skill to unseen instructions at inference time, a property that the FLAN authors called zero-shot task generalization.[2]
FLAN (Finetuned Language Net) was first posted to arXiv on 3 September 2021.[2] Jason Wei, Maarten Bosma and colleagues at Google Research fine-tuned the 137B-parameter LaMDA-PT model on a mixture of 62 publicly available NLP datasets reformulated into a natural-language instruction format. Each dataset was associated with up to 10 manually written instruction templates ("Translate this English sentence to French: {sentence}") to diversify phrasings.
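A minimal sketch of this reformulation step is shown below. Only the first template comes from the text above; the other two templates, the record fields `en` and `fr`, and the function name are hypothetical.

```python
import random

# Several phrasings of the same task; the first is quoted in the text,
# the others are made-up illustrations of template diversity.
TEMPLATES = [
    "Translate this English sentence to French: {sentence}",
    "What is the French translation of the following sentence? {sentence}",
    "{sentence}\n\nRender the sentence above in French.",
]


def to_instruction_example(record: dict, rng: random.Random) -> dict:
    """Turn one raw dataset record into an (instruction, response) pair."""
    template = rng.choice(TEMPLATES)
    return {
        "instruction": template.format(sentence=record["en"]),
        "response": record["fr"],
    }


rng = random.Random(0)
record = {"en": "The cat sat on the mat.", "fr": "Le chat s'est assis sur le tapis."}
for _ in range(3):
    print(to_instruction_example(record, rng)["instruction"])
```

Sampling a template per example means the same underlying record can appear under different phrasings across epochs, which is the diversification the templates are meant to provide.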
The headline result was that FLAN, evaluated zero-shot on a held-out cluster of tasks not seen in fine-tuning, outperformed zero-shot GPT-3 175B on 20 of 25 benchmarks and beat few-shot GPT-3 on several. Crucial ablations showed:
FLAN is conventionally cited as the paper that coined the phrase "instruction tuning" in its modern sense.[2] The follow-up Scaling Instruction-Finetuned Language Models paper (Chung et al., October 2022) scaled the mixture to 1,836 tasks, added chain-of-thought data, and released the Flan-T5 and Flan-PaLM checkpoints; Flan-PaLM 540B reached 75.2% on five-shot MMLU, state-of-the-art at the time, and Flan-T5 became a widely used open base for research.[6]
Almost in parallel, Victor Sanh, Albert Webson, Colin Raffel and the BigScience Workshop posted "Multitask Prompted Training Enables Zero-Shot Task Generalization" on 15 October 2021.[3] Where FLAN started from LaMDA-PT, T0 started from the encoder-decoder T5 (specifically T5+LM-adapted). Where FLAN wrote ~10 templates per task, T0 leveraged P3 (the Public Pool of Prompts), an open crowdsourced library of 2,052 prompt templates across 177 datasets. T0 fine-tuned on a curated subset, holding out four task clusters for evaluation.
Despite being 11B parameters (16x smaller than GPT-3 175B), T0 matched or exceeded zero-shot GPT-3 on 9 of 11 held-out datasets. T0 also showed that prompt diversity per task was a key driver: using multiple prompts for the same dataset improved generalization compared to a single template, evidence that the model was learning instruction-following rather than memorizing template surface form.
FLAN and T0 are usually credited jointly as the canonical origins of large-scale instruction tuning. They diverged stylistically, with Google's manually-written templates versus BigScience's crowd-sourced P3, but agreed on the central claim: train on diverse natural-language task descriptions and the model learns to follow new ones.
OpenAI's "Training language models to follow instructions with human feedback," posted 4 March 2022 (arXiv:2203.02155), introduced InstructGPT and welded instruction tuning together with preference learning into a three-stage pipeline that has shaped post-training ever since.[4]
The most influential empirical finding was that human evaluators preferred the 1.3B InstructGPT outputs to those of the 175B base GPT-3, despite a 100x parameter gap. InstructGPT also showed measurable improvements in truthfulness on TruthfulQA and reductions in toxic and biased generations. The paper's section on alignment framing (helpful, honest, harmless) became influential in its own right.
Crucially, InstructGPT did not invent any single component (SFT was old; RLHF had appeared in the InstructGPT authors' earlier work on summarization and in Anthropic's pre-Claude research). What it did was demonstrate that the combination, instruction-style SFT followed by preference-based RL, yielded a dramatically better assistant than any single ingredient. This recipe was the basis of ChatGPT, released in November 2022, and has remained the default template through 2026, with later models substituting DPO, rejection sampling, or hybrid schemes for the PPO stage but keeping the underlying SFT-on-instructions step intact.
The dominant data format, popularized by FLAN and rigidified by Stanford Alpaca, treats each training example as an (instruction, optional input, output) triplet rendered into a chat-style prompt.[7] The Alpaca template, used by hundreds of subsequent open-source projects, looks like:
Below is an instruction that describes a task, paired with an input
that provides further context. Write a response that appropriately
completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:
{output}
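Rendering a triplet into this template might look like the sketch below. The function name and dictionary keys are assumptions, and the shorter preamble used when there is no input is an assumed wording modelled on the template above.

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

# Assumed wording for examples without an input: same scaffold, no Input section.
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def render(example: dict) -> tuple[str, str]:
    """Return (prompt, response) so the two parts can be handled separately."""
    if example.get("input"):
        prompt = PROMPT_WITH_INPUT.format(
            instruction=example["instruction"], input=example["input"]
        )
    else:
        prompt = PROMPT_NO_INPUT.format(instruction=example["instruction"])
    return prompt, example["output"]


prompt, response = render(
    {"instruction": "Add the numbers.", "input": "2 and 3", "output": "5"}
)
print(prompt + response)
```

Keeping the prompt and the response as separate strings is a deliberate choice here: it makes it straightforward to treat the two parts differently later in training.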
When the task is self-contained, the Input field is dropped. Modern chat-trained models (Llama-3-Instruct, Mistral-Instruct, Qwen, Gemma) replace this scaffolding with a chat template: special tokens marking system, user, and assistant turns. The two formats are functionally equivalent, but chat templates make multi-turn data trivial to express:
<|system|>You are a helpful assistant.
<|user|>Translate to French: The cat sat on the mat.
<|assistant|>Le chat s'est assis sur le tapis.
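A toy serializer for this layout is sketched below. The role tokens simply mirror the illustration above; each real model family defines its own special tokens and separators, so `ROLE_TOKENS` and `render_chat` are hypothetical.

```python
# Role markers copied from the illustration above; real models differ.
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}


def render_chat(messages: list[dict]) -> str:
    """Serialize {"role", "content"} turns, one turn per line."""
    return "\n".join(ROLE_TOKENS[m["role"]] + m["content"] for m in messages)


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Translate to French: The cat sat on the mat."},
    {"role": "assistant", "content": "Le chat s'est assis sur le tapis."},
    {"role": "user", "content": "Now in Spanish."},
    {"role": "assistant", "content": "El gato se sentó en la alfombra."},
]
print(render_chat(conversation))
```

Adding a follow-up turn is just another entry in the list, which is why multi-turn data is easy to express in this format.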
Instruction tuning uses the same next-token cross-entropy loss as pretraining, but with a critical difference: the loss is computed only on the response tokens, with the instruction and input tokens masked out. This matters for two reasons. First, learning to generate instructions is not the goal; the model only needs to condition on them. Second, applying loss to the instruction would encourage the model to memorize the prompt distribution rather than learn the task. Hugging Face's TRL SFTTrainer, Axolotl, Llama Recipes, and most other open-source instruction-tuning libraries support this response-only masking, although whether it is enabled by default depends on the library, its version, and the dataset format.[8]
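Response-only masking can be sketched in plain Python. This is a framework-free illustration rather than any library's implementation; the label value -100 follows a common convention for ignored positions, and the function names and toy token ids are assumptions.

```python
import math

IGNORE_INDEX = -100  # common convention for "no loss at this position"


def build_labels(prompt_ids: list[int], response_ids: list[int]):
    """Concatenate prompt and response; mask prompt positions in the labels."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels


def masked_cross_entropy(logits: list[list[float]], labels: list[int]) -> float:
    """Mean negative log-likelihood over unmasked target positions.

    logits[t] scores the token at position t + 1, so it is compared
    against labels[t + 1] (the usual one-step shift).
    """
    total, count = 0.0, 0
    for t in range(len(labels) - 1):
        target = labels[t + 1]
        if target == IGNORE_INDEX:
            continue  # prompt token: contributes nothing to the loss
        row = logits[t]
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]
        count += 1
    return total / max(count, 1)


prompt_ids, response_ids = [0, 1], [2, 3]  # toy ids from a 4-token vocabulary
input_ids, labels = build_labels(prompt_ids, response_ids)
uniform = [[0.0] * 4 for _ in input_ids]
print(labels)
print(round(masked_cross_entropy(uniform, labels), 4))
```

With uniform logits over four tokens the loss is ln 4, averaged over the two response tokens only; changing the logits at positions that predict prompt tokens leaves the value unchanged.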
Beyond loss masking, the training recipe is unsurprising: AdamW optimizer, modest learning rates (1e-5 to 2e-5 for full fine-tuning; ~1e-4 for LoRA), 1-3 epochs, packing of short sequences, and standard transformer training infrastructure. The simplicity of the recipe is one reason instruction tuning has spread so quickly.
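Packing, mentioned above, can be illustrated with a greedy first-fit-decreasing sketch. This is one simple strategy among several, not the method of any specific library; `pack_sequences` and the toy lengths are hypothetical.

```python
def pack_sequences(lengths: list[int], max_len: int) -> list[list[int]]:
    """Greedy first-fit-decreasing packing; returns bins of example indices."""
    bins: list[list[int]] = []
    space: list[int] = []  # remaining room in each bin
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        if lengths[i] > max_len:
            raise ValueError(f"example {i} is longer than max_len")
        for b, room in enumerate(space):
            if lengths[i] <= room:
                bins[b].append(i)
                space[b] -= lengths[i]
                break
        else:
            # No existing bin has room: open a new one.
            bins.append([i])
            space.append(max_len - lengths[i])
    return bins


token_lengths = [5, 3, 2, 7, 1]
print(pack_sequences(token_lengths, max_len=8))
```

Here five short examples fit into three training sequences of length 8 instead of five padded ones. A real implementation would also need to stop attention and loss from leaking across the packed examples, which this sketch does not cover.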
In the canonical post-training pipeline,
pretraining -> SFT (instruction tuning) -> preference learning (RLHF / DPO / ...) -> RL on verifiable rewards (optional) -> deployment
instruction tuning sits immediately after pretraining and immediately before preference learning. The SFT stage initializes the model's response style and core behaviors; preference learning then refines them with respect to subjective qualities like helpfulness, conciseness, and safety that are hard to capture in demonstration data alone.[4]
It is possible to run SFT alone, without preference learning (many early open models from 2023, such as Vicuna, did this), and possible (though unusual) to skip SFT and apply preference learning directly to the base model. In practice both stages contribute, and most production recipes use both.
A central practical problem in early instruction tuning was cost: high-quality human-written instruction data is expensive. Several lines of research showed that LLMs themselves could be used to generate or augment instruction data, producing surprisingly capable open models at a tiny fraction of human-labor cost.
Wang et al.'s Self-Instruct (arXiv:2212.10560, December 2022) introduced a near-fully automated pipeline that bootstraps instruction data from an LLM's own outputs.[9] Starting from 175 manually written seed tasks, the pipeline:
Applied to vanilla GPT-3 (davinci), Self-Instruct generated 52,000 instructions and yielded a 33-point absolute improvement on Super-NaturalInstructions, closing the gap to InstructGPT-001 to within 5 points, at near-zero marginal cost beyond API usage.
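One ingredient of such a bootstrapping pipeline is a novelty filter that discards generated instructions too similar to ones already in the pool. The Self-Instruct paper describes a ROUGE-L similarity threshold of 0.7 for this purpose; the sketch below follows that idea with a simplified whitespace tokenizer and an LCS-based F-measure, so its scores will not exactly match a reference ROUGE implementation. All function names are hypothetical.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(a: str, b: str) -> float:
    """Simplified ROUGE-L F-measure over lowercased whitespace tokens."""
    ta, tb = a.lower().split(), b.lower().split()
    lcs = lcs_length(ta, tb)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(ta), lcs / len(tb)
    return 2 * precision * recall / (precision + recall)


def filter_novel(candidates: list[str], pool: list[str], threshold: float = 0.7) -> list[str]:
    """Keep a candidate only if it is dissimilar to everything seen so far."""
    kept: list[str] = []
    for cand in candidates:
        if all(rouge_l_f1(cand, seen) < threshold for seen in pool + kept):
            kept.append(cand)
    return kept


pool = ["Translate the sentence into French."]
candidates = ["Translate the sentence into German.", "Write a haiku about autumn."]
print(filter_novel(candidates, pool))
```

The near-duplicate translation instruction is dropped and the haiku instruction is kept, which is how the filter pushes the generated pool toward diversity.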
Stanford CRFM's Alpaca, released 13 March 2023, applied the Self-Instruct pipeline to text-davinci-003 and used the resulting 52K-example dataset to instruction-tune Meta's LLaMA 7B for a total compute and API spend of under $600.[7] Alpaca's outputs were judged comparable to text-davinci-003 on a 252-prompt evaluation set authored by the Self-Instruct team. Alpaca's enormous influence stemmed not from technical novelty but from demonstrating that anyone with a few GPUs could fine-tune a small instruction-following model, which helped catalyze the 2023 explosion of open instruction-tuned LLaMA derivatives.
Within weeks of Alpaca, a series of related open instruction-tuned models followed:
A second wave from late 2023 onward focused on scale and quality of synthetic data:
The general trend has been from raw quantity toward carefully filtered, diverse, and complex mixtures, with empirical results suggesting that, once a mixture clears a certain quality bar, adding more low-quality data hurts more than it helps.
Early instruction tuning datasets were almost exclusively single-turn: one instruction, one response. Production chat assistants must handle long multi-turn conversations where context, references ("the previous answer"), and follow-up questions abound. Modern instruction-tuning mixes therefore deliberately include:
Instruction tuning is also widely used to specialize general LLMs to specific domains. Representative examples:
A consistent finding is that some general instruction data should remain in the mix even when specializing; pure-domain SFT often degrades the model's general conversational abilities.
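Keeping general data in a domain mixture can be sketched as a simple sampling step. The 20% share used in the example is a hypothetical value chosen for illustration, not a recommendation from the literature, and `mix_datasets` is an assumed helper name.

```python
import random


def mix_datasets(domain: list, general: list, general_fraction: float,
                 rng: random.Random) -> list:
    """Return all domain examples plus enough sampled general examples
    for general data to make up roughly general_fraction of the result."""
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    n_general = round(len(domain) * general_fraction / (1 - general_fraction))
    n_general = min(n_general, len(general))  # cannot sample more than exists
    mixed = list(domain) + rng.sample(list(general), n_general)
    rng.shuffle(mixed)
    return mixed


domain = [f"domain-{i}" for i in range(80)]
general = [f"general-{i}" for i in range(100)]
mixed = mix_datasets(domain, general, general_fraction=0.2, rng=random.Random(0))
print(len(mixed), sum(x.startswith("general") for x in mixed))
```

In practice the right share is an empirical question that depends on the domain and on how much general ability the specialized model is expected to retain.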
By 2025-26 a number of well-documented modern recipes had crystallized.
Meta's Llama-3 post-training report described a multi-round pipeline that interleaves SFT, rejection sampling (RS), and DPO across six rounds, deliberately favoring these simpler methods over PPO-style reinforcement learning, with each round generating new synthetic preference and instruction data against the previous-round checkpoint.[16] The SFT stage in particular used ~10M examples spanning ~25 instruction-following categories, with significant emphasis on filtering and on adding tool-use, multilingual, and reasoning data. Llama 3.1 and 3.3 retained this template, adjusting mixture composition and adding longer-context instruction data.
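Rejection sampling for SFT data can be sketched as best-of-n selection: sample several candidate responses from the current checkpoint, score them, and keep the best as a new training example. The `generate` and `score` callables below are toy stand-ins for a model and a reward model; this is a generic illustration, not Meta's implementation.

```python
from typing import Callable


def rejection_sample(prompt: str,
                     generate: Callable[[str], str],
                     score: Callable[[str, str], float],
                     n: int = 4) -> dict:
    """Draw n candidates and keep the highest-scoring one as an SFT example."""
    candidates = [generate(prompt) for _ in range(n)]
    best = max(candidates, key=lambda response: score(prompt, response))
    return {"instruction": prompt, "response": best}


# Toy stand-ins: a canned "model" and a hand-written "reward model".
canned = iter([
    "Paris",
    "The capital of France is Paris.",
    "I don't know.",
    "paris?",
])


def toy_generate(prompt: str) -> str:
    return next(canned)


def toy_score(prompt: str, response: str) -> float:
    # Reward mentioning the right answer, with a small bonus for a full sentence.
    return float("Paris" in response) + 0.1 * response.endswith(".")


print(rejection_sample("What is the capital of France?", toy_generate, toy_score))
```

The selected pair can then be added to the next round's SFT mixture, which is what lets each round train on data produced by the previous checkpoint.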
Allen AI's Tülu 3 (Lambert et al., 2024, with continued releases into 2025) was released as a fully open recipe (datasets, training code, model weights) encompassing SFT, DPO, and a final RL stage using verifiable rewards (RLVR). Its SFT mixture combined human-written, persona-driven synthetic, and math/code instruction data, all curated and decontaminated against the evaluation sets.[17] Tülu 3 is widely cited in 2026 as the canonical fully-open reference for modern instruction tuning.
Smaller community projects iterated on the recipe at modest cost:
Evaluating instruction-following is intrinsically hard: there is rarely a single "correct" answer, and benchmarks designed for the few-shot pretrained era (MMLU, HellaSwag, GSM8K) measure capability more than instruction-following. Specialized harnesses appeared:
None of these is decisive on its own: LLM judges have known biases (length, formatting, position) and human raters disagree with one another, so most teams report several together.
| Concept | Relationship to instruction tuning |
|---|---|
| Fine-tuning | Instruction tuning is a kind of fine-tuning: the multi-task, instruction-formatted kind. |
| SFT | The training mechanism (supervised next-token loss on responses). Instruction tuning is the use of SFT for instruction-following data. |
| RLHF | A subsequent stage after instruction tuning, using preferences instead of demonstrations. |
| DPO | A simpler alternative to RLHF's reward-model + PPO pipeline; same role in the stack. |
| Chain-of-thought | Modern instruction-tuning data often includes CoT reasoning to teach explicit step-by-step responses. |
| Tool use | Implemented largely through instruction tuning on tool-call traces. |
| Alignment | Instruction tuning is one of several alignment techniques; preference learning is another. |
| Prompt engineering | Instruction tuning largely removes the need for elaborate prompt engineering when working with instruction-tuned models. |
Instruction tuning is not a panacea. Known limitations include: