# Instruction Tuning

> Source: https://aiwiki.ai/wiki/instruction_tuning
> Updated: 2026-06-21
> Categories: Deep Learning, Large Language Models, Machine Learning, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Instruction tuning** is the training stage after pretraining in which a [large language model](/wiki/large_language_model) (LLM) is fine-tuned on a curated collection of (instruction, response) pairs so that it learns to follow natural-language instructions.[^1] In the words of the paper that named the technique, it is "finetuning language models on a collection of tasks described via instructions."[^2] The process transforms a raw next-token predictor (which by default simply continues whatever text it is given) into a model that interprets a user's request as a *task to be executed* and produces a helpful answer. Instruction tuning is the step that first turns a base model such as a pretrained [GPT-3](/wiki/gpt3) into a usable assistant, and it is a large part of what separates a text-completion engine from products such as [ChatGPT](/wiki/chatgpt), [Claude](/wiki/claude), and [Gemini](/wiki/gemini).

## What is instruction tuning?

Instruction tuning teaches a model a single meta-skill: given any natural-language description of a task, produce the requested output. Rather than training the model on one task at a time, it is fine-tuned on hundreds or thousands of different tasks all phrased as instructions, so that at inference time it generalizes the skill of "follow the instruction" to instructions it has never seen. The defining empirical result is that a 137B-parameter model instruction-tuned on over 60 NLP tasks (FLAN) "surpasses zero-shot 175B GPT-3 on 20 of 25 tasks" without any new pretraining.[^2] A second landmark result is that a 1.3B-parameter instruction-and-preference-tuned model (InstructGPT) produced outputs that human labelers preferred to those of the 175B base GPT-3, "despite having 100x fewer parameters."[^4] Both findings established the core lesson of the field: for a model that is already large enough, *behavior* (following instructions) is a bigger lever on usefulness than raw *scale*.

The term and the modern practice were popularized by Wei et al.'s 2021 paper "Finetuned Language Models Are Zero-Shot Learners," which introduced **FLAN** and showed that fine-tuning a 137B-parameter [LaMDA](/wiki/lamda_language_model_for_dialogue_applications) model on more than 60 NLP tasks reformulated as natural-language instructions made it outperform zero-shot GPT-3 on 20 of 25 evaluation benchmarks.[^2] Concurrent work on **T0** by Sanh et al. at the BigScience Workshop reached the same conclusion using a different model family.[^3] In March 2022, OpenAI's **InstructGPT** combined instruction-style [supervised fine-tuning](/wiki/sft) (SFT) with [reinforcement learning from human feedback](/wiki/rlhf) (RLHF), establishing the SFT-then-preference-learning recipe that became the canonical post-training template for ChatGPT, Claude, [Llama](/wiki/llama) Chat, [Gemma](/wiki/gemma), and most other contemporary chat assistants.[^4]

Instruction tuning is the *broad concept* of teaching a model to follow instructions; [SFT](/wiki/sft) is the *mechanism* (next-token cross-entropy on the response portion of an instruction-formatted prompt); RLHF and [DPO](/wiki/dpo) are *subsequent* preference-learning stages that are normally layered on top. Together they form what is now usually called the *post-training* pipeline. By 2026 instruction tuning is a routine, well-understood step in every production-grade chat model, with mature open-source datasets, recipes (e.g. Llama 3.1, [Tülu 3](/wiki/tulu_3)), evaluation harnesses ([MT-Bench](/wiki/mt_bench), [AlpacaEval](/wiki/alpacaeval), [IFEval](/wiki/ifeval), Arena-Hard), and parameter-efficient variants such as [LoRA](/wiki/lora) and QLoRA.

## Explain like I'm 5 (ELI5)

Imagine a parrot that has read every book in the library. It knows tons of words and facts, but if you ask it "Please summarize this story," it might just keep talking about random things instead of answering. The parrot knows the information; it just does not understand what you want it to *do*.

Instruction tuning is showing the parrot lots of examples: "When someone says 'summarize this,' here is what a good summary looks like. When someone says 'translate this,' here is what a translation looks like." After enough examples, the parrot learns the *pattern of being asked to do something and then doing it*. Now when you ask it to write a poem about the ocean, something it never practiced, it understands that you want a poem and writes one, because it has learned the general idea of "follow the instruction the human gave."

## Background

### Pretraining and the zero/few-shot baseline

A pretrained LLM such as GPT-3 is trained on a self-supervised next-token prediction objective over a large corpus of internet text.[^5] After pretraining, the model is an extraordinarily powerful *text continuation* engine, but it does not natively distinguish "the user wants me to translate this sentence" from "this is the start of a Reddit thread that happens to begin with the word 'Translate'." To extract useful work, GPT-3 era practitioners had to engineer prompts that *demonstrated* the task. **Few-shot prompting**, placing a handful of input/output exemplars in the context window before the actual query, became the dominant interface to base models.[^5] **Zero-shot prompting**, where only a task description is given, often worked far worse, since the base model lacked any explicit signal that natural-language descriptions of tasks should be acted upon rather than continued.

### Why do base models need instruction following?

Three pressures converged in 2021:

1. *Practical usability.* Base models were unfriendly to non-experts. Asking GPT-3 "Summarize this paragraph in one sentence" frequently returned more paragraphs, not a summary, unless the prompt was carefully scaffolded with few-shot exemplars.
2. *Scaling-only progress was plateauing for some abilities.* Wei et al. argued that even GPT-3-scale models did not reliably *follow* instructions zero-shot, and that this was a behavioral, not a capability, gap.[^2]
3. *Alignment.* A model trained purely to predict the next token has no obligation to be helpful, truthful, or harmless. OpenAI framed instruction tuning explicitly as an *alignment* technique to make models do what users actually want.[^4]

Instruction tuning addressed all three by training models on a meta-task: "given any natural-language description of a task, produce the requested output." The resulting models generalize this skill to unseen instructions at inference time, a property that the FLAN authors called **zero-shot task generalization**.[^2]

## Origins

### What is FLAN? (Wei et al., 2021)

FLAN (*Finetuned Language Net*) was first posted to arXiv on 3 September 2021.[^2] Jason Wei, Maarten Bosma and colleagues at Google Research fine-tuned the 137B-parameter LaMDA-PT model on a mixture of 62 publicly available NLP datasets reformulated into a natural-language instruction format. Each dataset was associated with up to 10 manually written instruction templates ("Translate this English sentence to French: {sentence}") to diversify phrasings.
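The template idea can be illustrated with a minimal sketch. The three templates, the record fields (`en`, `fr`), and the function name below are hypothetical stand-ins, not FLAN's actual templates or code; the sketch only shows how one raw dataset record can be rendered under several phrasings.

```python
import random

# Hypothetical templates for one translation dataset (FLAN wrote up to ten per dataset).
TEMPLATES = [
    "Translate this English sentence to French: {sentence}",
    "What is the French translation of the following sentence?\n{sentence}",
    "{sentence}\n\nRender the sentence above in French.",
]

def to_instruction_example(record, rng):
    """Turn one raw (source, target) record into an instruction-formatted pair."""
    template = rng.choice(TEMPLATES)  # vary the phrasing across examples
    return {
        "prompt": template.format(sentence=record["en"]),
        "response": record["fr"],
    }

rng = random.Random(0)  # fixed seed so the sketch is reproducible
record = {"en": "The cat sat on the mat.", "fr": "Le chat s'est assis sur le tapis."}
examples = [to_instruction_example(record, rng) for _ in range(5)]
for ex in examples:
    print(repr(ex["prompt"]), "->", ex["response"])
```

Sampling a template per example means the same underlying task appears under several surface forms, which is the diversification the paragraph above describes.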

The headline result was that FLAN, evaluated *zero-shot* on a held-out cluster of tasks not seen in fine-tuning, outperformed zero-shot GPT-3 175B on 20 of 25 benchmarks and beat few-shot GPT-3 on several. Crucial ablations showed:

- *Number of fine-tuning task clusters matters.* Performance on held-out clusters scaled with the number of in-distribution clusters in the mixture.
- *Scale matters.* Smaller LaMDA variants (8B and below) did not benefit from instruction tuning; in some cases instruction tuning *hurt* them. The benefit emerged only at ~68B+ parameters.
- *Natural-language instructions matter.* Variants that replaced instructions with dataset names or no instructions performed substantially worse.

FLAN is conventionally cited as the paper that coined the phrase "instruction tuning" in its modern sense, defining it as "finetuning language models on a collection of tasks described via instructions."[^2] The follow-up *Scaling Instruction-Finetuned Language Models* paper (Chung et al., October 2022) scaled the mixture to 1,836 tasks, added [chain-of-thought](/wiki/chain_of_thought) data, and released the Flan-T5 and Flan-PaLM checkpoints; Flan-PaLM 540B reached 75.2% on five-shot [MMLU](/wiki/mmlu), state-of-the-art at the time, and Flan-T5 became a widely used open base for research.[^6]

### What is T0? (Sanh et al., 2021)

Almost in parallel, Victor Sanh, Albert Webson, Colin Raffel and the BigScience Workshop posted "Multitask Prompted Training Enables Zero-Shot Task Generalization" on 15 October 2021.[^3] Where FLAN started from LaMDA-PT, T0 started from the encoder-decoder T5 (specifically T5+LM-adapted). Where FLAN wrote ~10 templates per task, T0 leveraged **P3 (the Public Pool of Prompts)**, an open crowdsourced library that at the time of the paper contained 2,073 prompts for 177 datasets (about 11.7 prompts per dataset).[^3] T0 fine-tuned on a curated subset, holding out four task clusters for evaluation.

Despite being 11B parameters (16x smaller than GPT-3 175B), T0 matched or exceeded zero-shot GPT-3 on 9 of 11 held-out datasets. T0 also showed that *prompt diversity per task* was a key driver: using multiple prompts for the same dataset improved generalization compared to a single template, evidence that the model was learning instruction-following rather than memorizing template surface form.

FLAN and T0 are usually credited jointly as the canonical origins of large-scale instruction tuning. They diverged stylistically, with Google's manually written templates versus BigScience's crowdsourced P3, but agreed on the central claim: train on diverse natural-language task descriptions and the model learns to follow new ones.

### What is InstructGPT? (Ouyang et al., 2022): the SFT + RLHF combo

OpenAI's "Training language models to follow instructions with human feedback," posted 4 March 2022 (arXiv:2203.02155), introduced **InstructGPT** and welded instruction tuning together with preference learning into a three-stage pipeline that has shaped post-training ever since.[^4] The paper frames its goal as "aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback."[^4]

1. **Supervised fine-tuning (SFT)** on ~13,000 prompt/response pairs written by 40 contracted human labelers. The prompts came both from the labelers themselves and from prompts that users had submitted to the OpenAI API.
2. **Reward model (RM) training** on ~33,000 prompts where labelers ranked four-to-nine sampled outputs from best to worst. A separate model was trained to predict these preferences.
3. **Reinforcement learning** via [Proximal Policy Optimization](/wiki/ppo) (PPO) against the reward model, with a KL penalty to the SFT model, on ~31,000 prompts.
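Stages 2 and 3 can be sketched numerically. The code below is a minimal illustration under stated assumptions, not OpenAI's implementation: it assumes the widely used pairwise log-sigmoid objective for the reward model, treats a ranking of K outputs as K(K-1)/2 ordered pairs, and uses an illustrative KL coefficient `beta`. All function names are hypothetical.

```python
import math
from itertools import combinations

def pairwise_rm_loss(scores_best_to_worst):
    """Mean of -log(sigmoid(r_better - r_worse)) over all pairs in one ranking."""
    # combinations() keeps list order, so the first element of each pair is the better one.
    pairs = list(combinations(scores_best_to_worst, 2))
    # -log(sigmoid(x)) == log(1 + exp(-x))
    losses = [math.log1p(math.exp(-(better - worse))) for better, worse in pairs]
    return sum(losses) / len(losses)

def kl_penalized_reward(rm_score, logp_policy, logp_sft, beta=0.02):
    """RL-stage reward: RM score minus a penalty for drifting away from the SFT model."""
    return rm_score - beta * (logp_policy - logp_sft)

# Reward-model scores for four sampled outputs, listed in the labeler's ranked order.
print(pairwise_rm_loss([2.0, 1.0, 0.5, -1.0]))  # scores agree with the ranking: low loss
print(pairwise_rm_loss([-1.0, 0.5, 1.0, 2.0]))  # scores contradict the ranking: high loss
print(kl_penalized_reward(rm_score=1.0, logp_policy=-4.0, logp_sft=-5.0))
```

The KL term is what keeps the policy from exploiting the reward model by drifting far from the SFT distribution.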

The most influential empirical finding was that human evaluators preferred the 1.3B InstructGPT outputs to those of the 175B base GPT-3, despite a 100x parameter gap; the abstract states that "outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters."[^4] InstructGPT also showed that "InstructGPT models show improvements in truthfulness and reductions in toxic output generation," with measurable gains on TruthfulQA and reductions in toxic and biased generations.[^4] The paper's section on alignment framing (helpful, honest, harmless) became influential in its own right.

Crucially, InstructGPT did not invent any single component (SFT was old; RLHF had appeared in the InstructGPT authors' earlier work on summarization and in Anthropic's pre-Claude research). What it did was demonstrate that the *combination*, instruction-style SFT followed by preference-based RL, yielded a dramatically better assistant than SFT alone. This recipe was the basis of ChatGPT, released in November 2022, and has remained the default template through 2026, with later models substituting [DPO](/wiki/dpo), rejection sampling, or hybrid schemes for the PPO stage but keeping the underlying SFT-on-instructions step intact.

## How does instruction tuning work?

### Data format

The dominant data format, popularized by FLAN and rigidified by Stanford Alpaca, treats each training example as an **(instruction, optional input, output)** triplet rendered into a chat-style prompt.[^7] The Alpaca template, used by hundreds of subsequent open-source projects, looks like:

```
Below is an instruction that describes a task, paired with an input
that provides further context. Write a response that appropriately
completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}
```

When the task is self-contained, the *Input* field is dropped. Modern chat-trained models (Llama-3-Instruct, Mistral-Instruct, Qwen, Gemma) replace this scaffolding with a **chat template**: special tokens marking system, user, and assistant turns. The two formats are functionally equivalent, but chat templates make multi-turn data trivial to express:

```
<|system|>You are a helpful assistant.
<|user|>Translate to French: The cat sat on the mat.
<|assistant|>Le chat s'est assis sur le tapis.
```
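Both formats amount to string rendering, as the following sketch shows. The function names are hypothetical, the shortened preamble used when *Input* is empty is an assumption about how the field is dropped, and the `<|role|>` markers simply mirror the illustrative template above rather than any particular model's real special tokens.

```python
ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input\n"
    "that provides further context. Write a response that appropriately\n"
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

# Assumed no-input variant: same scaffold with the Input section removed.
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that\n"
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)

def render_alpaca(example):
    """Return (prompt, response); the Input section is dropped when it is empty."""
    if example.get("input"):
        prompt = ALPACA_WITH_INPUT.format(
            instruction=example["instruction"], input=example["input"]
        )
    else:
        prompt = ALPACA_NO_INPUT.format(instruction=example["instruction"])
    return prompt, example["output"]

def render_chat(messages):
    """Render a list of {'role', 'content'} turns with illustrative <|role|> markers."""
    return "\n".join(f"<|{m['role']}|>{m['content']}" for m in messages)

prompt, response = render_alpaca(
    {"instruction": "Translate to French.", "input": "The cat sat on the mat.",
     "output": "Le chat s'est assis sur le tapis."}
)
print(prompt + response)
print(render_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Translate to French: The cat sat on the mat."},
    {"role": "assistant", "content": "Le chat s'est assis sur le tapis."},
]))
```

Returning the prompt and response separately keeps the boundary between them explicit, which is what response-only loss masking later relies on.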

### Training objective: loss on response tokens only

Instruction tuning uses the same next-token [cross-entropy](/wiki/cross-entropy) loss as pretraining, but with a critical difference: the loss is computed *only on the response tokens*, with the instruction and input tokens **masked out**. This matters for two reasons. First, learning to *generate* instructions is not the goal; the model only needs to *condition on* them. Second, applying loss to the instruction would encourage the model to memorize the prompt distribution rather than learn the task. Hugging Face's TRL `SFTTrainer`, Axolotl, Llama Recipes, and most other open-source instruction-tuning libraries support this response-only masking, although whether it is on by default depends on the library, its version, and the dataset format.[^8]
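A minimal sketch of response-only masking follows. It assumes the common convention of marking ignored positions with the label `-100` (the default `ignore_index` of PyTorch's cross-entropy loss) and, for simplicity, assumes the per-token log-probabilities are already aligned with the labels; real trainers also handle the one-position shift between inputs and targets. The token ids and function names are hypothetical.

```python
import math

IGNORE_INDEX = -100  # label value skipped by the loss (common PyTorch convention)

def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask every prompt position in the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

def masked_cross_entropy(token_log_probs, labels):
    """Average negative log-likelihood over unmasked (response) positions only.

    token_log_probs[i] is the log-probability the model gave to the true token i.
    """
    kept = [-lp for lp, y in zip(token_log_probs, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept)

prompt_ids = [11, 12, 13]   # hypothetical ids for the instruction tokens
response_ids = [21, 22]     # hypothetical ids for the response tokens
input_ids, labels = build_labels(prompt_ids, response_ids)

# The model is poor on prompt tokens (p=0.1) but that never enters the loss.
log_probs = [math.log(0.1)] * 3 + [math.log(0.5), math.log(0.25)]
loss = masked_cross_entropy(log_probs, labels)
print(labels, loss)
```

Only the two response positions contribute, so the loss here is the mean of `-log(0.5)` and `-log(0.25)` regardless of how the model scores the instruction.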

Beyond loss masking, the training recipe is unsurprising: AdamW optimizer, modest learning rates (1e-5 to 2e-5 for full fine-tuning; ~1e-4 for LoRA), 1-3 epochs, packing of short sequences, and standard transformer training infrastructure. The simplicity of the recipe is one reason instruction tuning has spread so quickly.

### Position in the pipeline

In the canonical post-training pipeline,

```
pretraining -> SFT (instruction tuning) -> preference learning (RLHF / DPO / ...) -> RL on verifiable rewards (optional) -> deployment
```

instruction tuning sits immediately after pretraining and immediately before preference learning. The SFT stage initializes the model's response style and core behaviors; preference learning then refines them with respect to subjective qualities like helpfulness, conciseness, and safety that are hard to capture in demonstration data alone.[^4]

It is possible to do SFT-only without preference learning (most older open models in 2023, e.g. Vicuna, did this), and possible (though unusual) to skip SFT and apply preference learning directly from the base model. In practice both stages contribute, and most production recipes use both.

## Synthetic instruction data

A central practical problem in early instruction tuning was *cost*: high-quality human-written instruction data is expensive. Several lines of research showed that LLMs themselves could be used to generate or augment instruction data, producing surprisingly capable open models at a tiny fraction of human-labor cost.

### Self-Instruct

Wang et al.'s **Self-Instruct** (arXiv:2212.10560, December 2022) introduced a near-fully automated pipeline that bootstraps instruction data from an LLM's own outputs.[^9] Starting from 175 manually written seed tasks, the pipeline:

1. Samples seed tasks and prompts the LLM to generate new task instructions in the same format.
2. Classifies whether each new instruction needs an input field.
3. Prompts the LLM to fill in input(s) and output(s) for each new instruction.
4. Filters duplicates and low-quality items using ROUGE-L similarity and heuristics.
5. Adds the survivors back to the pool and iterates.

Applied to vanilla GPT-3 (`davinci`), Self-Instruct generated 52,000 instructions and yielded a 33-point absolute improvement on Super-NaturalInstructions, putting the model "on par with the performance of InstructGPT-001," at near-zero marginal cost beyond API usage.[^9]
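The deduplication part of step 4 can be sketched in pure Python. This is an illustration, not the authors' code: ROUGE-L is approximated as an F-measure over whitespace tokens using the longest common subsequence, and the `0.7` threshold and function names are assumptions chosen for the example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F-measure on lowercased whitespace tokens (a simplification)."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def filter_new_instructions(pool, candidates, threshold=0.7):
    """Keep a candidate only if it is dissimilar to everything already accepted."""
    kept = []
    for cand in candidates:
        if all(rouge_l_f1(cand, old) < threshold for old in pool + kept):
            kept.append(cand)
    return kept

pool = ["Translate the sentence into French."]
candidates = [
    "Please translate the sentence into French.",  # near-duplicate of the pool item
    "Write a haiku about autumn.",                 # genuinely new task
]
survivors = filter_new_instructions(pool, candidates)
print(survivors)
```

Checking each candidate against both the existing pool and the candidates already kept in the same round prevents a batch of near-identical generations from all slipping through together.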

### Alpaca

Stanford CRFM's **Alpaca**, released 13 March 2023, applied the Self-Instruct pipeline to `text-davinci-003` and used the resulting 52K-example dataset to instruction-tune Meta's [LLaMA](/wiki/llama) 7B for a total compute and API spend of under $600.[^7] Stanford reports that the data generation alone "costed less than $500 using the OpenAI API," with fine-tuning adding "less than $100 on most cloud compute providers."[^7] Alpaca's outputs were judged comparable to `text-davinci-003` on a 252-prompt evaluation set authored by the Self-Instruct team. Alpaca's influence stemmed not from technical novelty but from demonstrating that *anyone* with a few GPUs could fine-tune a small instruction-following model, which helped catalyze the 2023 explosion of open instruction-tuned LLaMA derivatives.

### Vicuna, WizardLM, Dolly, OpenAssistant

Within weeks of Alpaca, a series of related open instruction-tuned models followed:

- **[Vicuna](/wiki/vicuna)** (LMSYS, UC Berkeley, March 2023): LLaMA 7B/13B fine-tuned on ~70K real user-ChatGPT dialogues collected from ShareGPT, a now-defunct browser extension that let users share their ChatGPT sessions. Vicuna was judged by GPT-4 to reach ~90% of ChatGPT quality on the team's evaluation, and the GPT-4-as-judge methodology in their report directly preceded MT-Bench.[^10]
- **WizardLM** (Xu et al., April 2023, arXiv:2304.12244): introduced *[Evol-Instruct](/wiki/evol_instruct)*, an LLM-driven procedure for progressively rewriting seed instructions into harder, deeper, or broader variants.[^11]
- **Dolly v2** ([Databricks](/wiki/databricks), April 2023): 15K human-written instructions ("databricks-dolly-15k") covering seven task categories, released under a commercial-friendly license to provide an alternative to OpenAI-distilled data.
- **OpenAssistant** (LAION et al., April 2023): a multilingual, crowdsourced multi-turn conversation dataset (OASST1, ~161K messages across 35 languages) released publicly along with trained chat models.[^12]

### UltraChat, OpenHermes, Orca

A second wave, running through 2023, focused on the *scale* and *quality* of synthetic data:

- **UltraChat** (Ding et al., April 2023): 1.5 million multi-turn dialogues generated by having two ChatGPT instances chat with each other across systematically varied topics.[^13]
- **Orca** (Mukherjee et al., June 2023, arXiv:2306.02707): distilled responses from GPT-4 that included *explanation traces*, system instructions, and step-by-step reasoning rather than just answers, aimed at giving smaller students access to the "thinking" of the larger teacher.[^14]
- **OpenHermes / Nous Hermes** (Nous Research): community-curated mixtures combining filtered subsets of Alpaca, GPTeacher, Camel, ShareGPT, and other sources; widely used as the SFT mix for many open chat models.
- **ShareGPT, ChatBot-Arena conversations, WildChat**: corpora of *real* human-chatbot conversations, valued because they capture authentic user phrasing rather than the synthetic distribution generated by Self-Instruct-style pipelines.

The general trend has been from raw quantity toward *carefully filtered*, *diverse*, and *complex* mixtures, with empirical results suggesting that once a mixture covers enough high-quality examples, adding more low-quality data hurts more than it helps.

## Multi-turn and tool-use instruction tuning

Early instruction tuning datasets were almost exclusively single-turn: one instruction, one response. Production chat assistants must handle long multi-turn conversations where context, references ("the previous answer"), and follow-up questions abound. Modern instruction-tuning mixes therefore deliberately include:

- **Multi-turn conversations**: UltraChat, ShareGPT, OASST, WildChat, and the LMSYS-Chat-1M / WildBench corpora. The chat template explicitly marks user and assistant turns so that loss is computed only on assistant turns across the whole conversation.
- **Tool-use / function-calling traces**: instruction examples whose responses are not natural language but structured calls to external tools, with subsequent assistant turns conditioned on tool outputs. The **ToolBench** dataset (Qin et al., 2023) introduced ~16,000 real-world APIs with synthetic tool-use trajectories.[^15] **Gorilla** (Patil et al.) focused on API-call generation. By 2026 every frontier chat model includes substantial [tool-use](/wiki/tool_use) examples in its instruction-tuning mix, and frontier API products such as OpenAI's *function calling* and Anthropic's *tool use* are surface-level manifestations of this training data.
- **Agent trajectories**: ReAct-style traces interleaving reasoning, action, and observation are used to teach LLMs to act over multiple turns in browser, terminal, or game-playing environments.
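A tool-use training example and its turn-level loss mask might look like the sketch below. The message schema, the `get_weather` tool, and the `<|role|>` markers are all hypothetical; real chat templates and function-calling formats differ between model families.

```python
import json

# A hypothetical multi-turn trace: the first assistant turn is a structured
# tool call, and the second is conditioned on the tool's output.
conversation = [
    {"role": "system", "content": "You can call get_weather(city)."},
    {"role": "user", "content": "Do I need an umbrella in Paris today?"},
    {"role": "assistant",
     "content": json.dumps({"tool": "get_weather", "arguments": {"city": "Paris"}})},
    {"role": "tool", "content": json.dumps({"forecast": "rain", "high_c": 14})},
    {"role": "assistant",
     "content": "Yes. Rain is forecast in Paris today, so take an umbrella."},
]

def training_segments(messages, trainable_roles=("assistant",)):
    """Pair each rendered turn with a flag saying whether its tokens receive loss."""
    return [
        (f"<|{m['role']}|>{m['content']}", m["role"] in trainable_roles)
        for m in messages
    ]

for text, trainable in training_segments(conversation):
    print("LOSS " if trainable else "MASK ", text)
```

System, user, and tool turns are context only; the model is trained to produce both the structured call and the final natural-language answer.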

## Domain-specific instruction tuning

Instruction tuning is also widely used to specialize general LLMs to specific domains. Representative examples:

- **Code**: WizardCoder, Magicoder, [Code Llama](/wiki/code_llama)-Instruct, OctoPack/OctoCoder, DeepSeek-Coder-Instruct, Qwen2.5-Coder-Instruct. Instruction mixes here typically combine code-completion problems, code-explanation tasks, debugging dialogues, and synthetic code-instruction data such as Code Alpaca.
- **Mathematical reasoning**: MetaMath, WizardMath, MathInstruct, MAmmoTH, Skywork-MathQA, NuminaMath. Many of these use *rejection sampling on verifiable answers* to bootstrap reasoning-heavy instruction data: generate many candidate solutions and keep only those whose final answer matches the gold answer.
- **Medical**: Med-PaLM and Med-PaLM 2 (Google), Med42, MedAlpaca, ChatDoctor, BioMistral. Typically combine licensed medical Q&A, board-exam corpora, and clinician-written demonstrations.
- **Legal, finance, scientific writing**: LawGPT, FinGPT, Galactica-derived science assistants, built by mixing domain corpora into the SFT stage.

A consistent finding is that *some* general instruction data should remain in the mix even when specializing; pure-domain SFT often degrades the model's general conversational abilities.
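The rejection-sampling idea mentioned under mathematical reasoning can be sketched as follows. The sampler here is a stub that replays canned solutions in place of a real model, and taking the last number in the text as the final answer is a crude assumption; production pipelines use more careful answer parsing. All names are hypothetical.

```python
import re

def extract_final_answer(solution):
    """Return the last number in the text, a crude stand-in for real answer parsing."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution)
    return numbers[-1] if numbers else None

def rejection_sample(problem, gold_answer, sample_fn, num_samples=8):
    """Keep only sampled solutions whose final answer matches the gold answer."""
    kept = []
    for _ in range(num_samples):
        solution = sample_fn(problem)
        if extract_final_answer(solution) == gold_answer:
            kept.append({"instruction": problem, "output": solution})
    return kept

# Stub sampler: replays canned "model outputs" instead of calling an LLM.
canned = iter([
    "3 boxes of 4 pens is 3 * 4 = 12 pens. The answer is 12.",
    "3 + 4 = 7. The answer is 7.",
    "4 + 4 + 4 = 12. The answer is 12.",
])

def fake_sampler(problem):
    return next(canned)

kept = rejection_sample(
    "A box holds 4 pens. How many pens are in 3 boxes?", "12", fake_sampler, num_samples=3
)
for example in kept:
    print(example["output"])
```

The surviving solutions become new SFT examples; note that a matching final answer does not guarantee that the intermediate reasoning is sound.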

## Modern recipes

By 2025-26 a number of well-documented modern recipes had crystallized.

### Llama 3 / 3.1 / 3.3

Meta's Llama-3 post-training report described a multi-round pipeline that interleaves SFT, **rejection sampling** (RS), and [DPO](/wiki/dpo) across six rounds, with each round generating new synthetic preference and instruction data against the previous-round checkpoint.[^16] The SFT stage in particular used ~10M examples spanning ~25 instruction-following categories, with significant emphasis on filtering and on adding tool-use, multilingual, and reasoning data. Llama 3.1 and 3.3 retained this template, adjusting mixture composition and adding longer-context instruction data.

### Tülu 3

Allen AI's [**Tülu 3**](/wiki/tulu_3) (Lambert et al., 2024, with continued releases into 2025) was released as a fully open recipe (datasets, training code, model weights) encompassing SFT, DPO, and a final RL stage using *verifiable rewards* (RLVR). Its SFT mixture combined human-written, persona-driven synthetic, and math/code instruction data, all curated and decontaminated against the evaluation suite.[^17] Tülu 3 is widely cited in 2026 as the canonical fully-open reference for modern instruction tuning.

### OpenChat, Zephyr, NeuralChat, Hermes 3

Smaller community projects iterated on the recipe at modest cost:

- **OpenChat** (Wang et al.): used "C-RLFT" (conditioned RL fine-tuning) to learn from mixed-quality data labeled by source.
- **Zephyr** (Hugging Face, Tunstall et al.): demonstrated that DPO on UltraFeedback, on top of SFT on UltraChat, could reach competitive MT-Bench scores at the 7B scale.
- **Hermes 3** (Nous Research, 2024): a heavily community-curated SFT mix on top of Llama 3, emphasizing steerability and reduced refusal.

## How is instruction following evaluated?

Evaluating instruction-following is intrinsically hard: there is rarely a single "correct" answer, and benchmarks designed for the few-shot pretrained era (MMLU, HellaSwag, GSM8K) measure capability more than instruction-following. Specialized harnesses appeared:

| Benchmark | Size | Judge | What it measures |
|---|---|---|---|
| [MT-Bench](/wiki/mt_bench) (LMSYS, 2023) | 80 multi-turn questions, 8 categories | GPT-4, 1-10 scale | Multi-turn chat quality; introduced GPT-4-as-judge for chat models.[^10] |
| [AlpacaEval](/wiki/alpacaeval) / 2.0 | 805 instructions | LLM judge, head-to-head win-rate | Helpfulness vs a reference model; 2.0 added length-controlled scoring.[^18] |
| [IFEval](/wiki/ifeval) (Google, 2023) | 541 prompts, 25 verifiable instruction types | Programmatic (no LLM judge) | Formal, *verifiable* constraint-following ("exactly 3 paragraphs, keyword X twice").[^19] |
| Arena-Hard (LMSYS) | 500 challenging real-user prompts | GPT-4, head-to-head | Hard real-world prompts mined from Chatbot Arena. |
| [Chatbot Arena](/wiki/lmsys_chatbot_arena) | Open, millions of votes | Human pairwise votes -> Elo | Realistic head-to-head ranking of deployed instruction-tuned models. |

None of these is fully decisive, as LLM judges have known biases (length, formatting, position) and humans disagree, so most teams report several together.
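Programmatic checking of the kind IFEval uses can be illustrated with a small sketch. The two checkers below are hypothetical simplifications written for this example, not IFEval's own implementation; they cover the "exactly 3 paragraphs, keyword X twice" style of constraint from the table.

```python
def has_exact_paragraphs(response, n):
    """True if the response has exactly n blank-line-separated paragraphs."""
    paragraphs = [p for p in response.split("\n\n") if p.strip()]
    return len(paragraphs) == n

def keyword_appears(response, keyword, times):
    """True if the keyword occurs exactly `times` times (case-insensitive substring match)."""
    return response.lower().count(keyword.lower()) == times

def check_response(response, checks):
    """Return per-instruction results plus whether every instruction was followed."""
    results = [check(response) for check in checks]
    return results, all(results)

checks = [
    lambda r: has_exact_paragraphs(r, 3),
    lambda r: keyword_appears(r, "tuning", 2),
]

good = (
    "Instruction tuning teaches models to follow requests.\n\n"
    "It relies on supervised tuning over demonstrations.\n\n"
    "Evaluation remains difficult."
)
bad = "Instruction tuning is useful."

print(check_response(good, checks))
print(check_response(bad, checks))
```

Because every check is a deterministic function of the response text, no judge model is needed, which sidesteps the judge biases noted above at the cost of covering only formally verifiable constraints.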

## Relationship to neighboring concepts

| Concept | Relationship to instruction tuning |
|---|---|
| [Fine-tuning](/wiki/fine_tuning) | Instruction tuning is a *kind* of fine-tuning: the multi-task, instruction-formatted kind. |
| [SFT](/wiki/sft) | The training mechanism (supervised next-token loss on responses). Instruction tuning is the *use* of SFT for instruction-following data. |
| [RLHF](/wiki/rlhf) | A *subsequent* stage after instruction tuning, using preferences instead of demonstrations. |
| [DPO](/wiki/dpo) | A simpler alternative to RLHF's reward-model + PPO pipeline; same role in the stack. |
| [Chain-of-thought](/wiki/chain_of_thought) | Modern instruction-tuning data often includes CoT reasoning to teach explicit step-by-step responses. |
| [Tool use](/wiki/tool_use) | Implemented largely *through* instruction tuning on tool-call traces. |
| Alignment | Instruction tuning is one of several alignment techniques; preference learning is another. |
| Prompt engineering | Largely *replaces* the need for elaborate prompt engineering on instruction-tuned models. |

## How does instruction tuning differ from RLHF?

Instruction tuning (via SFT) and [RLHF](/wiki/rlhf) are sequential, complementary stages, not competitors. Instruction tuning learns from *demonstrations*: a human (or a teacher model) writes the ideal response, and the model is trained with cross-entropy to reproduce it. RLHF learns from *comparisons*: humans rank model-sampled outputs, a reward model is fit to those rankings, and the policy is optimized to score highly. The practical difference is that SFT can only teach behaviors for which someone wrote an example, while preference learning can push the model toward qualities (concision, tone, refusing harmful requests, avoiding hedging) that are easy to recognize but tedious to demonstrate. InstructGPT showed that adding the preference stage on top of SFT beat SFT alone, which is why the standard 2026 recipe is SFT first, then a preference stage ([RLHF](/wiki/rlhf), [DPO](/wiki/dpo), or RLVR).[^4]

## Limitations

Instruction tuning is not a panacea. Known limitations include:

- **Superficial alignment / style mimicry.** The LIMA paper (Zhou et al., 2023, arXiv:2305.11206) argued that instruction tuning teaches the *format and style* of helpful responses rather than new capability; fine-tuning LLaMA 65B on just 1,000 carefully curated examples produced responses that were judged equivalent or preferable to GPT-4's in 43% of cases on their evaluation.[^20] Subsequent work has both supported and pushed back on this superficial-alignment hypothesis.
- **Hallucinations inherited from data.** Synthetic instruction data generated by a teacher model inherits that teacher's confabulations. Students trained on such data can become *more* fluent at producing plausible-sounding but incorrect content.
- **Distribution shift from real users.** Models trained on Self-Instruct/Alpaca-style data tend to perform better on prompts that resemble that synthetic distribution than on the messy, ambiguous prompts real users actually submit.
- **Length and verbosity bias.** Instruction-tuning data tends to skew toward longer, more elaborate responses; models then learn to be verbose regardless of query, a bias amplified by length-biased judge LLMs (AlpacaEval 2.0's length-controlled scoring was an explicit response).
- **Catastrophic forgetting.** Aggressive instruction tuning, especially on narrow domains or for too many epochs, can degrade pretraining knowledge. Mixing some generic data and limiting epoch count helps.
- **Refusal training side effects.** Safety-oriented SFT data teaches models to refuse harmful requests, but if not carefully balanced it bleeds into refusing benign requests too ("over-refusal"), or paradoxically into being more easily jailbroken because the model has learned to reason about which requests are refusable.
- **Benchmark contamination.** Many open instruction-tuning mixtures inadvertently contain MMLU/GSM8K/HumanEval items, leading to inflated benchmark numbers that do not reflect real generalization. Decontamination has become a routine but imperfect step.
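One common style of decontamination, n-gram overlap matching, can be sketched as below. The 8-token window, the whitespace tokenization, and the function names are assumptions made for illustration; real pipelines vary in window size and normalization, and none catches paraphrased benchmark items.

```python
def ngrams(text, n):
    """Set of lowercased whitespace-token n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(train_examples, benchmark_items, n=8):
    """Split training examples into (clean, flagged) by n-gram overlap with the benchmark."""
    benchmark_index = set()
    for item in benchmark_items:
        benchmark_index |= ngrams(item, n)
    clean, flagged = [], []
    for example in train_examples:
        overlaps = ngrams(example, n) & benchmark_index
        (flagged if overlaps else clean).append(example)
    return clean, flagged

# Hypothetical benchmark item and training examples.
benchmark_items = ["A farmer has 12 cows and buys 7 more cows at the market on Monday"]
train_examples = [
    "Question: A farmer has 12 cows and buys 7 more cows at the market on Monday. How many cows?",
    "Write a short poem about the ocean at night for a young child to read",
]
clean, flagged = decontaminate(train_examples, benchmark_items)
print("flagged:", flagged)
print("clean:", clean)
```

Exact n-gram matching is cheap enough to run over an entire SFT mixture, which is why it is routine, but its blindness to rewording is one reason the step remains imperfect.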

## See also

- [Fine-tuning](/wiki/fine_tuning)
- [Supervised fine-tuning (SFT)](/wiki/sft)
- [RLHF](/wiki/rlhf)
- [DPO](/wiki/dpo)
- [PPO](/wiki/ppo)
- [InstructGPT](/wiki/instructgpt)
- [Vicuna](/wiki/vicuna)
- [Tülu 3](/wiki/tulu_3)
- [LoRA](/wiki/lora)
- [PEFT](/wiki/peft)
- [Chain-of-thought](/wiki/chain_of_thought)
- [Tool use](/wiki/tool_use)
- [MT-Bench](/wiki/mt_bench)
- [AlpacaEval](/wiki/alpacaeval)
- [IFEval](/wiki/ifeval)
- [Chatbot Arena](/wiki/lmsys_chatbot_arena)

## References

[^1]: Zhang, S. *et al.* (2024). "Instruction Tuning for Large Language Models: A Survey." *ACM Computing Surveys*. https://arxiv.org/abs/2308.10792

[^2]: Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., & Le, Q. V. (2021/2022). "Finetuned Language Models Are Zero-Shot Learners." *ICLR 2022.* https://arxiv.org/abs/2109.01652

[^3]: Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., *et al.* (2021/2022). "Multitask Prompted Training Enables Zero-Shot Task Generalization." *ICLR 2022.* https://arxiv.org/abs/2110.08207

[^4]: Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., *et al.* (2022). "Training language models to follow instructions with human feedback." *NeurIPS 2022.* https://arxiv.org/abs/2203.02155

[^5]: Brown, T. B., Mann, B., Ryder, N., Subbiah, M., *et al.* (2020). "Language Models are Few-Shot Learners." *NeurIPS 2020.* https://arxiv.org/abs/2005.14165

[^6]: Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., *et al.* (2022/2024). "Scaling Instruction-Finetuned Language Models." *JMLR.* https://arxiv.org/abs/2210.11416

[^7]: Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., & Hashimoto, T. B. (2023). "Stanford Alpaca: An Instruction-following LLaMA Model." Stanford CRFM. https://crfm.stanford.edu/2023/03/13/alpaca.html

[^8]: von Werra, L. *et al.* "TRL: Transformer Reinforcement Learning." Hugging Face. https://huggingface.co/docs/trl/sft_trainer

[^9]: Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., & Hajishirzi, H. (2022/2023). "Self-Instruct: Aligning Language Models with Self-Generated Instructions." *ACL 2023.* https://arxiv.org/abs/2212.10560

[^10]: Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., *et al.* (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." *NeurIPS 2023.* https://arxiv.org/abs/2306.05685

[^11]: Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., & Jiang, D. (2023). "WizardLM: Empowering Large Language Models to Follow Complex Instructions." https://arxiv.org/abs/2304.12244

[^12]: Köpf, A., Kilcher, Y., von Rütte, D., *et al.* (2023). "OpenAssistant Conversations: Democratizing Large Language Model Alignment." *NeurIPS 2023 Datasets and Benchmarks.* https://arxiv.org/abs/2304.07327

[^13]: Ding, N., Chen, Y., Xu, B., Qin, Y., Hu, S., Liu, Z., Sun, M., & Zhou, B. (2023). "Enhancing Chat Language Models by Scaling High-quality Instructional Conversations." https://arxiv.org/abs/2305.14233

[^14]: Mukherjee, S., Mitra, A., Jawahar, G., Agarwal, S., Palangi, H., & Awadallah, A. (2023). "Orca: Progressive Learning from Complex Explanation Traces of GPT-4." https://arxiv.org/abs/2306.02707

[^15]: Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., *et al.* (2023). "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs." https://arxiv.org/abs/2307.16789

[^16]: Grattafiori, A., Dubey, A., Jauhri, A., *et al.* (2024). "The Llama 3 Herd of Models." https://arxiv.org/abs/2407.21783

[^17]: Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., Brahman, F., *et al.* (2024/2025). "Tülu 3: Pushing Frontiers in Open Language Model Post-Training." Allen Institute for AI. https://arxiv.org/abs/2411.15124

[^18]: Dubois, Y., Galambosi, B., Liang, P., & Hashimoto, T. (2024). "Length-Controlled AlpacaEval." https://arxiv.org/abs/2404.04475

[^19]: Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y., Zhou, D., & Hou, L. (2023). "Instruction-Following Evaluation for Large Language Models." https://arxiv.org/abs/2311.07911

[^20]: Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., *et al.* (2023). "LIMA: Less Is More for Alignment." *NeurIPS 2023.* https://arxiv.org/abs/2305.11206

