Instruction tuning is a supervised learning technique that fine-tunes a pretrained large language model (LLM) on a dataset of (instruction, output) pairs so the model learns to follow natural language instructions across a wide range of tasks.[1] By exposing the model to diverse task descriptions phrased in ordinary language, instruction tuning bridges the gap between a language model's pretraining objective (next-token prediction) and the practical goal of having the model respond helpfully and accurately to user requests.[2]
The concept was introduced independently by several research groups in 2021 and 2022. Wei et al. (2022) proposed FLAN, which instruction-tuned a 137B-parameter LaMDA model on over 60 NLP tasks and showed that the resulting model outperformed zero-shot GPT-3 on 20 out of 25 evaluation benchmarks.[3] Around the same time, Sanh et al. (2022) introduced T0, demonstrating that multitask prompted training on a collection of tasks with diverse prompt templates enabled zero-shot generalization that matched or exceeded GPT-3 while being 16 times smaller.[4] These findings established instruction tuning as a practical and effective method for improving the zero-shot and few-shot capabilities of language models without changing their architecture.
Instruction tuning has since become a standard step in the training pipeline for most production LLMs, including InstructGPT, ChatGPT, Claude, Llama, Flan-T5, and Gemma. It is typically applied after pretraining and before optional alignment procedures such as reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO).
Imagine you have a really smart parrot that has read every book in the world. It knows tons of words and facts, but if you ask it "Please summarize this story," it might just keep talking about random things instead of giving you a summary. The parrot knows the information, but it doesn't understand what you want it to do.
Instruction tuning is like showing the parrot lots of examples: "When someone says 'summarize this,' here is what a good summary looks like. When someone says 'translate this to French,' here is what a good translation looks like." After seeing enough examples, the parrot learns the pattern. Now when you ask it to do something new it has never practiced before, like "Write a poem about the ocean," it understands that you want a poem and produces one, because it learned the general idea of following instructions.
Instruction tuning is a specialized form of supervised fine-tuning (SFT). While standard fine-tuning adapts a pretrained model to a single downstream task (for example, fine-tuning BERT for sentiment classification), instruction tuning trains the model on many tasks simultaneously, with each task expressed as a natural language instruction.[1] The key distinction is that instruction tuning aims to teach a model the general skill of following instructions rather than optimizing performance on any single task.
In standard fine-tuning, the model learns task-specific input-output mappings. In instruction tuning, each training example includes an explicit description of what the model should do, which encourages the model to generalize to new, unseen instructions at inference time. This is why instruction-tuned models can handle tasks they were never explicitly trained on, a property known as zero-shot task generalization.[3]
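To make the distinction concrete, the sketch below contrasts the two data formats using a sentiment review example; the field names are illustrative rather than taken from any particular dataset.

```python
# Standard fine-tuning: task-specific (input, label) pairs with no instruction.
# A model trained this way learns only this one mapping.
standard_finetuning_example = {
    "text": "This movie was absolutely terrible, a waste of time.",
    "label": "negative",
}

# Instruction tuning: the task itself is stated in natural language, so many
# different tasks can share one training format, and the model can later
# generalize to instructions it has never seen.
instruction_tuning_example = {
    "instruction": "Is the following review positive or negative?",
    "input": "This movie was absolutely terrible, a waste of time.",
    "output": "Negative",
}
```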
Instruction tuning datasets are structured around three components:[2]
- Instruction: a natural language description of the task the model should perform.
- Input (optional): additional context the task operates on, such as a passage to summarize or a sentence to translate.
- Output: the desired response that correctly completes the instruction.
A standard template used in many instruction tuning datasets (including Alpaca) looks like this:
Below is an instruction that describes a task, paired with an input
that provides further context. Write a response that appropriately
completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:
{output}
For tasks that do not require separate input context, the input field is omitted and the instruction alone describes the full task.
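A minimal sketch of how this template can be applied in code is shown below. The prompt text mirrors the template above, including the shorter Alpaca-style preamble used when no input is present; the helper function name is ours.

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)

def format_example(instruction: str, output: str, input_text: str = "") -> str:
    """Render one (instruction, input, output) triple into a single training string."""
    if input_text.strip():
        prompt = PROMPT_WITH_INPUT.format(instruction=instruction, input=input_text)
    else:
        prompt = PROMPT_NO_INPUT.format(instruction=instruction)
    return prompt + output

print(format_example(
    instruction="Translate the following sentence to French.",
    input_text="The cat sat on the mat.",
    output="Le chat s'est assis sur le tapis.",
))
```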
| Instruction | Input | Output |
|---|---|---|
| Translate the following sentence to French. | The cat sat on the mat. | Le chat s'est assis sur le tapis. |
| Is the following review positive or negative? | This movie was absolutely terrible, a waste of time. | Negative |
| Summarize the following paragraph in one sentence. | (a long paragraph about photosynthesis) | Photosynthesis is the process by which plants convert sunlight, water, and carbon dioxide into glucose and oxygen. |
| Write a haiku about winter. | (none) | Snowflakes softly fall / Blanketing the world in white / Silence fills the air |
Early instruction tuning datasets were constructed by reformatting existing NLP benchmarks into this instruction-input-output structure using manually written templates. For example, FLAN converted datasets like SNLI, WMT, and SQuAD into instruction format by writing 10 unique templates per task.[3] Later approaches used LLMs themselves to generate instruction data, as in Self-Instruct and Alpaca.
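The sketch below illustrates this kind of template-based conversion for a translation task. The templates shown are invented for illustration and are not FLAN's actual templates.

```python
import random

# Illustrative instruction templates for an English-to-French translation dataset.
# FLAN wrote up to 10 such phrasings per dataset to encourage instruction diversity.
TRANSLATION_TEMPLATES = [
    "Translate the following sentence to French: {source}",
    'What is the French translation of "{source}"?',
    "Please convert this English text into French: {source}",
]

def to_instruction_example(source: str, target: str) -> dict:
    """Convert one raw (source, target) pair into instruction format
    using a randomly chosen template."""
    template = random.choice(TRANSLATION_TEMPLATES)
    return {"instruction": template.format(source=source), "output": target}

print(to_instruction_example("The cat sat on the mat.",
                             "Le chat s'est assis sur le tapis."))
```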
FLAN (Finetuned Language Net) was one of the first large-scale demonstrations of instruction tuning.[3] The researchers took a 137B-parameter pretrained language model (LaMDA-PT) and fine-tuned it on over 60 NLP datasets, each converted into an instruction format using manually written templates. Each dataset was associated with up to 10 different instruction templates to promote diversity.
The key finding was that FLAN substantially outperformed the untuned base model on zero-shot evaluation and surpassed zero-shot GPT-3 (175B parameters) on 20 of 25 evaluation tasks. FLAN also outperformed few-shot GPT-3 by large margins on benchmarks including ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies revealed that the number of fine-tuning tasks, model scale, and the use of natural language instructions were all critical to success. Notably, performance improvements from instruction tuning only emerged with sufficient model scale, suggesting that smaller models lacked the capacity to benefit from the approach.
T0, developed through the BigScience Workshop, explored whether zero-shot generalization could be achieved through explicit multitask prompted training.[4] The researchers compiled a collection called the Public Pool of Prompts (P3), which contained prompts for 170 English NLP datasets organized across 2,052 unique prompt templates. They fine-tuned a T5-based encoder-decoder model (T5+LM) on a subset of these tasks using multiple prompts per task.
T0 matched or exceeded the performance of GPT-3 on many benchmarks while being approximately 16 times smaller (11B vs. 175B parameters). This result demonstrated that a well-designed multitask prompted training regime could compensate for a significant gap in model scale. The work also highlighted the importance of prompt diversity: using multiple prompt templates per task improved generalization compared to using a single template.
InstructGPT combined instruction tuning with reinforcement learning from human feedback in a three-step pipeline:[5]
1. Supervised fine-tuning on human-written demonstrations of desired behavior for sampled prompts.
2. Training a reward model on human rankings of multiple model outputs for the same prompt.
3. Optimizing the fine-tuned model against the reward model with reinforcement learning (PPO).
The most striking result was that the 1.3B-parameter InstructGPT model was preferred by human evaluators over the 175B-parameter GPT-3, despite being over 100 times smaller. InstructGPT also showed improvements in truthfulness and reductions in toxic output generation. This work demonstrated that instruction tuning (step 1) combined with RLHF (steps 2 and 3) could dramatically improve model alignment with minimal performance regressions on standard NLP benchmarks.
Chung et al. (2022) scaled instruction tuning along three dimensions: the number of tasks, model size, and the inclusion of chain-of-thought (CoT) reasoning data.[6] Their work produced Flan-T5 and Flan-PaLM by fine-tuning T5 and PaLM on 1,836 tasks, a significant increase from the 62 tasks used in the original FLAN.
Key results included:
- Performance improved as both the number of fine-tuning tasks and the model size increased, with the two factors compounding.
- Including chain-of-thought data in the instruction mixture improved reasoning performance and enabled zero-shot chain-of-thought behavior, whereas omitting it degraded reasoning.
- Flan-PaLM 540B achieved state-of-the-art results at the time on several benchmarks, including 75.2% on five-shot MMLU.
The publicly released Flan-T5 checkpoints became widely adopted as strong general-purpose models for both research and applications.
The quality and diversity of instruction tuning data have a direct impact on the resulting model's capabilities. Datasets can be broadly categorized into human-crafted datasets and synthetically generated datasets.
| Dataset | Year | Size | Description |
|---|---|---|---|
| Natural Instructions (Mishra et al.) | 2022 | 193K instances, 61 tasks | One of the first instruction datasets, containing crowdsourced task instructions and instances.[7] |
| P3 / Public Pool of Prompts | 2022 | 2,052 prompts, 170 datasets | Prompt collection created for T0 training, covering a broad range of English NLP tasks.[4] |
| Super-Natural Instructions (Wang et al.) | 2022 | 5M instances, 1,616 tasks | A large-scale benchmark spanning 76 task types across 55 languages, with expert-written instructions for each task.[8] |
| FLAN Collection | 2022 | 1,836 tasks | Aggregation of multiple instruction datasets used to train Flan-T5 and Flan-PaLM.[6] |
| Dolly (Databricks) | 2023 | 15K instances | Human-generated instruction data covering seven task categories, released under a commercial-use license. |
| OpenAssistant (LAION) | 2023 | 161K messages, 35 languages | Crowd-sourced multi-turn conversation dataset with human quality ratings. |
| LIMA (Zhou et al.) | 2023 | 1,000 instances | A small but carefully curated dataset that demonstrated strong performance, supporting the "superficial alignment hypothesis."[9] |
| Dataset | Year | Size | Generation method |
|---|---|---|---|
| Self-Instruct (Wang et al.) | 2023 | 52K instructions | Bootstrapped from 175 seed examples using GPT-3's own generations.[10] |
| Alpaca (Taori et al.) | 2023 | 52K instances | Generated using text-davinci-003 based on the Self-Instruct pipeline, costing under $500.[11] |
| WizardLM / Evol-Instruct | 2023 | 70K instances | Progressively more complex instructions generated by ChatGPT using an evolutionary approach. |
| Orca (Mukherjee et al.) | 2023 | 1M instances | GPT-4 responses with detailed reasoning traces, distilled from FLAN Collection queries. |
| OpenOrca | 2023 | ~4.2M instances | Open-source reproduction of the Orca dataset, combining GPT-3.5 and GPT-4 completions. |
| UltraChat (Ding et al.) | 2023 | 1.5M dialogues | Multi-turn conversations generated through model self-play across diverse topics. |
Self-Instruct introduced a method for generating instruction tuning data from a model's own outputs, reducing the dependency on expensive human annotation.[10] The pipeline works as follows:
1. Start from a small seed set of 175 human-written tasks, each consisting of an instruction and an example instance.
2. Prompt the model with a few instructions sampled from the task pool to generate new instructions.
3. Ask the model to generate input-output instances for each new instruction.
4. Filter out low-quality and near-duplicate generations, add the survivors to the pool, and repeat; the resulting data is then used to fine-tune the original model.
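A highly simplified sketch of this bootstrapping loop is given below. The `complete` function stands in for a call to the base model's text-completion API, and the ROUGE-L overlap filter used in the paper is replaced with a crude string-similarity check; both are assumptions for illustration, not the authors' implementation.

```python
import random
from difflib import SequenceMatcher

def complete(prompt: str) -> str:
    """Placeholder for a call to the base model's text-completion API."""
    raise NotImplementedError

def too_similar(candidate: str, pool: list[str], threshold: float = 0.7) -> bool:
    """Crude stand-in for the ROUGE-L overlap filter used in the paper."""
    return any(
        SequenceMatcher(None, candidate, existing).ratio() > threshold
        for existing in pool
    )

def self_instruct(seed_instructions: list[str], target_size: int) -> list[dict]:
    """Grow an instruction pool from a small human-written seed set."""
    pool = list(seed_instructions)
    dataset = []
    while len(pool) < target_size:
        # 1. Show the model a few instructions from the pool and ask for a new one.
        examples = "\n".join(random.sample(pool, k=min(8, len(pool))))
        new_instruction = complete(
            f"Here are some tasks:\n{examples}\nCome up with a new task:"
        ).strip()
        # 2. Ask the model to produce an input/output instance for the new task.
        instance = complete(
            f"Task: {new_instruction}\nProvide an example input and output:"
        )
        # 3. Keep only instructions that are non-empty and not near-duplicates,
        #    then add them to the pool so later prompts draw on them too.
        if new_instruction and not too_similar(new_instruction, pool):
            pool.append(new_instruction)
            dataset.append({"instruction": new_instruction, "instance": instance})
    return dataset
```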
Applying Self-Instruct to vanilla GPT-3 produced a 33% absolute improvement on the Super-Natural Instructions benchmark, reaching performance comparable to InstructGPT-001 (which was trained with private user data and human annotations). Human evaluators found that GPT-3 fine-tuned with Self-Instruct data outperformed models trained on existing public instruction datasets by a large margin, leaving only a 5% absolute gap behind InstructGPT-001.
The Self-Instruct framework had a direct influence on Stanford's Alpaca project, which adapted the pipeline to generate 52,000 instruction-following examples from text-davinci-003 at a cost of under $500. This made instruction tuning accessible to research groups without large annotation budgets.[11]
Research has consistently shown that increasing the number and diversity of instruction tuning tasks improves model generalization. The original FLAN used 62 tasks; Flan-v2 scaled to 1,836 tasks, and the improvement in benchmark performance was substantial across model families and evaluation settings.[3][6]
Several factors are thought to contribute to this effect: broader task coverage increases the chance that an unseen task resembles something the model has already practiced, and varied instruction phrasings (as in the multiple prompt templates per task used by FLAN and T0) discourage the model from overfitting to any particular template wording.
However, the LIMA experiment by Zhou et al. (2023) offered a counterpoint: a model fine-tuned on just 1,000 carefully selected, high-quality examples achieved performance competitive with models trained on far larger datasets.[9] In a controlled human study, LIMA's responses were preferred over or equivalent to GPT-4 in 43% of cases. This result supported what the authors called the "superficial alignment hypothesis," which proposes that a model's knowledge and capabilities are almost entirely acquired during pretraining, and instruction tuning primarily teaches the model the format and style of desired outputs rather than new knowledge.
The practical implication is that both quantity and quality matter, but quality may matter more. A small number of well-chosen, diverse, high-quality examples can be surprisingly effective.
Instruction tuning and RLHF are complementary alignment techniques that serve different purposes in the LLM training pipeline.
| Aspect | Instruction tuning | RLHF |
|---|---|---|
| Training signal | Supervised: (instruction, output) pairs | Preference-based: human rankings of outputs |
| What it teaches | How to follow instructions and produce correct outputs | How to produce outputs that humans prefer |
| Data requirements | Demonstration data (input-output examples) | Comparison data (which output is better) |
| Optimization method | Standard cross-entropy loss | Reward model + policy gradient (e.g., PPO) |
| Typical position in pipeline | After pretraining | After instruction tuning |
| Strengths | Teaches task execution, factual accuracy, format compliance | Improves helpfulness, safety, tone, and reduces harmful outputs |
| Limitations | Cannot easily optimize for subjective preferences | More complex to implement, can cause reward hacking |
| Can be used independently | Yes | Yes, but usually builds on SFT |
In practice, most production LLMs use both techniques in sequence. InstructGPT established the now-standard three-stage pipeline: (1) supervised fine-tuning on instruction data, (2) reward model training from human preferences, and (3) reinforcement learning optimization.[5] More recent alternatives to RLHF, such as DPO (direct preference optimization), simplify the preference learning step by eliminating the need for a separate reward model, but instruction tuning remains the first alignment step in nearly all approaches.
It is worth noting that instruction tuning and RLHF are orthogonal: a model can be instruction-tuned without RLHF, trained with RLHF without instruction tuning, or trained with both. However, combining them typically yields the best results.
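To make the "standard cross-entropy loss" row above concrete, here is a minimal sketch of how the instruction tuning loss is commonly computed with PyTorch and a Hugging Face-style causal language model: prompt tokens are masked out so the loss covers only the response. The -100 ignore index follows the Hugging Face convention, and tokenizing the prompt and the full sequence separately is a simplification that can be off by a token at the boundary for some tokenizers.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, tokenizer, prompt: str, response: str) -> torch.Tensor:
    """Cross-entropy loss over the response tokens only (prompt tokens masked)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # ignore prompt positions in the loss

    logits = model(input_ids=full_ids).logits
    # Shift so that the token at position t predicts the token at t+1,
    # the usual causal language modeling objective.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```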
Full fine-tuning of large language models requires updating all model parameters, which is computationally expensive. Several parameter-efficient fine-tuning (PEFT) methods have been developed to make instruction tuning more accessible:
| Method | Description | Parameters updated |
|---|---|---|
| LoRA (Hu et al., 2022) | Adds low-rank decomposition matrices to attention layers; only these small matrices are trained. | ~0.1-1% of total |
| QLoRA (Dettmers et al., 2023) | Combines LoRA with 4-bit quantization of the base model to further reduce memory usage. | ~0.1-1% of total |
| Prefix tuning (Li and Liang, 2021) | Prepends trainable continuous vectors to the input at each layer. | <1% of total |
| Adapter layers (Houlsby et al., 2019) | Inserts small trainable bottleneck layers between existing transformer layers. | ~1-5% of total |
LoRA and QLoRA have become the most popular choices for instruction tuning in resource-constrained settings. QLoRA in particular enables instruction tuning of models with tens of billions of parameters on a single consumer GPU.
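As a rough sketch of what LoRA-based instruction tuning looks like in practice, the snippet below uses the Hugging Face `transformers` and `peft` libraries; the base model name and hyperparameters are illustrative, and argument names can differ between library versions. For QLoRA, the base model would additionally be loaded in 4-bit precision (e.g., via `BitsAndBytesConfig`) before attaching the adapters.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "meta-llama/Llama-2-7b-hf"   # illustrative choice of base model
model = AutoModelForCausalLM.from_pretrained(base_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# LoRA: freeze the base weights and train small low-rank adapter matrices
# injected into the attention projections.
lora_config = LoraConfig(
    r=8,                       # rank of the low-rank update matrices
    lora_alpha=16,             # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```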
Instruction tuning has several known limitations:
- It appears to teach mostly output format and style rather than new knowledge or capabilities, which are largely fixed at pretraining time (the superficial alignment hypothesis).[9]
- Model quality is bounded by the quality and diversity of the instruction data, and synthetic datasets can propagate the errors, biases, and stylistic quirks of the teacher model that generated them.
- Training a model to respond confidently to instructions it lacks the underlying knowledge to answer can increase hallucination.
- Constructing high-quality, diverse instruction data by hand is expensive, which motivated synthetic generation methods such as Self-Instruct.
- Performance gains only emerge at sufficient model scale, as observed in the original FLAN experiments.[3]
Instruction tuning is used across a wide range of applications, from general-purpose conversational assistants such as ChatGPT and Claude, to code assistants trained on instruction-style programming tasks, domain-adapted models in fields such as medicine and law, and multimodal models that follow instructions about images as well as text.