Self-Instruct
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,152 words
Self-Instruct is a semi-automated framework and accompanying dataset for aligning pretrained large language models with instruction-following behavior by bootstrapping training data from the model itself. Introduced in December 2022 by Yizhong Wang and collaborators at the University of Washington, the Allen Institute for AI, Tehran Polytechnic, Arizona State University, and Johns Hopkins University, the method begins from a small seed pool of 175 human-written tasks and uses a vanilla pretrained model (originally GPT-3 "davinci") to generate fresh instructions, classify their type, produce input-output instances, and filter the result with diversity heuristics.[1][2] The published artifact is a corpus of approximately 52,000 instructions paired with about 82,000 instances, released on GitHub under the Apache 2.0 license.[2] When the authors fine-tuned vanilla GPT-3 on the generated data, the resulting model improved by 33 percentage points on Super-NaturalInstructions over the base checkpoint and came within five points of OpenAI's InstructGPT-001 on a held-out human evaluation, establishing the recipe that Stanford's Alpaca would later use to train an open instruction-tuned model for roughly five hundred dollars in API costs.[1][3] The paper was accepted at the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023) in Toronto.[4]
| Field | Value |
|---|---|
| Title | Self-Instruct: Aligning Language Models with Self-Generated Instructions |
| Authors | Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi |
| Affiliations | University of Washington, Tehran Polytechnic, Arizona State University, Allen Institute for AI, Johns Hopkins University |
| First posted | 2022-12-20 (arXiv 2212.10560) |
| Venue | ACL 2023 Long Papers, pages 13484-13508 |
| DOI | 10.18653/v1/2023.acl-long.754 |
| Seed pool | 175 tasks (25 classification, 150 non-classification) |
| Released data | 52,445 instructions, 82,439 instances |
| Base model used | GPT-3 175B ("davinci" engine) |
| Code license | Apache-2.0 |
| Repository | github.com/yizhongw/self-instruct |
The paper's central claim is empirical rather than theoretical: that the latent ability of a sufficiently capable pretrained model to produce plausible task descriptions and worked examples can be reinvested as supervision for itself, replacing expensive human labeling with little loss in downstream quality.[1] Self-Instruct sits between two prior families of work. On one side stood human-curated instruction collections such as the Public Pool of Prompts feeding T0 and the broader instruction-tuning literature represented by FLAN and the Tk-Instruct line drawn from Super-NaturalInstructions.[1] On the other side stood proprietary pipelines like the one OpenAI used to build InstructGPT, which relied on customer prompts and contractor demonstrations that no outside group could replicate.[5] Self-Instruct splits the difference: 175 hand-written tasks plus API access to a strong but unaligned base model produce a corpus large enough to push that same base model close to the proprietary frontier.[1]
By late 2022 several research lines had converged on the idea that supervised fine-tuning on instruction-formatted data unlocks zero-shot generalization in large language models.[1] BigScience's T0, an 11B-parameter T5-based model trained on a manually curated prompt collection covering dozens of natural language processing datasets, demonstrated that prompted multitask training enabled new-task transfer.[6] Google's FLAN line and the Allen Institute's Tk-Instruct followed, the latter using the 1,616-task Super-NaturalInstructions benchmark as its training pool.[1] In parallel, OpenAI shipped InstructGPT (also called text-davinci-001 in its earliest deployed form) using a combination of human-written prompt-response demonstrations and Reinforcement Learning from Human Feedback.[5]
All of these efforts shared a bottleneck. The instruction corpora were either limited in surface diversity (academic NLP datasets reworded into prompts) or expensive to scale (paid contractor labor at OpenAI's scale).[1] Yizhong Wang and coauthors framed the problem in their abstract as a creativity ceiling: "large 'instruction-tuned' language models, finetuned on instructions and human feedback, have demonstrated a remarkable ability to generalize zero-shot to new tasks. Nevertheless, they depend heavily on human-written instruction data that is limited in quantity, diversity, and creativity, therefore hindering the generality of the tuned model."[1]
The first author, Yizhong Wang, was a PhD student in the Paul G. Allen School of Computer Science and Engineering at the University of Washington, advised by Hannaneh Hajishirzi and Noah A. Smith, both of whom also held appointments at the Allen Institute for AI.[7] Coauthor Yeganeh Kordi was at Tehran Polytechnic (Amirkabir University of Technology); Swaroop Mishra was at Arizona State University; Alisa Liu was a graduate student at UW; and Daniel Khashabi, previously at AI2, was a faculty member at Johns Hopkins University.[1][7] The intellectual lineage runs through the Allen Institute's earlier work on Natural Instructions and Super-NaturalInstructions, which provided the evaluation benchmark used in the paper.[1]
The first arXiv submission is dated 2022-12-20, about three weeks after ChatGPT's launch on 2022-11-30 and less than a month after OpenAI introduced text-davinci-003.[1] A revised v2 was posted 2023-05-25 to align with the ACL camera-ready.[1]
Self-Instruct is presented in the paper as a four-step pipeline operating on a growing pool of tasks.[1] Each task is a tuple of one natural-language instruction and one or more instances, where an instance contains an input (which may be empty) and the corresponding output. The pipeline starts with 175 hand-crafted tasks and iterates until a target dataset size is reached.[1][2]
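The task tuple described above can be sketched as a pair of small record types. This is a minimal illustration, not the schema of the released code: the class names `Task` and `Instance` and the `is_classification` and `source` fields are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instance:
    """One worked example for a task."""
    input: str   # may be the empty string for tasks that need no input
    output: str


@dataclass
class Task:
    """One instruction plus one or more instances (hypothetical schema)."""
    instruction: str
    instances: List[Instance] = field(default_factory=list)
    is_classification: bool = False
    source: str = "seed"  # "seed" for human-written, "machine" for generated


task = Task(
    instruction="Translate the sentence into French.",
    instances=[Instance(input="Good morning.", output="Bonjour.")],
)
print(task.instruction, len(task.instances))
```

Keeping the input as a possibly empty string, rather than an optional field, keeps downstream prompt formatting uniform for tasks with and without inputs.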
In each iteration the system samples eight instructions from the task pool as in-context demonstrations: six drawn from the original 175 human-written seed tasks and two drawn from previously machine-generated tasks already accepted into the pool.[1][8] These eight examples are concatenated into a prompt that asks the base language model to produce additional task instructions in the same style. The mix of human and machine examples is deliberate: drawing only from human seeds would bias the model toward the seed distribution, while drawing only from earlier generations would allow drift to compound.[1]
The base model used in the original paper is GPT-3's "davinci" engine at 175B parameters.[1][3] In the released code each generation call asks for several instructions at once and parses them out, amortizing the API cost.[2]
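The sampling scheme above (eight demonstrations, two of them machine-generated when available) might be sketched as follows. The function names, the prompt wording, and the choice to top up from the seed pool while the machine pool is still empty are assumptions of this sketch rather than details confirmed by the text.

```python
import random


def sample_demonstrations(seed_pool, machine_pool, n_total=8, n_machine=2, rng=None):
    """Pick in-context demonstrations: up to n_machine generated, the rest human-written."""
    rng = rng or random.Random()
    # Early on the machine pool is empty, so take what is available
    # and fill the remaining slots from the seed pool (assumption).
    k_machine = min(n_machine, len(machine_pool))
    demos = rng.sample(machine_pool, k_machine)
    demos += rng.sample(seed_pool, n_total - k_machine)
    rng.shuffle(demos)
    return demos


def build_generation_prompt(demos):
    """Number the demonstrations and leave the next slot open for the model."""
    lines = ["Come up with a series of tasks:", ""]
    for i, instruction in enumerate(demos, start=1):
        lines.append(f"Task {i}: {instruction}")
    lines.append(f"Task {len(demos) + 1}:")
    return "\n".join(lines)


seeds = [f"seed instruction {i}" for i in range(175)]
generated = ["generated instruction a", "generated instruction b", "generated instruction c"]
demos = sample_demonstrations(seeds, generated, rng=random.Random(0))
print(build_generation_prompt(demos))
```

Ending the prompt on an open, numbered slot is what lets a plain completion model continue with several new instructions in one call.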
A surprising practical finding documented in the paper is that classification tasks (where the output is drawn from a small label set) and open-ended generation tasks require different prompting strategies during instance generation.[1] To route each new instruction correctly, Self-Instruct prompts the base model in a few-shot manner with 12 classification instructions and 19 non-classification instructions taken from the seed pool, asking whether a given new instruction is or is not a classification task.[1][8] The router itself is the same base language model, which makes the entire pipeline runnable from a single API.
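A few-shot routing prompt of this kind could look roughly like the sketch below. The prompt wording and the helper names are illustrative assumptions; in the pipeline described above the labelled examples would be the 12 classification and 19 non-classification seed instructions.

```python
def build_classification_prompt(labelled_examples, new_instruction):
    """labelled_examples: list of (instruction, is_classification) pairs."""
    parts = [
        "Can the following task be regarded as a classification task "
        "with finite output labels?",
        "",
    ]
    for instruction, is_classification in labelled_examples:
        parts.append(f"Task: {instruction}")
        parts.append(f"Is it classification? {'Yes' if is_classification else 'No'}")
        parts.append("")
    # The final example is left unanswered for the model to complete.
    parts.append(f"Task: {new_instruction}")
    parts.append("Is it classification?")
    return "\n".join(parts)


def parse_classification_answer(completion):
    """Treat any completion that starts with 'yes' (case-insensitive) as classification."""
    return completion.strip().lower().startswith("yes")


examples = [
    ("Decide whether the review is positive or negative.", True),
    ("Write a short story about a dragon.", False),
]
prompt = build_classification_prompt(examples, "Label the email as spam or not spam.")
print(prompt)
```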
The third step asks the model to produce inputs and outputs for each new instruction. Two prompting templates handle the routing decided in Step 2.[1]
For non-classification tasks the system uses an input-first template: the model is asked to first hallucinate a plausible input for the instruction and then to write the matching output, so the output is conditioned on an input the model has already committed to.[1] For classification tasks the input-first approach has a known failure mode: the model tends to overproduce inputs that map to the most common label and ignore minority classes. Self-Instruct therefore uses an output-first template for classification: the label is sampled first from the inferred label set, and the input is generated conditional on that label, which balances class coverage.[1]
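The difference between the two templates is mostly a matter of which field the prompt leaves open first. The sketch below illustrates that ordering; the template wording and the round-robin label schedule are assumptions made for illustration, not the prompts from the paper.

```python
from itertools import cycle, islice


def input_first_prompt(instruction):
    """Non-classification: the model writes an input, then the output."""
    return f"Instruction: {instruction}\nInput:"


def output_first_prompt(instruction, label):
    """Classification: the label is fixed up front, the model writes a matching input."""
    return f"Instruction: {instruction}\nClass label: {label}\nInput:"


def balanced_labels(label_set, n):
    """Cycle through the labels so each class is requested about equally often."""
    return list(islice(cycle(label_set), n))


instruction = "Classify the sentiment of the sentence."
for label in balanced_labels(["positive", "negative", "neutral"], 6):
    print(output_first_prompt(instruction, label))
    print("---")
```

Fixing the label before the input is generated is what removes the majority-label bias: the model never gets the chance to pick the easy class on its own.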
Filtering is the load-bearing component that prevents collapse of diversity over hundreds of thousands of API calls.[1] Three filters operate.
The first is a ROUGE-L diversity threshold: a new candidate instruction is added to the task pool only if its ROUGE-L similarity with every existing instruction in the pool is less than 0.7.[1][8] This is the mechanism the paper relies on to keep the corpus from collapsing into paraphrases of the same handful of tasks.
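A self-contained sketch of this check is given below. It computes ROUGE-L as the F-measure of the longest common subsequence over lowercased, whitespace-split tokens; that tokenization and the function names are assumptions of the sketch, and a production run would more likely rely on an existing ROUGE library.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F-measure on whitespace tokens (simplified tokenization)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def is_novel(candidate, pool, threshold=0.7):
    """Accept a candidate only if it scores below the threshold against every pool entry."""
    return all(rouge_l_f1(candidate, existing) < threshold for existing in pool)


pool = ["Translate the sentence into French."]
print(is_novel("Translate the sentence into German.", pool))  # False (score 0.8)
print(is_novel("Write a haiku about autumn.", pool))          # True (score 0.0)
```

Because every candidate is compared against the whole pool, the cost of this check grows with the pool size, which is one reason later iterations become more expensive per accepted instruction.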
The second filter is a heuristic checklist: instructions are rejected if they contain certain keywords typically associated with unanswerable or visual tasks (for example, those asking the model to look at an image or pick up an object), if they are too short or too long, or if the input-output pair simply repeats the instruction.[1] The exact keyword list and the duplicate-input heuristic are visible in the released code.[2]
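The checklist can be sketched as a single predicate. The keyword tuple and the word-count limits below are placeholders chosen for illustration; they are not the list or thresholds from the released code, which should be consulted for the actual values.

```python
import re

# Illustrative keywords only; the real list lives in the released repository.
BLOCKED_KEYWORDS = ("image", "picture", "graph", "video", "audio", "draw", "plot")


def passes_heuristics(instruction, input_text="", output_text="",
                      min_words=3, max_words=150):
    """Return True if the candidate survives the length, keyword, and echo checks."""
    words = instruction.split()
    if len(words) < min_words or len(words) > max_words:
        return False
    lowered = instruction.lower()
    # Word boundaries avoid false hits such as "graph" inside "paragraph".
    if any(re.search(rf"\b{re.escape(k)}\b", lowered) for k in BLOCKED_KEYWORDS):
        return False
    # Reject instances whose input or output merely repeats the instruction.
    normalized = instruction.strip().lower()
    if normalized in (input_text.strip().lower(), output_text.strip().lower()):
        return False
    return True


print(passes_heuristics("Describe the image shown above."))
print(passes_heuristics("Summarize the following paragraph.", "Some text.", "A summary."))
```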
The third filter removes duplicate instances: an instance is dropped if it exactly matches an existing one or shares its input with an existing instance.[1] Together, the three filters shrink the raw API output considerably before it becomes the final dataset.
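Deduplication by input can be sketched in a few lines. The sketch assumes it is applied to the instances of one instruction at a time and that instances are plain `(input, output)` pairs; both are simplifications introduced here.

```python
def dedup_instances(instances):
    """Keep the first instance for each distinct input, preserving order.

    instances: list of (input, output) pairs belonging to one instruction.
    """
    seen_inputs = set()
    kept = []
    for input_text, output_text in instances:
        key = input_text.strip()
        if key in seen_inputs:
            # Same input already kept: drop it, whether or not the output differs.
            continue
        seen_inputs.add(key)
        kept.append((input_text, output_text))
    return kept


raw = [("2 + 2", "4"), ("2 + 2", "5"), ("3 + 3", "6"), ("2 + 2", "4")]
print(dedup_instances(raw))
```

Dropping later instances that reuse an input also removes contradictory pairs, where the model produced two different outputs for the same input.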
The four steps repeat. Each iteration enlarges the task pool, the pool is sampled to seed the next iteration's prompts, and the process continues until the target dataset size or the API budget is reached. The paper reports that the procedure yielded 52,445 unique instructions paired with 82,439 instances, of which 11,584 are classification tasks and 40,861 are non-classification tasks.[1][8]
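The overall loop can be sketched with the model call and the novelty check passed in as functions. Everything here is a hedged outline: `bootstrap`, `make_stub_generator`, and the stopping parameters are hypothetical names, and the stub generator stands in for a real API call so the example runs offline.

```python
import random


def bootstrap(seed_pool, generate, is_novel, target_size, max_iterations=1000, rng=None):
    """Grow a pool of machine-generated instructions until target_size is reached."""
    rng = rng or random.Random(0)
    machine_pool = []
    for _ in range(max_iterations):
        if len(machine_pool) >= target_size:
            break
        # Eight demonstrations: up to two generated, the rest from the seeds.
        k_machine = min(2, len(machine_pool))
        demos = rng.sample(machine_pool, k_machine) + rng.sample(seed_pool, 8 - k_machine)
        for candidate in generate(demos):
            if is_novel(candidate, seed_pool + machine_pool):
                machine_pool.append(candidate)
    return machine_pool[:target_size]


def make_stub_generator():
    """Stand-in for the language model: emits three candidates per call, with repeats."""
    counter = {"n": 0}

    def generate(demos):
        out = []
        for _ in range(3):
            counter["n"] += 1
            out.append(f"synthetic task {counter['n'] // 2}")
        return out

    return generate


seeds = [f"seed task {i}" for i in range(175)]
result = bootstrap(
    seeds,
    generate=make_stub_generator(),
    is_novel=lambda candidate, pool: candidate not in pool,  # exact match as a stand-in
    target_size=5,
)
print(result)
```

In a real run the `is_novel` argument would be the ROUGE-L check and `generate` would wrap the prompt construction, the API call, and the heuristic filters.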
The released corpus, hosted in the public yizhongw/self-instruct repository, contains the model-generated instructions in JSONL form alongside the seed tasks and reformatted fine-tuning splits.[2] The seed file seed_tasks.jsonl is the literal 175-task pool that bootstraps the pipeline and was later inherited verbatim by Stanford Alpaca and several other downstream projects.[2][9] The model-generated portion, in data/gpt3-generations/batch_221203/all_instances_82K.jsonl, contains the 82K instances; a reformatted version suitable for fine-tuning lives under data/finetuning/self_instruct_221203.[2] The repository also ships 252 expert-written user-oriented evaluation instructions covering 119 different domains, used for the paper's human evaluation.[2] The code and data carry the Apache-2.0 license.[2]
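Since the files are distributed as JSONL, they can be loaded one record per line. The sketch below assumes each record is a JSON object and that seed records carry a boolean `is_classification` field; that field name is an assumption and should be checked against the actual files.

```python
import json


def read_jsonl(path):
    """Load a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def count_classification(tasks):
    """Count records flagged as classification (field name is an assumption)."""
    return sum(1 for task in tasks if task.get("is_classification"))


# Example usage, assuming a local copy of the repository:
# tasks = read_jsonl("seed_tasks.jsonl")
# print(len(tasks), count_classification(tasks))
```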
The authors annotated 200 random instructions to estimate the quality of the corpus and reported that 46 percent of the data points had one or more problems, ranging from incorrect outputs to instructions paired with unrelated inputs.[2] They recommended caution and explicitly invited the community to develop improved filtering methods.[2]
The paper evaluates Self-Instruct on two settings: the Super-NaturalInstructions benchmark, which measures generalization to a held-out set of NLP tasks, and a hand-curated set of 252 user-oriented prompts evaluated by human raters.[1]
On the Super-NaturalInstructions test split, the paper reports the following ROUGE-L scores; ROUGE-L is the standard SuperNI metric.[1][8]
| Model | Parameters | SuperNI ROUGE-L |
|---|---|---|
| Vanilla GPT-3 (davinci) | 175B | 6.8 |
| T0 | 11B | 33.1 |
| GPT-3 + Self-Instruct | 175B | 39.9 |
| InstructGPT-001 | 175B | 40.8 |
The headline number from the paper is the 33-point absolute improvement that Self-Instruct training delivered over vanilla GPT-3, together with the fact that the resulting model came within 0.9 points of OpenAI's then-current InstructGPT-001.[1][3] T0, at 11B parameters, is the closest publicly available instruction-tuned model in the comparison; it is substantially smaller, so the comparison across parameter counts is not strictly like for like.[1]
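The differences quoted above follow directly from the table. A short check, using only the four scores listed there:

```python
# ROUGE-L scores copied from the table above.
scores = {
    "Vanilla GPT-3 (davinci)": 6.8,
    "T0": 33.1,
    "GPT-3 + Self-Instruct": 39.9,
    "InstructGPT-001": 40.8,
}

self_instruct = scores["GPT-3 + Self-Instruct"]
gain_over_base = round(self_instruct - scores["Vanilla GPT-3 (davinci)"], 1)
gap_to_instructgpt = round(scores["InstructGPT-001"] - self_instruct, 1)
lead_over_t0 = round(self_instruct - scores["T0"], 1)

print(gain_over_base)       # 33.1
print(gap_to_instructgpt)   # 0.9
print(lead_over_t0)         # 6.8
```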
For tasks more representative of real user prompts the authors collected 252 instructions over 119 application domains and asked four human annotators to rate each model's responses on a four-point scale (A best to D worst).[1] Human raters preferred GPT-3 trained with Self-Instruct to GPT-3 fine-tuned on the T0 training data or on the Super-NaturalInstructions training set by a wide margin, and the gap between Self-Instruct GPT-3 and InstructGPT-001 narrowed to roughly five points.[1] The authors interpreted this gap as evidence that self-generated data can substantially close, although not eliminate, the distance to closed-source proprietary alignment pipelines.[1]
The supervised fine-tuning was performed via the OpenAI fine-tuning API on the "davinci" engine, using two epochs with default hyperparameters except that the prompt-loss weight was set to zero so that the loss was computed only on the target outputs.[1][3] The paper does not report a dollar figure for the fine-tuning itself; the cost of the data generation step was approximately six hundred dollars in OpenAI credits according to community write-ups, with Stanford Alpaca later replicating a comparable pipeline for under five hundred dollars using text-davinci-003.[9]
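Setting the prompt-loss weight to zero amounts to masking the prompt tokens out of the training objective. The sketch below illustrates the idea on a list of per-token losses; the function name and the normalization by total weight are assumptions, since the internals of the fine-tuning API are not described in the text.

```python
def weighted_sequence_loss(token_losses, n_prompt_tokens, prompt_loss_weight=0.0):
    """Average per-token loss, down-weighting the first n_prompt_tokens positions.

    With prompt_loss_weight=0.0 only the completion (target) tokens contribute.
    Normalizing by the sum of weights is an assumption of this sketch.
    """
    n_completion = len(token_losses) - n_prompt_tokens
    weights = [prompt_loss_weight] * n_prompt_tokens + [1.0] * n_completion
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    return sum(w * loss for w, loss in zip(weights, token_losses)) / total_weight


# Two prompt tokens (loss 4.0 each) followed by two target tokens (1.0 and 3.0).
losses = [4.0, 4.0, 1.0, 3.0]
print(weighted_sequence_loss(losses, n_prompt_tokens=2))                          # 2.0
print(weighted_sequence_loss(losses, n_prompt_tokens=2, prompt_loss_weight=1.0))  # 3.0
```

With the weight at zero the model is not rewarded for reproducing the instruction text, only for producing the target output given that instruction.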
Self-Instruct moved very quickly from research artifact to recipe. Within four months of the arXiv release, the same 175 seed tasks were the starting point for a sequence of widely used open instruction-tuned models, and the underlying pipeline diversified into several research subgenera.
The most visible adopter was Stanford's CRFM Alpaca project, released 2023-03-13. Alpaca explicitly adopts the Self-Instruct pipeline. The Stanford team wrote that it "started with the 175 human-written instruction-output pairs from the self-instruct seed set" and then used the OpenAI API to prompt OpenAI's text-davinci-003 for additional examples, choosing the newer model over the davinci endpoint that the original paper had used.[9] Three modifications to the pipeline are documented: a clearer batch-generation prompt that asks for twenty instructions at a time, removal of the classification-task vs non-classification-task distinction with a single unified prompt, and producing only one instance per instruction rather than two or three.[9] The output is a corpus of 52,000 examples (the alpaca_data.json file) generated for under five hundred dollars in API costs, used to fine-tune a LLaMA 7B base model on eight A100 GPUs in roughly three hours.[9]
Alpaca's dataset became a de facto benchmark in its own right, and its quirks (including frequent references to a January 2023 knowledge cutoff inherited from text-davinci-003) propagated into many derivative datasets.[9]
Vicuna, released in March 2023 by a team from UC Berkeley, CMU, Stanford, UC San Diego, and the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), fine-tuned LLaMA on roughly seventy thousand conversations collected from ShareGPT user uploads.[10] Vicuna's data source is different from Self-Instruct's (human-shared dialogues rather than seed-based bootstrapping), but the underlying philosophy of training an open model on instruction-style data harvested from a stronger system follows the paradigm that Self-Instruct popularized.[10]
Databricks took the opposite turn with Dolly 2.0, released 2023-04-12. The accompanying databricks-dolly-15k dataset is a 15,000-example human-generated instruction corpus contributed by more than five thousand Databricks employees over March and April 2023.[11] Dolly was explicitly positioned as a commercially licensable alternative to the OpenAI-generated Alpaca data, whose terms of use prevented using it to train models competing with OpenAI's services.[11] In that sense Dolly 2.0 is both a complement to and a critique of the Self-Instruct synthetic-data paradigm: it shows that 15K hand-written examples can produce a usable instruction-tuned model without depending on a proprietary distillation source.[11]
Code Alpaca, released by Sahil Chaudhary in March 2023, adapted the Self-Instruct pipeline to the code domain. Its 20K-example dataset (code_alpaca_20k.json) was generated by prompting text-davinci-003 with seed tasks rewritten to focus on code generation, editing, and optimization, using essentially the Alpaca pipeline minus the classification distinction.[12] Code Alpaca's training data cost was under two hundred dollars.[12]
The CAMEL project (Communicative Agents for "Mind" Exploration of Large Language Model Society), arXiv:2303.17760, generalized the bootstrapping idea to multi-agent role-play. Instead of seeding from 175 human-written tasks, CAMEL used inception prompting to have one chat agent take the role of a user and another the role of an assistant, producing conversational instruction data covering software engineering and other domains.[13]
WizardLM, introduced by Xu et al. in April 2023 (arXiv:2304.12244, ICLR 2024), proposed Evol-Instruct as a successor to Self-Instruct's diversity heuristics.[14] Where Self-Instruct relies on the base model's stylistic variation across prompts and the ROUGE-L threshold to grow the corpus, Evol-Instruct uses a small fixed set of "evolution" operators (adding constraints, increasing reasoning depth, concretizing abstract steps, and so on) that an LLM applies repeatedly to existing instructions to make them more complex.[14] The paper reports that WizardLM, trained on Evol-Instruct data, outperforms Alpaca and Vicuna on a complex-instruction benchmark.[14] The follow-up WizardCoder (arXiv:2306.08568) ports the same idea to the code domain.[15]
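The contrast with Self-Instruct can be made concrete with a toy version of the evolution loop. The operator names and prompt wording below are paraphrases invented for illustration, not the prompts from the WizardLM paper, and `rewrite` is a stand-in for a call to a language model.

```python
import random

# Illustrative operators, loosely following the categories named above.
EVOLUTION_OPERATORS = {
    "add_constraints": "Add one more constraint or requirement to the instruction.",
    "deepen": "Increase the depth of reasoning the instruction requires.",
    "concretize": "Replace general concepts with more specific ones.",
}


def evolve_prompt(instruction, operator):
    """Build a rewriting prompt for one evolution step."""
    return (
        f"{EVOLUTION_OPERATORS[operator]}\n\n"
        f"Original instruction: {instruction}\n"
        "Rewritten instruction:"
    )


def evolve(instruction, rewrite, rounds=3, rng=None):
    """Apply randomly chosen operators repeatedly; return the full history."""
    rng = rng or random.Random(0)
    history = [instruction]
    for _ in range(rounds):
        operator = rng.choice(sorted(EVOLUTION_OPERATORS))
        history.append(rewrite(evolve_prompt(history[-1], operator)))
    return history


def stub_rewrite(prompt):
    """Offline stand-in for the model: echoes the instruction with a marker."""
    original = prompt.splitlines()[2].replace("Original instruction: ", "")
    return original + " (harder)"


print(evolve("Sort a list of numbers.", stub_rewrite, rounds=2))
```

Where Self-Instruct grows the pool sideways by sampling new instructions, a loop of this shape grows it in depth by repeatedly transforming instructions that already exist.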
A separate strand, Principle-Driven Self-Alignment by Sun et al. (arXiv:2305.03047), uses Self-Instruct-style bootstrapping not to generate task data but to instantiate a small handcrafted set of behavioral principles into a much larger training corpus.[16] The resulting Dromedary model, trained from a base LLaMA, requires fewer than 300 lines of human annotation, including 16 generic principles and 5 in-context exemplars.[16] The same authors later extended the approach with SALMON (arXiv:2310.05910), which trains a reward model that can be steered by human-defined principles; its Dromedary-2 model uses only 6 in-context exemplars and 31 principles as human supervision.[17]
Related uses of the same intuition include Self-Refine (Madaan et al., arXiv:2303.17651, NeurIPS 2023), which uses a single language model to generate, critique, and revise its own outputs in a refinement loop.[18] Self-Refine is not a data-generation pipeline like Self-Instruct, but it shares the "model-as-its-own-supervisor" principle and is frequently cited alongside it as a sibling technique.[18]
The Allen Institute for AI, where several of the Self-Instruct authors are affiliated, ran a sequence of open-instruct experiments and released a series of Tülu models trained on mixtures of human-written and Self-Instruct-style synthetic data, providing some of the most carefully ablated evidence that a fraction of bootstrapped instructions contributes meaningfully to downstream performance.[7]
Self-Instruct had an outsized influence on the open-source instruction tuning ecosystem. By the spring of 2023 most prominent open instruction-tuned models that did not depend on user-contributed conversations were trained on data generated by Self-Instruct or a Self-Instruct derivative.[9][12][14] The Alpaca dataset alone was forked thousands of times and translated into dozens of languages.[9]
Three structural contributions stand out.
First, the paper demonstrated that a seed pool of only 175 tasks is sufficient. The viability of bootstrapping at this scale established that a small research group, without access to a labeled corpus or a fleet of contractors, could produce instruction data competitive with what frontier labs were generating internally.[1] This shifted the cost structure of alignment work and was a precondition for the wave of small-budget instruction-tuned models that followed.[9][11]
Second, Self-Instruct made the role of synthetic data in alignment legible. Earlier instruction-tuning work had treated data as a fixed asset (curated by humans or harvested from existing datasets); Self-Instruct made the data-generation pipeline itself an object of research, opening the door to follow-on work on filtering, evolution operators, and principle-driven generation.[14][16]
Third, the in-context learning format used in Self-Instruct's generation prompts (eight demonstrations, six from the human pool and two from the machine pool) became a small template that subsequent open-source pipelines, including Alpaca's, copied or simplified.[9]
The paper and the associated GitHub release are unusually candid about limitations.
The authors annotated a random sample of two hundred generated instructions and found that 46 percent had at least one problem: the instruction could be ambiguous, the input could mismatch the instruction, or the output could be incorrect.[2] They note that "most of the generated instructions are meaningful, while the generated instances may contain more noise (to a reasonable extent)" and recommend caution when using the corpus.[2]
The ROUGE-L 0.7 threshold has a known failure mode: it controls surface-form diversity but not semantic diversity, so two instructions that paraphrase the same underlying task in different vocabulary can both pass.[1] As the corpus grows the proportion of new candidate instructions accepted shrinks, which both limits the practical scale of a Self-Instruct run and means later iterations spend most of their cost generating duplicates that are then discarded.[1]
Because all generated examples descend from the base model's prior, Self-Instruct inherits the biases of GPT-3 in both topic distribution and stylistic register.[1] The paper acknowledges that the long tail of rare instruction types is underrepresented, since these are precisely the cases for which the base model has weak priors.[1] Critics have argued that this risks reinforcing the existing distribution of language model behavior rather than expanding it.
A more general critique that emerged after Stanford Alpaca's release is that Self-Instruct, when run against a proprietary model such as text-davinci-003 or ChatGPT, functionally distills the proprietary model into the open one without paying the original training cost.[9] This raises both legal questions about OpenAI's terms of use (which Alpaca's release page explicitly flagged) and methodological questions about whether benchmarks that reward proprietary-style outputs are measuring genuine instruction-following or imitation of a particular vendor.[9]
ROUGE-L on Super-NaturalInstructions is the metric reported in the paper, but the broader community has documented that ROUGE-L is noisy at distinguishing instruction-following quality and can be gamed by output length and surface-form similarity.[1] Subsequent evaluation harnesses such as AlpacaEval and human-vote arenas were developed in part to address this shortcoming.
| Work | arXiv | Relation |
|---|---|---|
| InstructGPT (Ouyang et al.) | 2203.02155 | Proprietary baseline that Self-Instruct approaches with open data |
| T0 (Sanh et al.) | 2110.08207 | Earlier instruction-tuned model used as a non-proprietary comparison |
| Alpaca (Stanford CRFM) | n/a (blog) | Direct application of Self-Instruct using text-davinci-003 |
| Vicuna (LMSys et al.) | n/a (blog) | Instruction tuning on ShareGPT conversations rather than seed-based bootstrap |
| Dolly 2.0 (Databricks) | n/a (blog) | Human-written counterpoint to synthetic instruction data |
| Code Alpaca (Chaudhary) | n/a (repo) | Self-Instruct pipeline restricted to code-domain seeds |
| CAMEL (Li et al.) | 2303.17760 | Multi-agent role-play generalization |
| WizardLM Evol-Instruct (Xu et al.) | 2304.12244 | Replaces ROUGE-L diversity with evolution operators |
| WizardCoder (Luo et al.) | 2306.08568 | Evol-Instruct applied to code |
| Self-Align / Dromedary (Sun et al.) | 2305.03047 | Principle-driven variant requiring fewer seeds |
| SALMON (Sun et al.) | 2310.05910 | Reward modeling sibling of Self-Align |
| Self-Refine (Madaan et al.) | 2303.17651 | Same-model self-feedback at inference time |