# Self-Instruct

> Source: https://aiwiki.ai/wiki/self_instruct
> Updated: 2026-06-27
> Categories: Data & Datasets, Large Language Models, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Self-Instruct** is a semi-automated framework for aligning a pretrained [large language model](/wiki/large_language_model) with instruction-following behavior by bootstrapping its training data from the model itself, introduced in December 2022 by Yizhong Wang and collaborators.[1][2] The method begins from a small seed pool of 175 human-written tasks, prompts a vanilla pretrained model (originally [GPT-3](/wiki/gpt-3) "davinci") to generate fresh instructions, inputs, and outputs, filters the result for quality and diversity, and fine-tunes on the resulting corpus of roughly 52,000 instructions.[1][2] When the authors fine-tuned vanilla GPT-3 on the generated data, the model improved by 33 percentage points on Super-NaturalInstructions over the base checkpoint and came within five points of OpenAI's [InstructGPT](/wiki/instructgpt)-001 on a held-out human evaluation, establishing the [synthetic data](/wiki/synthetic_data) recipe that Stanford's [Alpaca](/wiki/alpaca) would later use to train an open instruction-tuned model for under five hundred dollars in API costs.[1][3][9]

The authors summarize the core result directly: "Applying our method to the vanilla GPT3, we demonstrate a 33% absolute improvement over the original model on Super-NaturalInstructions, on par with the performance of InstructGPT-001."[1] The published artifact is a corpus of approximately 52,000 instructions paired with about 82,000 instances, released on GitHub under the Apache 2.0 license.[2] The paper was accepted at the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023) in Toronto.[4]

## What is Self-Instruct?

| Field | Value |
|---|---|
| Title | Self-Instruct: Aligning Language Models with Self-Generated Instructions |
| Authors | Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi |
| Affiliations | University of Washington, Tehran Polytechnic, Arizona State University, Allen Institute for AI, Johns Hopkins University |
| First posted | 2022-12-20 (arXiv 2212.10560) |
| Venue | ACL 2023 Long Papers, pages 13484-13508 |
| DOI | 10.18653/v1/2023.acl-long.754 |
| Seed pool | 175 tasks (25 classification, 150 non-classification) |
| Released data | 52,445 instructions, 82,439 instances |
| Base model used | GPT-3 175B ("davinci" engine) |
| Code license | Apache-2.0 |
| Repository | github.com/yizhongw/self-instruct |

The paper's central claim is empirical rather than theoretical: that the latent ability of a sufficiently capable pretrained model to produce plausible task descriptions and worked examples can be reinvested as supervision for itself, replacing expensive human labeling without sacrificing downstream quality.[1] Self-Instruct sits between two prior families of work. On one side stood human-curated instruction collections such as the Public Pool of Prompts feeding T0 and the broader instruction-tuning literature represented by FLAN and the Tk-Instruct line drawn from Super-NaturalInstructions.[1] On the other side stood proprietary pipelines like the one OpenAI used to build InstructGPT, which relied on customer prompts and contractor demonstrations that no outside group could replicate.[5] Self-Instruct splits the difference: 175 hand-written tasks plus API access to a strong but unaligned base model produce a corpus large enough to push that same base model close to the proprietary frontier.[1]

### ELI5: What is Self-Instruct in simple terms?

Imagine you want to teach a very well-read but unfocused student to follow instructions, but you only have time to write 175 example assignments yourself. Instead of paying tutors to write thousands more, you ask the student to invent new assignments in the same style, answer them, throw away the lazy or repetitive ones, and then study from the pile that remains. That is Self-Instruct: a language model writes its own homework, grades it roughly, keeps the good parts, and learns from them. The surprising finding is that this self-taught student ends up nearly as good at following instructions as one trained on far more expensive human-written material.[1]

## What problem did instruction tuning face before Self-Instruct?

By late 2022 several research lines had converged on the idea that supervised fine-tuning on instruction-formatted data unlocks zero-shot generalization in large language models.[1] BigScience's T0, an 11B-parameter [T5](/wiki/t5)-based model trained on a manually curated prompt collection covering dozens of [natural language processing](/wiki/natural_language_processing) datasets, demonstrated that prompted multitask training enabled new-task transfer.[6] Google's FLAN line and the Allen Institute's Tk-Instruct followed, the latter using the 1,616-task Super-NaturalInstructions benchmark as its training pool.[1] In parallel, OpenAI shipped InstructGPT (also called text-davinci-001 in its earliest deployed form) using a combination of human-written prompt-response demonstrations and [Reinforcement Learning from Human Feedback](/wiki/rlhf).[5]

All of these efforts shared a bottleneck. The instruction corpora were either limited in surface diversity (academic NLP datasets reworded into prompts) or expensive to scale (paid contractor labor at OpenAI's scale).[1] Yizhong Wang and coauthors framed the problem in their abstract as a creativity ceiling: "large 'instruction-tuned' language models, finetuned on instructions and human feedback, have demonstrated a remarkable ability to generalize zero-shot to new tasks. Nevertheless, they depend heavily on human-written instruction data that is limited in quantity, diversity, and creativity, therefore hindering the generality of the tuned model."[1]

### Who created Self-Instruct?

The first author, Yizhong Wang, was a PhD student in the Paul G. Allen School of Computer Science and Engineering at the University of Washington, advised by Hannaneh Hajishirzi and Noah A. Smith, both of whom also held appointments at the Allen Institute for AI.[7] Coauthor Yeganeh Kordi was at Tehran Polytechnic (Amirkabir University of Technology); Swaroop Mishra was at Arizona State University; Alisa Liu was a graduate student at UW; and Daniel Khashabi, previously at AI2, was a faculty member at Johns Hopkins University.[1][7] The intellectual lineage runs through the Allen Institute's earlier work on Natural Instructions and Super-NaturalInstructions, which provided the evaluation benchmark used in the paper.[1]

The first arXiv submission is dated 2022-12-20, about three weeks after ChatGPT's launch and less than a month after OpenAI introduced text-davinci-003.[1] A revised v2 was posted 2023-05-25 to align with the ACL camera-ready.[1]

## How does Self-Instruct work?

Self-Instruct is presented in the paper as a four-step pipeline operating on a growing pool of tasks.[1] Each task is a tuple of one natural-language instruction and one or more instances, where an instance contains an input (which may be empty) and the corresponding output. The pipeline starts with 175 hand-crafted tasks and iterates until a target dataset size is reached.[1][2]
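The task structure described above can be written down as a small data model. This is a minimal sketch for illustration: the class names, the `is_seed` flag, and the example task are assumptions, not taken from the paper or the released code.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One worked example: an input (possibly empty) and its output."""
    input: str
    output: str


@dataclass
class Task:
    """One natural-language instruction plus one or more instances."""
    instruction: str
    instances: list[Instance] = field(default_factory=list)
    is_seed: bool = False  # hypothetical flag: True for human-written seed tasks


# A made-up task with a single instance, for illustration only.
task = Task(
    instruction="Give a synonym for the word.",
    instances=[Instance(input="happy", output="joyful")],
    is_seed=True,
)
```

An instruction that needs no separate input would simply carry `input=""` in its instances.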

### Step 1: How are new instructions generated?

In each iteration the system samples eight instructions from the task pool as in-context demonstrations: six drawn from the original 175 human-written seed tasks and two drawn from previously machine-generated tasks already accepted into the pool.[1][8] These eight examples are concatenated into a prompt that asks the base language model to produce additional task instructions in the same style. The mix of human and machine examples is deliberate: drawing only from human seeds would bias the model toward the seed distribution, while drawing only from earlier generations would allow drift to compound.[1]
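The six-plus-two sampling can be sketched as follows. The function names and the prompt wording are illustrative assumptions. So is the fallback of filling the gap with extra seed tasks while the machine pool is still small, which the text above does not specify.

```python
import random


def sample_demonstrations(seed_pool, machine_pool, total=8, max_machine=2, rng=None):
    """Pick `total` in-context instructions: up to `max_machine` machine-written,
    the rest human-written seeds."""
    rng = rng or random.Random()
    machine = rng.sample(machine_pool, min(max_machine, len(machine_pool)))
    # Assumption: when fewer than two machine tasks exist, top up from the seeds.
    human = rng.sample(seed_pool, total - len(machine))
    demos = human + machine
    rng.shuffle(demos)
    return demos


def build_prompt(demos):
    """Number the demonstrations and leave the next slot open for the model."""
    lines = ["Come up with a series of tasks:", ""]
    for i, instruction in enumerate(demos, start=1):
        lines.append(f"Task {i}: {instruction}")
    lines.append(f"Task {len(demos) + 1}:")
    return "\n".join(lines)
```

With the default arguments the prompt ends in `Task 9:`, inviting the model to continue the list with new instructions.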

The base model used in the original paper is GPT-3's "davinci" engine at 175B parameters.[1][3] In the released code each generation call asks for several instructions at once and parses them out, amortizing the API cost.[2]

### Step 2: How does Self-Instruct identify classification tasks?

A surprising practical finding documented in the paper is that classification tasks (where the output is drawn from a small label set) and open-ended generation tasks require different prompting strategies during instance generation.[1] To route each new instruction correctly, Self-Instruct prompts the base model in a few-shot manner with 12 classification instructions and 19 non-classification instructions taken from the seed pool, asking whether a given new instruction is or is not a classification task.[1][8] The router itself is the same base language model, which makes the entire pipeline runnable from a single API.
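A few-shot routing prompt of this kind might look like the sketch below. The question wording, the Yes/No answer format, and both function names are assumptions made for illustration. In the setup described above, `labeled_examples` would hold the 12 classification and 19 non-classification seed instructions.

```python
def build_classification_prompt(labeled_examples, new_instruction):
    """Build a few-shot prompt from (instruction, is_classification) pairs."""
    lines = ["Can the following task be regarded as a classification task "
             "with finite output labels?", ""]
    for instruction, is_classification in labeled_examples:
        lines.append(f"Task: {instruction}")
        lines.append(f"Is it classification? {'Yes' if is_classification else 'No'}")
        lines.append("")
    # The final question is left unanswered for the model to complete.
    lines.append(f"Task: {new_instruction}")
    lines.append("Is it classification?")
    return "\n".join(lines)


def parse_answer(completion):
    """Interpret the model's completion as a boolean routing decision."""
    return completion.strip().lower().startswith("yes")
```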

### Step 3: How are inputs and outputs generated?

The third step asks the model to produce inputs and outputs for each new instruction. Two prompting templates handle the routing decided in Step 2.[1]

For non-classification tasks the system uses an **input-first** template: the model is asked to first invent a plausible input for the instruction and then to write the matching output. This produces realistic-looking inputs, with the output conditioned on whatever input the model chose.[1] For classification tasks the input-first approach has a known failure mode: the model tends to overproduce inputs that map to the most common label and ignore minority classes. Self-Instruct therefore uses an **output-first** template for classification: the label is sampled first from the inferred label set, and the input is generated conditional on that label, which balances class coverage.[1]
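The difference between the two templates comes down to which field the prompt leaves open first. The sketch below uses simplified, made-up prompt prefixes rather than the paper's actual prompts, and the parser assumes the completion follows a `label` then `Input:` layout.

```python
def instance_prompt(instruction, is_classification):
    """Return an illustrative prompt prefix for instance generation."""
    if is_classification:
        # Output-first: the model writes a class label, then an input for it.
        return f"Task: {instruction}\nClass label:"
    # Input-first: the model writes an input, then the matching output.
    return f"Task: {instruction}\nInput:"


def parse_output_first(completion):
    """Split an output-first completion into an (input, output) pair."""
    label, _, rest = completion.partition("\nInput:")
    return rest.strip(), label.strip()
```

Because the label is fixed before the input is written, a caller can request one completion per label and so cover minority classes explicitly.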

### Step 4: How does filtering keep the data diverse?

Filtering is the load-bearing component that prevents collapse of diversity over hundreds of thousands of API calls.[1] Three filters operate.

The first is a [ROUGE-L](/wiki/rouge_score) diversity threshold: a new candidate instruction is added to the task pool only if its ROUGE-L similarity with every existing instruction in the pool is less than 0.7.[1][8] This is the mechanism the paper relies on to keep the corpus from collapsing into paraphrases of the same handful of tasks.
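The threshold test can be sketched in pure Python. This version computes ROUGE-L as the F-measure of the longest common subsequence over lowercased, whitespace-split tokens. That is a simplification, and the released pipeline may tokenize or score differently. The function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F-measure on whitespace tokens (simplified tokenization)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def is_novel(candidate, pool, threshold=0.7):
    """Accept a candidate only if it scores below the threshold against every pool entry."""
    return all(rouge_l_f1(candidate, existing) < threshold for existing in pool)
```

For example, "Translate the following sentence into German." shares five of its six tokens, in order, with "Translate the following sentence into French.", which gives a score of about 0.83, so the candidate would be rejected.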

The second filter is a heuristic checklist: instructions are rejected if they contain certain keywords typically associated with unanswerable or visual tasks (for example, those asking the model to look at an image or pick up an object), if they are too short or too long, or if the input-output pair simply repeats the instruction.[1] The exact keyword list and the duplicate-input heuristic are visible in the released code.[2]
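A checklist of this kind is easy to express as a single predicate. The keyword tuple and the word-count bounds below are illustrative placeholders, not the values from the released code, which should be consulted for the real list.

```python
import re

# Hypothetical values for illustration only.
BLOCKED_KEYWORDS = ("image", "picture", "video", "audio", "draw")
MIN_WORDS, MAX_WORDS = 4, 150


def passes_heuristics(instruction, instance_input, instance_output):
    """Return True if the instruction and its instance survive the checklist."""
    words = instruction.split()
    if not MIN_WORDS <= len(words) <= MAX_WORDS:
        return False  # too short or too long
    lowered = instruction.lower()
    if any(re.search(rf"\b{re.escape(k)}\b", lowered) for k in BLOCKED_KEYWORDS):
        return False  # likely needs a modality the model cannot handle
    normalized = instruction.strip().lower()
    if normalized in (instance_input.strip().lower(), instance_output.strip().lower()):
        return False  # the instance merely repeats the instruction
    return True
```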

The third filter is exact-match deduplication against any existing instance with the same input.[1] Together, the three filters discard a large share of the raw API output before it reaches the final dataset.
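Deduplication by input reduces to a set lookup. This is a minimal sketch that assumes instances are `(input, output)` tuples belonging to one instruction and that the first occurrence is the one kept. The released code may resolve collisions differently.

```python
def dedupe_instances(instances):
    """Keep the first (input, output) pair seen for each distinct input."""
    seen_inputs = set()
    kept = []
    for instance_input, instance_output in instances:
        key = instance_input.strip()
        if key in seen_inputs:
            continue  # same input already present: drop this instance
        seen_inputs.add(key)
        kept.append((instance_input, instance_output))
    return kept
```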

### How does the iterative bootstrapping loop converge?

The four steps repeat. Each iteration enlarges the task pool, the enlarged pool is sampled to seed the next iteration's prompts, and the process continues until the target dataset size or the API budget is reached. The paper reports a final corpus of 52,445 unique instructions paired with 82,439 instances; of the instructions, 11,584 are classification tasks and 40,861 are non-classification tasks.[1][8]
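The overall loop can be sketched with the language model abstracted away as a `generate` callable. To stay short, the sketch covers instruction generation and the diversity filter only, and it substitutes `difflib.SequenceMatcher` for ROUGE-L as a stand-in similarity measure. The function names, the stopping rule, and the iteration cap are all assumptions.

```python
import random
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in for ROUGE-L: token-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def bootstrap(seed_instructions, generate, target_size,
              max_iterations=1000, threshold=0.7, rng=None):
    """Grow the pool until it holds `target_size` instructions or iterations run out.

    `generate(demos)` is a hypothetical callable that sends the demonstrations
    to a language model and returns a list of candidate instructions.
    """
    rng = rng or random.Random(0)
    seeds = list(seed_instructions)
    machine = []
    for _ in range(max_iterations):
        if len(seeds) + len(machine) >= target_size:
            break
        # Up to two machine-written and up to six human-written demonstrations.
        demos = rng.sample(machine, min(2, len(machine)))
        demos += rng.sample(seeds, min(6, len(seeds)))
        for candidate in generate(demos):
            pool = seeds + machine
            if all(similarity(candidate, existing) < threshold for existing in pool):
                machine.append(candidate)
    return machine
```

Passing a stub for `generate` makes the control flow testable without any API calls: a candidate that duplicates an existing instruction is dropped and the loop simply moves on to the next iteration.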

## What is in the Self-Instruct dataset?

The released corpus, hosted in the public yizhongw/self-instruct repository, contains the model-generated instructions in JSONL form alongside the seed tasks and reformatted fine-tuning splits.[2] The seed file `seed_tasks.jsonl` is the literal 175-task pool that bootstraps the pipeline and was later inherited verbatim by Stanford Alpaca and several other downstream projects.[2][9] The model-generated portion, in `data/gpt3-generations/batch_221203/all_instances_82K.jsonl`, contains the 82K instances; a reformatted version suitable for fine-tuning lives under `data/finetuning/self_instruct_221203`.[2] The repository also ships 252 expert-written user-oriented evaluation instructions covering 119 different domains, used for the paper's human evaluation.[2] The code and data carry the Apache-2.0 license.[2]
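JSONL files store one JSON object per line, so they can be read with the standard library alone. In the sketch below the field names `instruction`, `instances`, `input`, and `output` are assumptions about the schema, and the two sample records are made up rather than copied from the repository.

```python
import json


def parse_jsonl(lines):
    """Parse an iterable of JSON Lines strings, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def count_instances(tasks):
    """Total number of instances across all tasks."""
    return sum(len(task.get("instances", [])) for task in tasks)


# Two made-up records in the assumed schema, not taken from the real file.
sample_lines = [
    '{"instruction": "Give a synonym for the word.", '
    '"instances": [{"input": "happy", "output": "joyful"}]}',
    "",
    '{"instruction": "Name a prime number.", '
    '"instances": [{"input": "", "output": "7"}, {"input": "", "output": "11"}]}',
]
tasks = parse_jsonl(sample_lines)
print(len(tasks), count_instances(tasks))  # 2 3
```

Since an open file handle iterates over its lines, the same function should work on a local copy of the data, for example `parse_jsonl(open("seed_tasks.jsonl", encoding="utf-8"))`.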

The authors annotated 200 random instructions to estimate the quality of the corpus and reported that 46 percent of the data points had one or more problems, ranging from incorrect outputs to instructions paired with unrelated inputs.[2] They recommended caution and explicitly invited the community to develop improved filtering methods.[2]

## How well did Self-Instruct perform?

The paper evaluates Self-Instruct on two settings: the Super-NaturalInstructions benchmark, which measures generalization to a held-out set of NLP tasks, and a hand-curated set of 252 user-oriented prompts evaluated by human raters.[1]

### How big was the gain on Super-NaturalInstructions?

On the Super-NaturalInstructions test split, the paper reports the following ROUGE-L scores, which are the SuperNI standard metric.[1][8]

| Model | Parameters | SuperNI ROUGE-L |
|---|---|---|
| Vanilla [GPT-3](/wiki/gpt-3) (davinci) | 175B | 6.8 |
| T0 | 11B | 33.1 |
| GPT-3 + Self-Instruct | 175B | 39.9 |
| [InstructGPT](/wiki/instructgpt)-001 | 175B | 40.8 |

The headline number from the paper is the 33-point absolute improvement that Self-Instruct training delivered over vanilla GPT-3, and the fact that the resulting open recipe came within 0.9 points of OpenAI's then-current InstructGPT-001.[1][3] T0 at 11B parameters is the closest publicly available instruction-tuned model in the comparison and is substantially smaller, although the comparison across parameter counts is not strictly apples to apples.[1]
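Both headline figures follow directly from the table. The short check below recomputes them from the ROUGE-L scores listed above; the dictionary keys are just labels.

```python
scores = {
    "GPT-3 (davinci)": 6.8,
    "T0": 33.1,
    "GPT-3 + Self-Instruct": 39.9,
    "InstructGPT-001": 40.8,
}

# Absolute gain of Self-Instruct tuning over the untuned base model.
gain = round(scores["GPT-3 + Self-Instruct"] - scores["GPT-3 (davinci)"], 1)
# Remaining gap to InstructGPT-001.
gap = round(scores["InstructGPT-001"] - scores["GPT-3 + Self-Instruct"], 1)

print(gain, gap)  # 33.1 0.9
```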

### How did human raters judge Self-Instruct?

For tasks more representative of real user prompts the authors collected 252 instructions over 119 application domains and asked four human annotators to rate each model's responses on a four-point scale (A best to D worst).[1] Human raters preferred GPT-3 trained with Self-Instruct to GPT-3 fine-tuned on the T0 training data or on the Super-NaturalInstructions training set by a wide margin, and the paper reports the recipe "leaving only a 5% absolute gap behind InstructGPT-001" on the held-out human evaluation.[1] The authors interpreted this gap as evidence that self-generated data can substantially close, although not eliminate, the distance to closed-source proprietary alignment pipelines.[1]

### What were the fine-tuning details?

The supervised fine-tuning was performed via the OpenAI fine-tuning API on the "davinci" engine, using two epochs with default hyperparameters except that the prompt-loss weight was set to zero so that the loss was computed only on the target outputs.[1][3] The paper does not report a dollar figure for the fine-tuning itself; the cost of the data generation step was approximately six hundred dollars in OpenAI credits according to community write-ups, with Stanford Alpaca later replicating a comparable pipeline for under five hundred dollars using text-davinci-003.[9]
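Setting the prompt-loss weight to zero means that prompt tokens contribute nothing to the training objective. The sketch below illustrates the idea as a weighted mean over per-token losses. It is a conceptual model only: the normalization used inside OpenAI's fine-tuning service is not described in the source, and the function name is hypothetical.

```python
def masked_mean_loss(token_losses, is_prompt_token, prompt_loss_weight=0.0):
    """Weighted mean of per-token losses, down-weighting prompt tokens."""
    weights = [prompt_loss_weight if is_prompt else 1.0 for is_prompt in is_prompt_token]
    total = sum(w * loss for w, loss in zip(weights, token_losses))
    norm = sum(weights)
    return total / norm if norm else 0.0


losses = [4.0, 4.0, 1.0, 3.0]          # made-up per-token losses
prompt_mask = [True, True, False, False]  # first two tokens belong to the prompt

print(masked_mean_loss(losses, prompt_mask))                          # 2.0
print(masked_mean_loss(losses, prompt_mask, prompt_loss_weight=1.0))  # 3.0
```

With weight zero only the two completion tokens are averaged, so the model is optimized to produce the target output rather than to reproduce the instruction.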

## What impact did Self-Instruct have?

Self-Instruct moved very quickly from research artifact to recipe. Within four months of the arXiv release, the same 175 seed tasks were the starting point for a sequence of widely used open instruction-tuned models, and the underlying pipeline diversified into several research subgenera.

### How did Stanford Alpaca use Self-Instruct?

The most visible adopter was Stanford's CRFM Alpaca project, released 2023-03-13. Alpaca explicitly adopts the Self-Instruct pipeline. The Stanford team wrote that it "started with the 175 human-written instruction-output pairs from the self-instruct seed set" and then used the [OpenAI API](/wiki/openai_api) to prompt OpenAI's text-davinci-003 for additional examples, choosing the newer model over the davinci endpoint that the original paper had used.[9] Three modifications to the pipeline are documented: a clearer batch-generation prompt that asks for twenty instructions at a time, removal of the classification-task vs non-classification-task distinction with a single unified prompt, and producing only one instance per instruction rather than two or three.[9] The output is a corpus of 52,000 examples (the alpaca_data.json file) generated for under five hundred dollars in API costs, used to fine-tune a [LLaMA](/wiki/llama) 7B base model on eight A100 GPUs in roughly three hours.[9]

Alpaca's dataset became a de facto benchmark in its own right, and its quirks (including frequent references to a January 2023 knowledge cutoff inherited from text-davinci-003) propagated into many derivative datasets.[9]

### How does Self-Instruct relate to Vicuna and ShareGPT data?

[Vicuna](/wiki/vicuna), released in March 2023 by a team from UC Berkeley, CMU, Stanford, UC San Diego, and Mohamed bin Zayed University, fine-tuned LLaMA on roughly seventy thousand conversations scraped from ShareGPT user uploads.[10] Although Vicuna's data source is different from Self-Instruct's (human-shared dialogues rather than seed-based bootstrapping), the underlying philosophy of training an open model on instruction-style data harvested from a stronger system is a direct descendant of the Self-Instruct paradigm.[10]

### How did Dolly 2.0 respond to Self-Instruct?

[Databricks](/wiki/databricks) took the opposite turn with Dolly 2.0, released 2023-04-12. The accompanying databricks-dolly-15k dataset is a 15,000-example human-generated instruction corpus contributed by more than five thousand Databricks employees over March and April 2023.[11] Dolly was explicitly positioned as a commercially licensable alternative to the OpenAI-generated Alpaca data, whose terms of use prevented using it to train models competing with OpenAI's services.[11] In that sense Dolly 2.0 is both a complement to and a critique of the Self-Instruct synthetic-data paradigm: it shows that 15K hand-written examples can produce a usable instruction-tuned model without depending on a proprietary distillation source.[11]

### How was Self-Instruct adapted for code?

Code Alpaca, released by Sahil Chaudhary in March 2023, adapted the Self-Instruct pipeline to the code domain. Its 20K-example dataset (`code_alpaca_20k.json`) was generated by prompting text-davinci-003 with seed tasks rewritten to focus on code generation, editing, and optimization, using essentially the Alpaca pipeline minus the classification distinction.[12] Code Alpaca's training data cost was under two hundred dollars.[12]

### How did CAMEL generalize the bootstrapping idea?

The CAMEL project (Communicative Agents for "Mind" Exploration of Large Language Model Society), arXiv:2303.17760, generalized the bootstrapping idea to multi-agent role-play. Instead of seeding from 175 human-written tasks, CAMEL used inception prompting to have one chat agent take the role of a user and another the role of an assistant, producing conversational instruction data covering software engineering and other domains.[13]

### What is Evol-Instruct and how does it extend Self-Instruct?

WizardLM, introduced by Xu et al. in April 2023 (arXiv:2304.12244, ICLR 2024), proposed Evol-Instruct as a successor to Self-Instruct's diversity heuristics.[14] Where Self-Instruct relies on the base model's stylistic variation across prompts and the ROUGE-L threshold to grow the corpus, Evol-Instruct uses a small fixed set of "evolution" operators (adding constraints, increasing reasoning depth, concretizing abstract steps, and so on) that an LLM applies repeatedly to existing instructions to make them more complex.[14] The paper reports that WizardLM, trained on Evol-Instruct data, outperforms Alpaca and Vicuna on a complex-instruction benchmark.[14] The follow-up WizardCoder (arXiv:2306.08568) ports the same idea to the code domain.[15]
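The contrast with Self-Instruct is that Evol-Instruct rewrites an existing instruction instead of sampling a fresh one. A minimal sketch of that loop follows. The operator names, their wording, and the `rewrite` callable (a stand-in for a language-model call) are illustrative assumptions, not the prompts from the WizardLM paper.

```python
import random

# Hypothetical operator prompts, loosely following the categories named above.
EVOLUTION_OPERATORS = {
    "add_constraints": "Add one more constraint or requirement to the instruction.",
    "deepen": "Rewrite the instruction so that it requires more reasoning steps.",
    "concretize": "Replace general concepts in the instruction with more specific ones.",
}


def build_evolution_prompt(instruction, operator):
    """Combine one operator's directive with the instruction to be evolved."""
    return (f"{EVOLUTION_OPERATORS[operator]}\n\n"
            f"Instruction: {instruction}\nRewritten instruction:")


def evolve(instruction, rewrite, rounds=3, rng=None):
    """Apply randomly chosen operators repeatedly, returning every version."""
    rng = rng or random.Random(0)
    history = [instruction]
    for _ in range(rounds):
        operator = rng.choice(sorted(EVOLUTION_OPERATORS))
        history.append(rewrite(build_evolution_prompt(history[-1], operator)))
    return history
```

Each round feeds the previous round's output back in, which is how complexity accumulates over successive evolutions.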

### How does Self-Align (Dromedary) build on Self-Instruct?

A separate strand, Principle-Driven Self-Alignment by Sun et al. (arXiv:2305.03047), uses Self-Instruct-style bootstrapping not to generate task data but to instantiate a small handcrafted set of behavioral principles into a much larger training corpus.[16] The resulting Dromedary model, trained from a base LLaMA, requires only five in-context exemplars and sixteen generic principles of human supervision, and the same authors later extended the approach with SALMON (arXiv:2310.05910), which trains a reward model on principle-following responses.[17]

### How does Self-Refine relate to Self-Instruct?

The same intuition appears in work adjacent to [knowledge distillation](/wiki/knowledge_distillation), notably Self-Refine (Madaan et al., arXiv:2303.17651, NeurIPS 2023), in which a single language model generates, critiques, and revises its own outputs in a refinement loop.[18] Self-Refine is not a data-generation pipeline like Self-Instruct, but it shares the "model-as-its-own-supervisor" principle and is frequently cited alongside it as a sibling technique.[18]

### How did AI2 use Self-Instruct data in Tulu?

The Allen Institute for AI, with which several of the Self-Instruct authors are affiliated, ran a sequence of open-instruct experiments and released a series of [Tulu](/wiki/tulu_3) models trained on mixtures of human-written and Self-Instruct-style synthetic data, providing some of the most carefully ablated evidence that a fraction of bootstrapped instructions contributes meaningfully to downstream performance.[7]

## Why was Self-Instruct significant?

Self-Instruct had an outsized influence on the open-source [instruction tuning](/wiki/instruction_tuning) ecosystem. By the spring of 2023 many of the most widely used open instruction-tuned models that relied on neither user-contributed conversations nor purpose-written human data were trained on data generated by Self-Instruct or a Self-Instruct derivative.[9][12][14] The Alpaca dataset alone was forked thousands of times and translated into dozens of languages.[9]

Three structural contributions stand out.

First, the paper demonstrated that a seed pool of only 175 tasks is sufficient. The viability of bootstrapping at this scale established that a small research group, without access to a labeled corpus or a fleet of contractors, could produce instruction data competitive with what frontier labs were generating internally.[1] This shifted the cost structure of alignment work and was a precondition for the wave of small-budget instruction-tuned models that followed.[9][11]

Second, Self-Instruct made the role of [synthetic data](/wiki/synthetic_data) in alignment legible. Earlier instruction-tuning work had treated data as a fixed asset (curated by humans or harvested from existing datasets); Self-Instruct made the data-generation pipeline itself an object of research, opening the door to follow-on work on filtering, evolution operators, and principle-driven generation.[14][16]

Third, the published [in-context learning](/wiki/in_context_learning) format used in Self-Instruct's generation prompts (eight demonstrations, six from the human pool and two from the machine pool) became a small template that subsequent open-source pipelines, including Alpaca's, copied or simplified.[9]

## What are the limitations of Self-Instruct?

The paper and the associated GitHub release are unusually candid about limitations.

### How good is the generated data?

The authors annotated a random sample of two hundred generated instructions and found that 46 percent had at least one problem: the instruction could be ambiguous, the input could mismatch the instruction, or the output could be incorrect.[2] They note that "most of the generated instructions are meaningful, while the generated instances may contain more noise (to a reasonable extent)" and recommend caution when using the corpus.[2]

### Why does the diversity filter saturate?

The ROUGE-L 0.7 threshold has a known failure mode: it controls surface-form diversity but not semantic diversity, so two instructions that paraphrase the same underlying task in different vocabulary can both pass.[1] As the corpus grows the proportion of new candidate instructions accepted shrinks, which both limits the practical scale of a Self-Instruct run and means later iterations spend most of their cost generating duplicates that are then discarded.[1]

### Does Self-Instruct propagate model bias?

Because all generated examples descend from the base model's prior, Self-Instruct inherits the biases of GPT-3 in both topic distribution and stylistic register.[1] The paper acknowledges that the long tail of rare instruction types is underrepresented, since these are precisely the cases for which the base model has weak priors.[1] Critics have argued that this risks reinforcing the existing distribution of language model behavior rather than expanding it.

### Is Self-Instruct a form of distillation?

A more general critique that emerged after Stanford Alpaca's release is that Self-Instruct, when run against a proprietary model such as text-davinci-003 or [ChatGPT](/wiki/chatgpt), functionally distills the proprietary model into the open one without paying the original training cost.[9] This raises both legal questions about OpenAI's terms of use (which Alpaca's release page explicitly flagged) and methodological questions about whether benchmarks that reward proprietary-style outputs are measuring genuine instruction-following or imitation of a particular vendor.[9]

### Is the evaluation metric reliable?

ROUGE-L on Super-NaturalInstructions is the metric reported in the paper, but the broader community has documented that ROUGE-L is noisy at distinguishing instruction-following quality and can be gamed by output length and surface-form similarity.[1] Subsequent evaluation harnesses such as [AlpacaEval](/wiki/alpacaeval) and human-vote arenas were developed in part to address this shortcoming.

## Related work

| Work | arXiv | Relation |
|---|---|---|
| [InstructGPT](/wiki/instructgpt) (Ouyang et al.) | 2203.02155 | Proprietary baseline that Self-Instruct approaches with open data |
| T0 (Sanh et al.) | 2110.08207 | Earlier instruction-tuned model used as a non-proprietary comparison |
| Alpaca (Stanford CRFM) | n/a (blog) | Direct application of Self-Instruct using text-davinci-003 |
| [Vicuna](/wiki/vicuna) (LMSys et al.) | n/a (blog) | Instruction tuning on ShareGPT conversations rather than seed-based bootstrap |
| Dolly 2.0 ([Databricks](/wiki/databricks)) | n/a (blog) | Human-written counterpoint to synthetic instruction data |
| Code Alpaca (Chaudhary) | n/a (repo) | Self-Instruct pipeline restricted to code-domain seeds |
| CAMEL (Li et al.) | 2303.17760 | Multi-agent role-play generalization |
| WizardLM Evol-Instruct (Xu et al.) | 2304.12244 | Replaces ROUGE-L diversity with evolution operators |
| WizardCoder (Luo et al.) | 2306.08568 | Evol-Instruct applied to code |
| Self-Align / Dromedary (Sun et al.) | 2305.03047 | Principle-driven variant requiring fewer seeds |
| SALMON (Sun et al.) | 2310.05910 | Reward modeling sibling of Self-Align |
| Self-Refine (Madaan et al.) | 2303.17651 | Same-model self-feedback at inference time |

## See also

- [WRAP (Web Rephrase Augmented Pre-training)](/wiki/rephrase_the_web)
- [Instruction Tuning](/wiki/instruction_tuning)
- [Supervised fine-tuning](/wiki/supervised_fine-tuning)
- [Reinforcement Learning from Human Feedback](/wiki/rlhf)
- [InstructGPT](/wiki/instructgpt)
- [GPT-3](/wiki/gpt-3)
- [Vicuna (language model)](/wiki/vicuna)
- [AlpacaEval](/wiki/alpacaeval)
- [Synthetic data](/wiki/synthetic_data)
- [Knowledge Distillation](/wiki/knowledge_distillation)
- [Constitutional AI](/wiki/constitutional_ai)
- [In-Context Learning](/wiki/in_context_learning)
- [ROUGE](/wiki/rouge_score)
- [Allen Institute for AI](/wiki/allen_institute_for_ai)
- [Databricks](/wiki/databricks)
- [LLaMA](/wiki/llama)
- [Tulu 3](/wiki/tulu_3)

## References

[1] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi, "Self-Instruct: Aligning Language Models with Self-Generated Instructions", arXiv, 2022-12-20 (v2 2023-05-25). https://arxiv.org/abs/2212.10560. Accessed 2026-06-21.

[2] Yizhong Wang et al., "yizhongw/self-instruct: Aligning pretrained language models with instruction data generated by themselves", GitHub, 2022-12-20. https://github.com/yizhongw/self-instruct. Accessed 2026-06-21.

[3] Yizhong Wang et al., "Self-Instruct: Aligning Language Models with Self-Generated Instructions (PDF)", arXiv, 2022-12-20. https://arxiv.org/pdf/2212.10560. Accessed 2026-06-21.

[4] Yizhong Wang et al., "Self-Instruct: Aligning Language Models with Self-Generated Instructions", Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484-13508, ACL Anthology, 2023-07. https://aclanthology.org/2023.acl-long.754/. Accessed 2026-06-21.

[5] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, "Training language models to follow instructions with human feedback", arXiv, 2022-03-04. https://arxiv.org/abs/2203.02155. Accessed 2026-06-21.

[6] Victor Sanh et al., "Multitask Prompted Training Enables Zero-Shot Task Generalization", arXiv, 2021-10-15 (ICLR 2022). https://arxiv.org/abs/2110.08207. Accessed 2026-06-21.

[7] Yizhong Wang, "Curriculum Vitae", yizhong-wang.com, 2024. https://yizhong-wang.com/assets/cv_yizhongw.pdf. Accessed 2026-06-21.

[8] Yizhong Wang et al., "Self-Instruct: Aligning Language Models with Self-Generated Instructions (HTML version)", ar5iv, 2023. https://ar5iv.labs.arxiv.org/html/2212.10560. Accessed 2026-06-21.

[9] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, Tatsunori B. Hashimoto, "Alpaca: A Strong, Replicable Instruction-Following Model", Stanford CRFM blog, 2023-03-13. https://crfm.stanford.edu/2023/03/13/alpaca.html. Accessed 2026-06-21.

[10] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, Eric P. Xing, "Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality", LMSYS, 2023-03-30. https://lmsys.org/blog/2023-03-30-vicuna/. Accessed 2026-06-21.

[11] Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, Reynold Xin, "Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM", Databricks blog, 2023-04-12. https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm. Accessed 2026-06-21.

[12] Sahil Chaudhary, "Code Alpaca: An Instruction-following LLaMA model for code generation", GitHub, 2023-03-27. https://github.com/sahil280114/codealpaca. Accessed 2026-06-21.

[13] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, Bernard Ghanem, "CAMEL: Communicative Agents for 'Mind' Exploration of Large Language Model Society", arXiv, 2023-03-31. https://arxiv.org/abs/2303.17760. Accessed 2026-06-21.

[14] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Daxin Jiang, "WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions", arXiv, 2023-04-24 (ICLR 2024). https://arxiv.org/abs/2304.12244. Accessed 2026-06-21.

[15] Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, Daxin Jiang, "WizardCoder: Empowering Code Large Language Models with Evol-Instruct", arXiv, 2023-06-14. https://arxiv.org/abs/2306.08568. Accessed 2026-06-21.

[16] Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan, "Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision", arXiv, 2023-05-04. https://arxiv.org/abs/2305.03047. Accessed 2026-06-21.

[17] Zhiqing Sun, Yikang Shen, Hongxin Zhang, Qinhong Zhou, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan, "SALMON: Self-Alignment with Principle-Following Reward Models", arXiv, 2023-10-09. https://arxiv.org/abs/2310.05910. Accessed 2026-06-21.

[18] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Sean Welleck, Bodhisattwa Prasad Majumder, Shashank Gupta, Amir Yazdanbakhsh, Peter Clark, "Self-Refine: Iterative Refinement with Self-Feedback", arXiv, 2023-03-30 (NeurIPS 2023). https://arxiv.org/abs/2303.17651. Accessed 2026-06-21.