Evol-Instruct
Last reviewed
Jun 8, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 2,224 words
Evol-Instruct is a method for automatically building large instruction tuning datasets by using a large language model to rewrite, or "evolve," existing instructions into progressively more complex and more diverse variants. It was introduced in the paper "WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions" by Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang, researchers at Microsoft and Peking University, first released as an arXiv preprint on April 24, 2023 and later accepted at the International Conference on Learning Representations (ICLR) 2024. [1] The data it produces was used to fine-tune the WizardLM family of open models, and the technique became a widely copied recipe for synthetic data generation in open-source instruction tuning.
Evol-Instruct treats a small pool of human-written instructions as seeds and repeatedly mutates them with a strong teacher model. Each pass sends an instruction to the teacher together with an "evolving prompt" that asks the teacher either to make the instruction harder (In-Depth Evolving) or to invent a related but novel instruction (In-Breadth Evolving). The teacher then generates a response to each evolved instruction, a heuristic Instruction Eliminator discards evolutions that failed, and the surviving instruction and response pairs are pooled with the originals. Running this loop for several rounds turns roughly fifty thousand simple seed prompts into hundreds of thousands of instructions that span a much wider range of difficulty and topic. A student model is then fine-tuned on the result. [1]
Because the teacher is a proprietary model and the student is a smaller open model, Evol-Instruct is a form of knowledge distillation: the capabilities of the teacher are transferred into the student through generated training data rather than through logits or weights. Its central claim is that AI-evolved instructions are not merely cheaper than human-written ones but actually better for teaching complex, multi-step behavior, because a language model can manufacture and control instruction complexity more systematically than human annotators can at scale. [1]
Open instruction-following models in early 2023 were typically trained on data produced by Self-Instruct, the technique behind Stanford's Alpaca, in which a teacher model expands 175 human-written seed tasks into tens of thousands of new ones. [2][6] That approach scaled the quantity of instructions but not their difficulty: the generated prompts clustered around simple, common requests and rarely contained the multi-constraint, multi-step problems that distinguish a capable assistant from a basic one. Writing such complex instructions by hand is slow, expensive, and hard to do consistently, and human annotators struggle to dial complexity up or down on demand or to cover the long tail of rare task types. [1]
Evol-Instruct was designed to fill that gap. Rather than generating instructions from scratch, it starts from an existing dataset and applies controlled, incremental transformations whose explicit purpose is to raise complexity and broaden coverage. The authors argued that a model trained on a complexity-balanced distribution that includes many genuinely hard prompts would follow difficult real-world instructions better than a model trained only on easy ones, and their experiments supported this. [1]
The pipeline has three stages that repeat for a fixed number of rounds M: instruction evolving, response generation, and elimination evolving. In the original work the seed set was the 52,000 instructions from Alpaca, the teacher was OpenAI's ChatGPT (the gpt-3.5-turbo model accessed through Azure), M was set to 4, and the process produced about 250,000 instructions in total. [1]
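The three stages can be sketched as a short loop. This is a minimal illustration rather than the authors' code: the `evolve`, `respond`, and `is_failed` callables are hypothetical stand-ins for teacher API calls and the eliminator, and the sketch assumes that each round evolves the latest generation of instructions and carries a failed instruction forward unchanged.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (instruction, response)


def evol_instruct(
    seeds: List[Pair],
    evolve: Callable[[str], str],
    respond: Callable[[str], str],
    is_failed: Callable[[str, str, str], bool],
    rounds: int = 4,
) -> List[Pair]:
    """Run `rounds` passes of evolve -> respond -> eliminate.

    Assumption: every round evolves the latest generation, and an
    instruction whose evolution fails is retried in the next round.
    """
    pool: List[Pair] = list(seeds)
    current = [instruction for instruction, _ in seeds]
    for _ in range(rounds):
        next_generation = []
        for old in current:
            new = evolve(old)          # stage 1: instruction evolving
            answer = respond(new)      # stage 2: response generation
            if is_failed(old, new, answer):  # stage 3: elimination
                next_generation.append(old)
            else:
                pool.append((new, answer))
                next_generation.append(new)
        current = next_generation
    return pool


# Toy stand-ins so the sketch runs without any API access.
toy_seeds = [("Add 2 and 3.", "5"), ("Name a prime number.", "7")]
pool = evol_instruct(
    toy_seeds,
    evolve=lambda text: text + " Explain each step.",
    respond=lambda text: "Answer to: " + text,
    is_failed=lambda old, new, answer: new == old,
)
```

With two toy seeds and four rounds in which nothing fails, the pool ends with ten pairs: the two seeds plus two new pairs per round.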
In each round, every instruction is upgraded once. The system randomly selects one of six evolving prompts with equal probability: five belong to In-Depth Evolving and one to In-Breadth Evolving. Every evolving prompt instructs the teacher to rewrite the input into a new prompt that stays reasonable, remains answerable by a human, and adds only a modest amount of text (the paper suggests roughly 10 to 20 words) so that complexity grows gradually instead of exploding in a single step. [1]
In-Depth Evolving makes a single instruction harder while keeping it on the same topic. It uses five distinct operations: [1]
| Operation | What it does |
|---|---|
| Add constraints | Inserts an additional requirement or restriction the answer must satisfy. |
| Deepening | Replaces a general query with a more specific or probing version of the same question. |
| Concretizing | Swaps abstract or generic concepts for concrete, particular ones. |
| Increase reasoning steps | Rewrites a task that could be solved directly so that it now requires explicit multi-step reasoning. |
| Complicate input | Adds harder structured input, such as data in a particular format, code, tables, or formulas, for the instruction to operate on. |
In-Breadth Evolving is a mutation operation that asks the teacher to produce a completely new instruction inspired by the given one, belonging to the same broad domain but covering a different, rarer, or more specialized task. Its purpose is diversity rather than difficulty: it pushes the dataset to cover long-tailed topics and task types that the seed set never contained, widening the distribution the student model sees. [1]
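The selection step and the six operations described above can be sketched as follows. The operation names mirror the table, but the one-line directions and the `build_evolving_prompt` helper are hypothetical placeholders: the prompts used in the paper are longer and worded differently.

```python
import random

# The five In-Depth operations plus the single In-Breadth operation.
IN_DEPTH_OPS = [
    "add_constraints",
    "deepening",
    "concretizing",
    "increase_reasoning_steps",
    "complicate_input",
]
ALL_OPS = IN_DEPTH_OPS + ["in_breadth"]

# Hypothetical one-line directions, written for illustration only.
DIRECTIONS = {
    "add_constraints": "Add one more constraint or requirement.",
    "deepening": "Ask about the same topic in more depth.",
    "concretizing": "Replace general concepts with more specific ones.",
    "increase_reasoning_steps": "Require explicit multi-step reasoning.",
    "complicate_input": "Add structured input such as a table or code.",
    "in_breadth": "Write a new, rarer instruction in the same domain.",
}


def pick_operation(rng: random.Random) -> str:
    """Choose one of the six operations with equal probability."""
    return rng.choice(ALL_OPS)


def build_evolving_prompt(operation: str, instruction: str) -> str:
    """Wrap an instruction in a placeholder evolving prompt."""
    return (
        f"{DIRECTIONS[operation]} Add only 10 to 20 words. "
        "Keep the result reasonable and answerable by a human.\n"
        f"Instruction: {instruction}"
    )


rng = random.Random(0)
operation = pick_operation(rng)
prompt = build_evolving_prompt(operation, "Explain what a hash table is.")
```

Because the draw is uniform over six prompts, about five in six evolutions deepen an instruction and about one in six broadens the dataset.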
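After an instruction is evolved, the teacher generates a response to it, and the elimination step filters out evolutions that failed. The paper defines four conditions under which an evolved instruction is treated as failed and removed: (1) the evolved instruction adds no information compared with the original, as judged by the teacher model; (2) the teacher struggles to answer it, detected when the response contains "sorry" and is shorter than 80 words; (3) the response consists only of punctuation and stop words; and (4) the evolved instruction copies wording from the evolving prompt itself, such as "given prompt" or "rewritten prompt". [1]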
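A filter of this kind might look like the sketch below, which follows the checks described in the WizardLM paper: no information gain, a short response containing "sorry", a response made only of punctuation and stop words, and an evolved instruction that echoes the evolving prompt. The information-gain check needs a model judgment, so it is passed in as a hypothetical `no_info_gain` callable with a trivial default, and the stop-word set is a tiny illustrative subset rather than a real list.

```python
import string
from typing import Callable

# Illustrative subset only; a real filter would use a full stop-word list.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "is", "it", "in"}
# Phrases that suggest the teacher echoed the evolving prompt.
LEAKED_PHRASES = ("given prompt", "rewritten prompt")


def is_failed_evolution(
    original: str,
    evolved: str,
    response: str,
    no_info_gain: Callable[[str, str], bool] = lambda o, e: o.strip() == e.strip(),
    max_refusal_words: int = 80,
) -> bool:
    """Return True if the evolved instruction should be discarded."""
    # 1. The evolution adds nothing (in practice a teacher-model judgment).
    if no_info_gain(original, evolved):
        return True
    # 2. The teacher struggled: a short response containing "sorry".
    words = response.split()
    if "sorry" in response.lower() and len(words) < max_refusal_words:
        return True
    # 3. The response holds only punctuation and stop words.
    content = [w.strip(string.punctuation).lower() for w in words]
    if not [w for w in content if w and w not in STOP_WORDS]:
        return True
    # 4. The evolved instruction copies wording from the evolving prompt.
    lowered = evolved.lower()
    return any(phrase in lowered for phrase in LEAKED_PHRASES)


kept = not is_failed_evolution(
    "Add 2 and 3.",
    "Add 2 and 3, then square the result.",
    "The sum is 5 and its square is 25.",
)
```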
Surviving instruction and response pairs are added to the data pool, while an instruction whose evolution failed is carried over unchanged so that the next round can try again. Each round therefore evolves the latest generation of instructions rather than the whole accumulated pool, which is why four rounds over 52,000 seeds yield about 250,000 instructions instead of an exponentially larger number. After four rounds, the WizardLM authors sampled 70,000 examples from the full 250,000 and fine-tuned a LLaMA 7B model on them; the 70,000 figure was chosen to match the data volume used by Vicuna so the comparison would be fair. Later WizardLM releases scaled this recipe to 13B and 30B students and to larger evolved datasets. The whole construction requires only the teacher model and an initial seed set, with no human labeling beyond the seeds, at a cost of several hundred thousand teacher API calls across the four rounds. [1]
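A back-of-the-envelope check shows how these figures fit together. It assumes that each round adds at most one new instruction per instruction in the previous generation; that assumption, and the derived quantities, are illustrative rather than taken from the paper.

```python
seeds = 52_000           # Alpaca seed instructions
rounds = 4               # M in the original work
reported_total = 250_000
sampled = 70_000         # examples used to train the 7B model

# Upper bound if every evolution in every round survived elimination.
upper_bound = seeds * (1 + rounds)

# Gap between the bound and the reported size, attributable to
# eliminated evolutions under the assumption above.
shortfall = upper_bound - reported_total

# Share of the evolved data actually used for fine-tuning.
sample_fraction = sampled / reported_total

print(upper_bound, shortfall, round(sample_fraction, 2))
```

Under that assumption the pool can hold at most 260,000 instructions, so the reported total of about 250,000 is consistent with a small share of evolutions being eliminated, and the 7B model saw 28 percent of the evolved data.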
Evol-Instruct anchors a family of open models built by the same group and collaborators, each adapting the evolving idea to a different domain.
| Model | First release | Base model | Evol-Instruct variant | Headline result |
|---|---|---|---|---|
| WizardLM | April 2023 | LLaMA 7B (later 13B, 30B) | General Evol-Instruct, 250K evolved, 70K used for the 7B | Preferred over ChatGPT by human raters on high-complexity prompts [1] |
| WizardCoder | June 2023 | StarCoder 15B (later DeepSeek-Coder) | Code Evol-Instruct | 57.3 pass@1 on HumanEval; a later 33B V1.1 reached 79.9 [3] |
| WizardMath | August 2023 | Llama 2 7B/13B/70B | Reinforced Evol-Instruct (RLEIF) | 81.6 pass@1 on GSM8K, 22.7 on MATH for the 70B [4] |
WizardCoder (Luo et al., 2306.08568, ICLR 2024) specialized the evolving prompts for programming and applied them to the StarCoder base model. Its 15B model scored 57.3 pass@1 on HumanEval, outperforming several larger closed models available at the time, and a later WizardCoder-33B-V1.1 derived from DeepSeek-Coder reached 79.9 pass@1. [3]
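Scores such as these are reported as pass@1, the probability that a single generated program passes a problem's unit tests. A common way to estimate pass@k is the unbiased estimator introduced alongside HumanEval, sketched below. This is not WizardCoder's evaluation code, and the sample counts in the usage lines are invented for illustration.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed."""
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems given (n_samples, n_correct) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Invented counts: three problems, 20 samples each.
toy_results = [(20, 5), (20, 0), (20, 20)]
score = benchmark_pass_at_k(toy_results, k=1)
```

For k = 1 the estimator reduces to the fraction of passing samples per problem, so a model evaluated with one greedy sample per problem simply scores the share of problems it solves.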
WizardMath (Luo et al., 2308.09583) extended the idea into Reinforced Evol-Instruct, sometimes abbreviated RLEIF (Reinforcement Learning from Evol-Instruct Feedback). Built on Llama 2, it first evolves math problems both upward (harder) and downward (easier) from GSM8K and MATH, then trains two reward models: an Instruction Reward Model that scores the quality of evolved problems and a process-supervised reward model that scores each intermediate reasoning step, an idea drawn from OpenAI's process-supervision work. The combined rewards drive a proximal policy optimization phase, making WizardMath a hybrid of Evol-Instruct data synthesis and reinforcement learning. WizardMath-70B reached 81.6 pass@1 on GSM8K and 22.7 on the MATH benchmark. [4]
In April 2024 Microsoft released WizardLM-2 (a 7B model, a 70B model, and an 8x22B mixture-of-experts model based on Mixtral) trained with a more elaborate, fully synthetic data system descended from Evol-Instruct; the weights were briefly withdrawn shortly after release and then restored once an outstanding toxicity review was completed. [7]
On the original WizardLM, a GPT-4-based automatic evaluation found that the 7B student reached more than 90 percent of ChatGPT's capability on 17 of 29 tested skills. [1] In a blind human evaluation on a complexity-balanced test set, annotators preferred WizardLM's answers to those of Alpaca and Vicuna by wide margins, and on the highest-difficulty portion of the test set they preferred WizardLM to ChatGPT itself (a win rate of 42.9 percent versus 35.0 percent), even though ChatGPT remained ahead overall. [1] Trained on the same 70,000-example budget, WizardLM beat Vicuna by 12.4 points (41.3 versus 28.9 percent), which the authors offered as evidence that the gains came from the evolved data's complexity rather than its quantity. [1]
Beyond the WizardLM models themselves, Evol-Instruct had a large influence on open-source practice. The released evolved datasets (distributed on the Hugging Face Hub under names such as the 70K and 196K WizardLM_evol_instruct sets) were reused to fine-tune many community models, and the broader "evolve a seed set with a teacher" pattern became a standard tool for data augmentation in instruction tuning, appearing as a component in numerous later synthetic-data pipelines. [7]
A direct methodological follow-up, "Automatic Instruction Evolving for Large Language Models" (Zeng et al., 2406.00770, submitted June 2, 2024), introduced Auto Evol-Instruct. It removes the hand-designed evolving rules of the original method: instead of the five fixed In-Depth operations, an optimizer model analyzes a sample of the data, drafts a generic evolving method, then iteratively inspects failures in the evolution trajectories and rewrites the method to fix them, with no human-specified operations at all. The authors reported that this fully automatic version surpassed the original human-designed Evol-Instruct across MT-Bench, AlpacaEval, GSM8K, and HumanEval. [5]
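The optimization loop described above can be sketched in a few lines. Everything here is a hypothetical stand-in: in the real system `evolve_with`, `find_failures`, and `rewrite_method` would each be calls to a language model, and the stopping rule and step count are assumptions made for this sketch.

```python
from typing import Callable, List


def optimize_evolving_method(
    initial_method: str,
    sample: List[str],
    evolve_with: Callable[[str, str], str],
    find_failures: Callable[[List[str], List[str]], List[str]],
    rewrite_method: Callable[[str, List[str]], str],
    steps: int = 3,
) -> str:
    """Refine an evolving method until a data sample evolves cleanly."""
    method = initial_method
    for _ in range(steps):
        # Apply the current method to a small sample of instructions.
        evolved = [evolve_with(method, instruction) for instruction in sample]
        # Inspect the evolution trajectories for failures.
        failures = find_failures(sample, evolved)
        if not failures:
            break
        # Let the optimizer rewrite the method to address the failures.
        method = rewrite_method(method, failures)
    return method


# Toy stand-ins: evolution only "works" once the method mentions constraints.
def toy_evolve(method: str, instruction: str) -> str:
    if "add a constraint" in method:
        return instruction + " Answer in one sentence."
    return instruction


def toy_find_failures(originals: List[str], evolved: List[str]) -> List[str]:
    return [o for o, e in zip(originals, evolved) if o == e]


def toy_rewrite(method: str, failures: List[str]) -> str:
    return method + " Always add a constraint."


final_method = optimize_evolving_method(
    "Make the instruction harder.",
    ["Define entropy."],
    toy_evolve,
    toy_find_failures,
    toy_rewrite,
)
```

In the toy run the first pass leaves the instruction unchanged, the method is rewritten once, and the second pass succeeds, so the loop stops early.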
Evol-Instruct shares the weaknesses of any teacher-distillation pipeline, plus some specific to instruction mutation. [1][5]