Evol-Instruct


Updated Jun 29, 2026


Evol-Instruct is a method for automatically generating large instruction tuning datasets by prompting a large language model to rewrite, or "evolve," existing instructions into progressively more complex and more diverse variants. It was introduced by Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang of Microsoft and Peking University in the paper "WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions," first released as an arXiv preprint on April 24, 2023 and later accepted at the International Conference on Learning Representations (ICLR) 2024. ^[1] In the original work, Evol-Instruct turned the 52,000 Alpaca seed instructions, which were themselves generated with Self-Instruct rather than written by hand, into roughly 250,000 increasingly complex instructions using ChatGPT as the teacher; the resulting data trained the WizardLM family of open models, and the technique became a widely copied recipe for synthetic data generation in open-source instruction tuning. ^[1]

The paper summarizes its own contribution plainly: "Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM." ^[1]

What is Evol-Instruct?

Evol-Instruct treats an existing pool of simple instructions as seeds and repeatedly mutates them with a strong teacher model. Each pass sends an instruction to the teacher together with an "evolving prompt" that asks the teacher either to make the instruction harder (In-Depth Evolving) or to invent a related but novel instruction (In-Breadth Evolving). The teacher then generates a response to each evolved instruction, a heuristic Instruction Eliminator discards evolutions that failed, and the kept instruction and response pairs are pooled with the originals. Running this loop for several rounds turns roughly fifty thousand simple seed prompts into hundreds of thousands of instructions that span a much wider range of difficulty and topic. A student model is then fine-tuned on the result. ^[1]

Because the teacher is a proprietary model and the student is a smaller open model, Evol-Instruct is a form of knowledge distillation: the capabilities of the teacher are transferred into the student through generated training data rather than through logits or weights. Its central claim is that AI-evolved instructions are not merely cheaper than human-written ones but actually better for teaching complex, multi-step behavior, because a language model can manufacture and control instruction complexity more systematically than human annotators can at scale. The authors report that on a complexity-balanced test bed, "instructions from Evol-Instruct are superior to human-created ones." ^[1]

Why was Evol-Instruct created?

Open instruction-following models in early 2023 were typically trained on data produced by Self-Instruct, the technique behind Stanford's Alpaca, in which a teacher model expands a small set of human-written seed tasks (175 in the original Self-Instruct work) into tens of thousands of new ones. ^[2]^[6] That approach scaled the quantity of instructions but not their difficulty: the generated prompts clustered around simple, common requests and rarely contained the multi-constraint, multi-step problems that distinguish a capable assistant from a basic one. Writing such complex instructions by hand is slow, expensive, and hard to do consistently, and human annotators struggle to dial complexity up or down on demand or to cover the long tail of rare task types. ^[1]

Evol-Instruct was designed to fill that gap. Rather than generating instructions from scratch, it starts from an existing dataset and applies controlled, incremental transformations whose explicit purpose is to raise complexity and broaden coverage. The authors argued that a model trained on a complexity-balanced distribution that includes many genuinely hard prompts would follow difficult real-world instructions better than a model trained only on easy ones, and their experiments supported this. ^[1]

How does Evol-Instruct work?

The pipeline has three stages that repeat for a fixed number of rounds M: instruction evolving, response generation, and elimination evolving. In the original work the seed set was the 52,000 instructions from Alpaca, the teacher was OpenAI's ChatGPT (the gpt-3.5-turbo model accessed through Azure), M was set to 4, and the process produced about 250,000 instructions in total. ^[1]

In each round, every instruction is upgraded once. The system randomly selects one of six evolving prompts with equal probability: five belong to In-Depth Evolving and one to In-Breadth Evolving. Every evolving prompt instructs the teacher to rewrite the input into a new prompt that stays reasonable, remains answerable by a human, and adds only a modest amount of text (the paper suggests roughly 10 to 20 words) so that complexity grows gradually instead of exploding in a single step. ^[1]
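The selection step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the operation identifiers are made-up labels for the five In-Depth operations and the single In-Breadth operation, and `pick_operation` and `assign_operations` are hypothetical helper names.

```python
import random

# Illustrative identifiers for the six evolving prompts (five In-Depth, one In-Breadth).
IN_DEPTH_OPS = [
    "add_constraints",
    "deepening",
    "concretizing",
    "increase_reasoning_steps",
    "complicate_input",
]
IN_BREADTH_OPS = ["in_breadth"]
ALL_OPS = IN_DEPTH_OPS + IN_BREADTH_OPS


def pick_operation(rng: random.Random) -> str:
    """Choose one of the six evolving prompts with equal probability."""
    return rng.choice(ALL_OPS)


def assign_operations(instructions, seed=0):
    """Pair every instruction with one randomly chosen operation for this round."""
    rng = random.Random(seed)
    return [(instruction, pick_operation(rng)) for instruction in instructions]


round_plan = assign_operations(
    ["Name three primary colors.", "Sort a list of integers."]
)
for instruction, operation in round_plan:
    print(f"{operation:>26}  <-  {instruction}")
```

Because the choice is uniform over six prompts, an instruction receives an In-Depth operation with probability 5/6 (about 83 percent) and the In-Breadth operation with probability 1/6 in any given round.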

How do In-Depth and In-Breadth evolving differ?

In-Depth Evolving makes a single instruction harder while keeping it on the same topic. It uses five distinct operations: ^[1]

Operation	What it does
Add constraints	Inserts an additional requirement or restriction the answer must satisfy.
Deepening	Replaces a general query with a more specific or probing version of the same question.
Concretizing	Swaps abstract or generic concepts for concrete, particular ones.
Increase reasoning steps	Rewrites a task that could be solved directly so that it now requires explicit multi-step reasoning.
Complicate input	Adds harder structured input, such as data in a particular format, code, tables, or formulas, for the instruction to operate on.

In-Breadth Evolving is a mutation operation that asks the teacher to produce a completely new instruction inspired by the given one, belonging to the same broad domain but covering a different, rarer, or more specialized task. The paper describes it as "mutation, i.e., generating a completely new instruction based on the given instruction." Its purpose is diversity rather than difficulty: it pushes the dataset to cover long-tailed topics and task types that the seed set never contained, widening the distribution the student model sees. ^[1]
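To make the difference between the two modes concrete, the sketch below assembles an evolving prompt for either kind of operation. The wording of every template and hint is hypothetical and is not the prompt text from the paper; only the constraints it encodes (stay reasonable, stay answerable by a human, add a limited number of words for In-Depth steps) come from the description above. The function name `build_evolving_prompt` and the `max_added_words` parameter are likewise assumptions.

```python
# Hypothetical one-line hints for each operation; these are NOT the paper's prompt text.
OPERATION_HINTS = {
    "add_constraints": "Add one more constraint or requirement to the prompt.",
    "deepening": "Ask a more specific, more probing version of the same question.",
    "concretizing": "Replace general concepts with more specific ones.",
    "increase_reasoning_steps": "Require explicit multi-step reasoning to reach the answer.",
    "complicate_input": "Add structured input (a table, code, or a formula) to operate on.",
    "in_breadth": "Create a brand-new prompt in the same domain that covers a rarer task.",
}


def build_evolving_prompt(
    instruction: str, operation: str, max_added_words: int = 20
) -> str:
    """Return the text sent to the teacher model for one evolution step."""
    if operation not in OPERATION_HINTS:
        raise ValueError(f"unknown operation: {operation}")
    hint = OPERATION_HINTS[operation]
    if operation == "in_breadth":
        # Breadth: a new instruction, so no word budget relative to the original.
        task = f"Use the given prompt only as inspiration. {hint}"
    else:
        # Depth: same topic, slightly harder, with a cap on added length.
        task = (
            f"Rewrite the given prompt into a harder version on the same topic. {hint} "
            f"Add no more than {max_added_words} words."
        )
    guard = "The rewritten prompt must stay reasonable and answerable by a human."
    return f"{task}\n{guard}\n\nGiven prompt:\n{instruction}\n\nRewritten prompt:\n"


print(build_evolving_prompt("Sort a list of integers.", "add_constraints"))
```

The only structural difference between the two branches is the task line: In-Depth prompts ask for a bounded rewrite of the same instruction, while the In-Breadth prompt asks for a different instruction in the same domain.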

What is Elimination Evolving?

After an instruction is evolved, the teacher generates a response to it, and the elimination step filters out evolutions that failed. The paper defines four conditions under which an evolved instruction is treated as failed and removed: ^[1]

It provides no information gain over the original instruction, judged by asking the teacher (ChatGPT) to compare the two.
The generated response suggests the model could not answer, for example a short reply (fewer than about 80 words) that contains words such as "sorry."
The response consists only of punctuation and stop words, carrying no real content.
The evolved instruction merely copies boilerplate from the evolving prompt, such as the phrases "given prompt" or "rewritten prompt."
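The four conditions above translate naturally into a small filter. The following is a sketch under assumptions rather than the authors' implementation: the stop-word set is a tiny illustrative sample, the 80-word threshold is taken from the description above, and `has_information_gain` is a hypothetical callable standing in for the teacher-model comparison (here replaced by a trivial string check for demonstration).

```python
import string

# Tiny illustrative stop-word set; a real filter would use a fuller list.
STOP_WORDS = {"a", "an", "the", "of", "and", "or", "to", "in", "is", "it", "on", "for"}
LEAK_PHRASES = ("given prompt", "rewritten prompt")


def looks_like_refusal(response: str, max_words: int = 80) -> bool:
    """Condition 2: a short reply that contains an apology marker."""
    return len(response.split()) < max_words and "sorry" in response.lower()


def is_contentless(response: str) -> bool:
    """Condition 3: nothing left after removing punctuation and stop words."""
    tokens = [t.strip(string.punctuation).lower() for t in response.split()]
    tokens = [t for t in tokens if t]
    return all(t in STOP_WORDS for t in tokens)


def leaks_prompt(evolved: str) -> bool:
    """Condition 4: the evolved instruction copies evolving-prompt boilerplate."""
    lowered = evolved.lower()
    return any(phrase in lowered for phrase in LEAK_PHRASES)


def is_failed_evolution(original, evolved, response, has_information_gain) -> bool:
    """Return True if any of the four elimination conditions applies."""
    if not has_information_gain(original, evolved):  # Condition 1 (teacher judgment)
        return True
    return looks_like_refusal(response) or is_contentless(response) or leaks_prompt(evolved)


# Stand-in for the teacher comparison: treat any textual change as a gain.
def naive_gain(original: str, evolved: str) -> bool:
    return original.strip().lower() != evolved.strip().lower()


print(is_failed_evolution(
    "Sort a list of integers.",
    "Sort a list of integers in descending order without built-in functions.",
    "Use merge sort: split the list, sort each half, then merge the halves.",
    naive_gain,
))
```

Only the first condition needs a model call; the other three are cheap string checks, which is why the filter catches obvious failures but not well-formed wrong answers.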

Surviving instruction and response pairs are added to the data pool, and the next round evolves the newest generation of instructions again; the final dataset mixes the seeds with every generation. With 52,000 seeds and four rounds this yields at most 52,000 × 5 = 260,000 instructions, which is consistent with the roughly 250,000 reported once failed evolutions are removed. After four rounds, the WizardLM authors sampled 70,000 examples from the full 250,000 and fine-tuned a LLaMA 7B model on them; the 70,000 figure was chosen to match the data volume used by Vicuna so the comparison would be fair. Later WizardLM releases scaled this recipe to 13B and 30B students and to larger evolved datasets. The whole construction requires only the teacher model and an initial seed set, with no human labeling beyond the seeds, at a cost of several hundred thousand teacher API calls across the four rounds. ^[1]
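Putting the stages together, the whole data-construction loop might look like the sketch below. It rests on two labeled assumptions: each round evolves only the newest generation of instructions (which keeps the total near seeds × (rounds + 1), in line with the reported scale), and a failed evolution is dropped while its parent is retried in the next round. The `evolve`, `respond`, and `is_failed` callables are hypothetical stand-ins for teacher-model calls and the elimination filter; the demo replaces them with trivial stubs.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (instruction, response)


def evol_instruct(
    seeds: List[Pair],
    evolve: Callable[[str], str],
    respond: Callable[[str], str],
    is_failed: Callable[[str, str, str], bool],
    rounds: int = 4,
) -> List[Pair]:
    """Run evolve -> respond -> eliminate for a fixed number of rounds."""
    pool = list(seeds)
    current = [instruction for instruction, _ in seeds]
    for _ in range(rounds):
        next_generation = []
        for instruction in current:
            evolved = evolve(instruction)
            response = respond(evolved)
            if is_failed(instruction, evolved, response):
                # Assumption: drop the failed evolution and retry the parent next round.
                next_generation.append(instruction)
            else:
                pool.append((evolved, response))
                next_generation.append(evolved)
        current = next_generation
    return pool


def sample_training_set(pool: List[Pair], budget: int, seed: int = 0) -> List[Pair]:
    """Draw a fixed-size training subset, e.g. 70,000 out of roughly 250,000."""
    rng = random.Random(seed)
    return rng.sample(pool, k=min(budget, len(pool)))


# Stub teacher so the sketch runs without any API access.
demo_seeds = [("Sort a list.", "Use sorted()."), ("Define entropy.", "A measure of uncertainty.")]
demo_pool = evol_instruct(
    demo_seeds,
    evolve=lambda text: text + " Explain each step.",
    respond=lambda text: "Answer: " + text,
    is_failed=lambda original, evolved, response: False,
)
print(len(demo_pool), len(sample_training_set(demo_pool, budget=5)))
```

With no failures, two seeds and four rounds produce 2 × 5 = 10 pairs; the same arithmetic applied to 52,000 seeds gives the 260,000 upper bound.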

What models were trained with Evol-Instruct?

Evol-Instruct anchors a family of open models built by the same group and collaborators, each adapting the evolving idea to a different domain.

Model	First release	Base model	Evol-Instruct variant	Headline result
WizardLM	April 2023	LLaMA 7B (later 13B, 30B)	General Evol-Instruct, 250K evolved, 70K used for the 7B	Preferred over ChatGPT by human raters on high-complexity prompts ^[1]
WizardCoder	June 2023	StarCoder 15B (later DeepSeek-Coder)	Code Evol-Instruct, 78K evolved code instructions	57.3 pass@1 on HumanEval; a later 33B V1.1 reached 79.9 ^[3]
WizardMath	August 2023	Llama 2 7B/13B/70B	Reinforced Evol-Instruct (RLEIF)	81.6 pass@1 on GSM8K, 22.7 on MATH for the 70B ^[4]

WizardCoder (Luo et al., 2306.08568, ICLR 2024) specialized the evolving prompts for programming and applied them to the StarCoder base model. Trained on roughly 78,000 evolved code instructions, its 15B model scored 57.3 pass@1 on HumanEval, reported as 22.3 points above the previous open-source state of the art and outperforming several larger closed models available at the time; a later WizardCoder-33B-V1.1 derived from DeepSeek-Coder reached 79.9 pass@1. ^[3]
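For readers unfamiliar with the metric, pass@1 is the k = 1 case of pass@k, the probability that at least one of k sampled programs passes a problem's unit tests. The sketch below uses the standard unbiased estimator introduced with HumanEval, 1 − C(n − c, k) / C(n, k) for n samples of which c pass. It illustrates the metric only; this article does not state how many samples per problem were drawn for the scores quoted above, and the function names are illustrative.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per benchmark problem."""
    results = list(results)
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Three hypothetical problems with 10 samples each: 3, 0, and 10 passing.
print(benchmark_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1))
```

For k = 1 the estimator reduces to the fraction of passing samples per problem, averaged over problems, so a score of 57.3 means roughly 57 percent of single attempts pass.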

WizardMath (Luo et al., 2308.09583) extended the idea into Reinforced Evol-Instruct, sometimes abbreviated RLEIF (Reinforcement Learning from Evol-Instruct Feedback). Built on Llama 2, it first evolves math problems both upward (harder) and downward (easier) from GSM8K and MATH, then trains two reward models: an Instruction Reward Model that scores the quality of evolved problems and a process-supervised reward model that scores each intermediate reasoning step, an idea drawn from OpenAI's process-supervision work. The combined rewards drive a proximal policy optimization phase, making WizardMath a hybrid of Evol-Instruct data synthesis and reinforcement learning. WizardMath-70B reached 81.6 pass@1 on GSM8K, reported as 24.8 points above the previous open-source state of the art, and 22.7 on the MATH benchmark. ^[4]

In April 2024 Microsoft released WizardLM-2 (a 7B model, a 70B model, and an 8x22B mixture-of-experts model based on Mixtral) trained with a more elaborate, fully synthetic data system descended from Evol-Instruct; the weights were briefly withdrawn shortly after release and then restored once an outstanding toxicity review was completed. ^[7]
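Returning to WizardMath's two reward models, the sketch below shows one possible way an instruction score and per-step process scores could be folded into a single scalar for policy optimization. Both design choices here are assumptions, since this article does not specify them: the step scores are aggregated by taking the minimum (one weak step sinks the solution), and the two rewards are combined by multiplication. Scores are assumed to lie in [0, 1], and the function names are illustrative.

```python
from typing import Sequence


def process_reward(step_scores: Sequence[float]) -> float:
    """Aggregate per-step scores; assumption: the weakest step decides."""
    if not step_scores:
        raise ValueError("a solution needs at least one scored step")
    return min(step_scores)


def combined_reward(instruction_score: float, step_scores: Sequence[float]) -> float:
    """Assumption: multiply instruction quality by the aggregated process score."""
    return instruction_score * process_reward(step_scores)


# A well-posed problem (0.9) answered with one shaky reasoning step (0.4).
print(combined_reward(0.9, [0.95, 0.4, 0.9]))
```

A multiplicative combination means a poorly formed problem or a single flawed step both suppress the reward, which is one way to discourage the policy from exploiting either signal alone.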

How well does Evol-Instruct work?

On the original WizardLM, a GPT-4-based automatic evaluation found that the 7B student reached more than 90 percent of ChatGPT's capability on 17 of 29 tested skills. ^[1] In a blind human evaluation on a complexity-balanced test set, annotators preferred WizardLM's answers to those of Alpaca and Vicuna by wide margins, and on the highest-difficulty portion of the test set (items rated at difficulty 8 or above) they preferred WizardLM to ChatGPT itself (a win rate of 42.9 percent versus 35.0 percent), even though ChatGPT remained ahead overall. ^[1] Trained on the same 70,000-example budget, WizardLM beat Vicuna by 12.4 points (41.3 versus 28.9 percent), which the authors offered as evidence that the gains came from the evolved data's complexity rather than its quantity. ^[1]
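As an illustration of how percentages like these can be tallied, the sketch below converts per-item annotator labels into win, tie, and loss rates and reproduces the 12.4-point margin from the two quoted win rates. The label vocabulary and function names are hypothetical, and the paper's actual annotation and aggregation procedure is not described in this article.

```python
from collections import Counter
from typing import Dict, Sequence

LABELS = ("win", "tie", "loss")


def preference_rates(labels: Sequence[str]) -> Dict[str, float]:
    """Percentage of test items judged win, tie, or loss for the candidate model."""
    if not labels:
        raise ValueError("no labels to tally")
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return {label: 100.0 * counts[label] / len(labels) for label in LABELS}


def margin(rate_a: float, rate_b: float) -> float:
    """Difference between two rates in percentage points, rounded to one decimal."""
    return round(rate_a - rate_b, 1)


print(preference_rates(["win", "win", "win", "tie", "loss"]))
print(margin(41.3, 28.9))  # the WizardLM versus Vicuna figures quoted above
```

Because ties are counted separately, the two win rates in a head-to-head comparison need not sum to 100 percent, which is how 42.9 and 35.0 can coexist.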

Beyond the WizardLM models themselves, Evol-Instruct had a large influence on open-source practice. The released evolved datasets (distributed on the Hugging Face Hub under names such as the 70K and 196K WizardLM_evol_instruct sets) were reused to fine-tune many community models, and the broader "evolve a seed set with a teacher" pattern became a standard tool for data augmentation in instruction tuning, appearing as a component in numerous later synthetic-data pipelines. ^[7]

A direct methodological follow-up, "Automatic Instruction Evolving for Large Language Models" (Zeng et al., 2406.00770, submitted June 2, 2024), introduced Auto Evol-Instruct. It removes the hand-designed evolving rules of the original method: instead of the five fixed In-Depth operations, an optimizer model analyzes a sample of the data, drafts a generic evolving method, then iteratively inspects failures in the evolution trajectories and rewrites the method to fix them, with no human-specified operations at all. The authors reported that this fully automatic version surpassed the original human-designed Evol-Instruct across MT-Bench, AlpacaEval, GSM8K, and HumanEval. ^[5]
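The control flow described above can be sketched as a short refinement loop. This is an illustrative outline only, not the Auto Evol-Instruct implementation: `evolve_with`, `find_failures`, and `revise_method` are hypothetical callables standing in for calls to the evolving model and the optimizer model, and the stopping rule (stop when no failures are found or after `max_steps`) is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

Trajectory = Tuple[str, str]  # (original instruction, evolved instruction)


def auto_evolve_method(
    initial_method: str,
    sample: Sequence[str],
    evolve_with: Callable[[str, str], str],
    find_failures: Callable[[List[Trajectory]], List[Trajectory]],
    revise_method: Callable[[str, List[Trajectory]], str],
    max_steps: int = 3,
) -> Tuple[str, List[str]]:
    """Iteratively rewrite the evolving method until no failures are observed."""
    method = initial_method
    history = [method]
    for _ in range(max_steps):
        # Apply the current method to a small sample and record the trajectories.
        trajectories = [(inst, evolve_with(method, inst)) for inst in sample]
        failures = find_failures(trajectories)
        if not failures:
            break
        # Let the optimizer rewrite the method in light of the observed failures.
        method = revise_method(method, failures)
        history.append(method)
    return method, history


# Stubs: method "v0" leaves instructions unchanged, which counts as a failure.
final_method, versions = auto_evolve_method(
    "v0",
    ["Sort a list.", "Define entropy."],
    evolve_with=lambda method, inst: inst if method == "v0" else inst + " Justify each step.",
    find_failures=lambda trajs: [t for t in trajs if t[0] == t[1]],
    revise_method=lambda method, failures: "v1",
)
print(final_method, versions)
```

The point of the structure is that the evolving rules become an output of the loop rather than an input, replacing the fixed list of hand-written operations.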

What are the limitations of Evol-Instruct?

Evol-Instruct shares the weaknesses of any teacher-distillation pipeline, plus some specific to instruction mutation. ^[1]^[5]

Dependence on the teacher. The student can only learn what the teacher already knows, so the teacher's accuracy, biases, and refusal patterns set a ceiling on quality and are copied into the student. Distilling from a commercial model such as ChatGPT also raises terms-of-service questions when the result is used to build competing systems.
Error and noise propagation. Because both the evolved instruction and its target response come from the teacher, factual mistakes and flawed reasoning enter the training set directly. The Elimination Evolving filter is a set of shallow heuristics (length checks, keyword checks) that catches obvious failures but cannot detect a confident, well-formed, wrong answer.
Difficulty explosion and semantic drift. Repeated evolving can make instructions unreasonable, internally contradictory, or impossible to answer, which is exactly why the prompts cap how much text each step may add; even so, meaning can drift away from anything useful over several rounds.
Diversity bounded by the seed. In-Breadth Evolving widens coverage, but the reachable space is still anchored to the seed distribution, so domains absent from the seeds tend to stay underrepresented.
Verification gap. Unlike methods that filter on a checkable reward, general Evol-Instruct has no notion of a correct answer for most prompts, so it cannot guarantee the responses it trains on are right. WizardMath's reward-model stage and Auto Evol-Instruct's failure-driven refinement are partly responses to this limitation.
Evaluation caveats. Several of the headline numbers rely on GPT-4 or ChatGPT acting as an automatic judge, which can favor outputs stylistically similar to its own and, as the authors themselves noted, does not always agree with human preference on the hardest skills.

References

1. Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, Daxin Jiang. "WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions." arXiv:2304.12244 (April 24, 2023); ICLR 2024. https://arxiv.org/abs/2304.12244
2. Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi. "Self-Instruct: Aligning Language Models with Self-Generated Instructions." arXiv:2212.10560 (2022); ACL 2023. https://arxiv.org/abs/2212.10560
3. Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, Daxin Jiang. "WizardCoder: Empowering Code Large Language Models with Evol-Instruct." arXiv:2306.08568 (June 2023); ICLR 2024. https://arxiv.org/abs/2306.08568
4. Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, Dongmei Zhang. "WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct." arXiv:2308.09583 (August 21, 2023). https://arxiv.org/abs/2308.09583
5. Weihao Zeng, Can Xu, Yingxiu Zhao, Jian-Guang Lou, Weizhu Chen. "Automatic Instruction Evolving for Large Language Models" (Auto Evol-Instruct). arXiv:2406.00770 (June 2, 2024). https://arxiv.org/abs/2406.00770
6. Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, Tatsunori B. Hashimoto. "Stanford Alpaca: An Instruction-following LLaMA Model." Stanford CRFM, 2023. https://github.com/tatsu-lab/stanford_alpaca
7. nlpxucan/WizardLM. "WizardLM: Family of instruction-following LLMs powered by Evol-Instruct (WizardLM, WizardCoder, WizardMath)." GitHub repository, including released evol_instruct datasets and the April 2024 WizardLM-2 announcement. https://github.com/nlpxucan/WizardLM
