WizardLM
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,003 words
WizardLM is a family of open-weights, instruction-tuned large language models, initially derived from LLaMA, and an associated data-synthesis methodology, both produced by a research group at Microsoft led by Can Xu. The defining contribution is Evol-Instruct, an algorithm that uses a strong teacher language model to iteratively rewrite an existing seed instruction dataset into more complex and more diverse instructions through a small fixed set of "in-depth" and "in-breadth" mutation operators.[^1] Fine-tuning open base models on the resulting Evol-Instruct corpora produced WizardLM (general chat), WizardCoder (code generation), and WizardMath (mathematical reasoning), each of which posted state-of-the-art open-source results on contemporaneous evaluations such as MT-Bench, AlpacaEval, HumanEval, and GSM8K when released between April 2023 and the end of 2023.[^1][^2][^3] In April 2024 the same group released WizardLM-2 in three sizes (7B, 70B, 8x22B); the release was withdrawn within roughly a day, and the team disclosed that mandatory toxicity testing had been skipped, although weights that had been mirrored elsewhere remained accessible.[^4][^5] In May 2025, core members of the WizardLM team left Microsoft for the Hunyuan group at Tencent.[^6]
The original WizardLM paper, "WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions," was posted to arXiv on 24 April 2023 by Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang.[^1] Eight of the nine authors were affiliated with Microsoft (predominantly Microsoft's STCA/Beijing organization), with Jiazhan Feng listed as a Peking University collaborator.[^1] The paper was subsequently accepted at ICLR 2024.[^7] Can Xu, then a senior researcher at Microsoft, is the public face of the group, and the team's GitHub organization (nlpxucan) hosts the code, datasets, and project pages for all WizardLM, WizardCoder, and WizardMath releases.[^8]
The group's research arc through 2023 followed a single repeated pattern: take a publicly available base model (LLaMA, then StarCoder, then Llama-2), construct an Evol-Instruct dataset tailored to the target domain, fine-tune the base model, and publish the resulting checkpoints under names of the form "Wizard*". The same engineering substrate was reused for general dialogue (WizardLM), code (WizardCoder), and math (WizardMath), and the resulting models repeatedly held the top open-source position on their respective benchmarks at release time.[^1][^2][^3]
Evol-Instruct is the central technical contribution and the source of the project's name. It addresses a specific bottleneck in instruction tuning: human-authored or self-generated instruction datasets (for example, the 52K synthetic instructions used by Stanford Alpaca) tend to skew toward simple, surface-level requests and rarely include the long, constraint-heavy queries that exercise reasoning ability.[^1] Rather than ask human annotators to write harder prompts, Evol-Instruct asks a strong language model to rewrite easy prompts into harder ones, then loops.
The algorithm starts from a seed instruction dataset; in the original paper this was the 52K-instance Alpaca dataset.[^1] At each evolution epoch, every instruction in the working set is rewritten by querying ChatGPT with one of six fixed "evolution" prompts chosen with equal probability; the rewritten instruction is then sent back to ChatGPT to generate a fresh response, and the (instruction, response) pair is added to a candidate pool.[^1] An elimination step discards evolved instructions that (a) the rewriter judged unanswerable, (b) closely paraphrase the parent, or (c) collapse to an empty or trivial answer.[^1] After four evolution epochs the working set grew from 52K to roughly 250K (instruction, response) pairs; for fair comparison with Vicuna, a 70K subset was sampled for fine-tuning the released WizardLM-7B.[^1]
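The loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the teacher model is abstracted into two caller-supplied functions (`rewrite` and `respond`, both hypothetical names), the elimination checks are crude string heuristics standing in for the paper's rules, and carrying a parent forward unchanged when its evolution fails is an assumption.

```python
import random
from typing import Callable, List, Optional, Tuple

Pair = Tuple[str, str]  # (instruction, response)


def is_eliminated(parent: str, child: str, response: str) -> bool:
    """Heuristic stand-ins for the three elimination rules (assumed, not the paper's exact checks)."""
    unanswerable = child.strip() == "" or "cannot be answered" in child.lower()
    near_copy = child.strip().lower() == parent.strip().lower()
    trivial = len(response.split()) < 3
    return unanswerable or near_copy or trivial


def evolve_dataset(
    seed: List[str],
    operators: List[str],
    rewrite: Callable[[str, str], str],  # (operator, instruction) -> evolved instruction
    respond: Callable[[str], str],       # instruction -> teacher response
    epochs: int = 4,
    rng: Optional[random.Random] = None,
) -> List[Pair]:
    """Run `epochs` rounds of evolution and return the pool of surviving pairs."""
    rng = rng or random.Random(0)
    working = list(seed)
    pool: List[Pair] = []
    for _ in range(epochs):
        next_working = []
        for parent in working:
            op = rng.choice(operators)   # each operator is equally likely
            child = rewrite(op, parent)
            answer = respond(child)
            if is_eliminated(parent, child, answer):
                next_working.append(parent)  # assumption: retry the parent next round
            else:
                pool.append((child, answer))
                next_working.append(child)
        working = next_working
    return pool


# Toy stand-ins for the teacher model, for demonstration only.
def toy_rewrite(op: str, instruction: str) -> str:
    return f"{instruction} [{op}]"


def toy_respond(instruction: str) -> str:
    return f"Stub answer to: {instruction}"


pool = evolve_dataset(["Sort a list."], ["add_constraints", "in_breadth"],
                      toy_rewrite, toy_respond, epochs=2)
```

In a real run, `rewrite` and `respond` would each wrap a call to the teacher model, and the returned pool would be merged with the seed data before sampling a fine-tuning subset.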
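Five of the six rewriting prompts are "in-depth" operators, each designed to make a prompt harder without changing its topic: adding constraints, deepening, concretizing, increasing reasoning steps, and complicating the input.[^1]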
The sixth operator is the "in-breadth" mutation. Instead of rewriting the prompt, it asks the teacher model to produce a new prompt from the same domain that is rarer but of comparable length and difficulty.[^1] In-breadth evolution exists to combat topical collapse: applying only the five in-depth operators tends to push every prompt toward longer, more constrained variants of the same handful of starting topics, while in-breadth mutation broadens the topic distribution.
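One way to picture the six operators is as a table of prompt templates from which one is drawn uniformly at random. The sketch below assumes operator identifiers that follow the paper's terminology; the template wording is invented for illustration and is not the paper's actual prompt text.

```python
import random

# Five in-depth operators plus one in-breadth operator (identifiers are illustrative).
IN_DEPTH = [
    "add_constraints",
    "deepening",
    "concretizing",
    "increase_reasoning_steps",
    "complicate_input",
]
IN_BREADTH = "in_breadth"
OPERATORS = IN_DEPTH + [IN_BREADTH]

# Hypothetical template wording; the real prompts are longer and more specific.
TEMPLATES = {
    "add_constraints": "Rewrite the prompt below so it adds one more constraint or requirement.\n\n{instruction}",
    "deepening": "Rewrite the prompt below so it probes the same topic in greater depth.\n\n{instruction}",
    "concretizing": "Rewrite the prompt below, replacing general concepts with more specific ones.\n\n{instruction}",
    "increase_reasoning_steps": "Rewrite the prompt below so that answering it requires several explicit reasoning steps.\n\n{instruction}",
    "complicate_input": "Rewrite the prompt below so its input includes more complex data, such as a table or code.\n\n{instruction}",
    "in_breadth": "Write a brand-new prompt in the same domain as the one below. It should be rarer but of similar length and difficulty.\n\n{instruction}",
}


def build_evolution_prompt(instruction: str, rng: random.Random):
    """Pick one operator with equal probability and fill in its template."""
    op = rng.choice(OPERATORS)
    return op, TEMPLATES[op].format(instruction=instruction)


op, prompt = build_evolution_prompt("Name three prime numbers.", random.Random(0))
```

With uniform sampling, the in-breadth operator fires on about one rewrite in six, so most of each epoch's effort goes into deepening existing prompts while a steady minority of rewrites introduces new topics.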
The paper's central empirical claim is that Evol-Instruct shifts the difficulty distribution of the instruction corpus to the right. Using a GPT-4-as-judge difficulty score, the authors show that Alpaca instructions concentrate at low complexity, ShareGPT/Vicuna instructions span a wider but still mid-heavy range, and Evol-Instruct's four-epoch output has substantial mass at the high-difficulty tail.[^1] When LLaMA-7B is fine-tuned on the Evol-Instruct 70K subset, blind pairwise human comparison on a 218-prompt 29-skill test bed shows that annotators preferred WizardLM-7B over ChatGPT outputs on the highest-difficulty bucket (difficulty score >= 8), while ChatGPT retained an advantage on easy prompts.[^1] GPT-4-based automatic scoring on the same set placed WizardLM-7B at roughly 90 percent of ChatGPT's quality on 17 of 29 skill categories.[^1]
The first released checkpoint, WizardLM-7B-V1.0, used LLaMA-7B as the base model and was trained for three epochs on the 70K Evol-Instruct subset using eight V100 GPUs with DeepSpeed ZeRO-3, a learning rate of 2 × 10⁻⁵, batch size 8, and a maximum context length of 2048 tokens; total wall time was approximately 70 hours.[^1] The original model card and weights were released under the LLaMA research license, with the Evol-Instruct dataset itself published separately on Hugging Face (WizardLM_evol_instruct_70k).[^8]
Two Evol-Instruct corpora were released publicly through the team's Hugging Face organization, WizardLMTeam:[^8]
| Dataset | Size | Seed | Notes |
|---|---|---|---|
| WizardLM_evol_instruct_70k | 70K (instruction, response) pairs | Stanford Alpaca 52K | Subset used to train WizardLM-7B-V1.0 |
| WizardLM_evol_instruct_V2_196k | 196K pairs | Mixed (Alpaca + ShareGPT seeds) | Used for V1.1/V1.2 chat models |
Code Alpaca's 20K instruction set was the seed for the WizardCoder evolution run, which produced roughly 78K (code prompt, code response) pairs after three rounds of evolution; that corpus was not released as a standalone dataset but underlies the WizardCoder checkpoints.[^2] The WizardMath release similarly built a math-specific Evol-Instruct corpus, plus a process-supervision corpus for reward modelling, neither of which was released as a separate file.[^3]
After the initial WizardLM-7B paper, the group released a sequence of larger and revised chat checkpoints. The Hugging Face WizardLMTeam organization lists ten models in the Wizard line (excluding WizardLM-2, which was uploaded then removed from Microsoft channels), of which four are general-purpose chat models.[^8]
| Model | Base model | Release | Reported MT-Bench | Reported AlpacaEval win rate |
|---|---|---|---|---|
| WizardLM-7B-V1.0 | LLaMA 7B | April 2023 | not released with paper | not released with paper |
| WizardLM-13B-V1.0 | LLaMA 13B | Sep 2023 (HF update) | not centrally reported | not centrally reported |
| WizardLM-13B-V1.1 | LLaMA 13B | Jul 2023 | comparable to Llama-2-13B-Chat (~6.77 reported by the team) | not centrally reported |
| WizardLM-13B-V1.2 | Llama 2 13B | Sep 2023 | 7.06 | 89.17% |
| WizardLM-30B-V1.0 | LLaMA 30B | June 2023 | not centrally reported | not centrally reported |
| WizardLM-70B-V1.0 | Llama 2 70B | Aug 2023 | not centrally reported | not centrally reported |
WizardLM-13B-V1.2's MT-Bench score of 7.06 and AlpacaEval win rate of 89.17 percent are taken from the model card on Hugging Face; the same card lists a HumanEval pass@1 of 36.6 percent despite the model not being specifically code-tuned, illustrating positive transfer from the broader Evol-Instruct distribution.[^9] WizardLM-30B-V1.0 was widely cited on the AlpacaEval and Hugging Face Open LLM leaderboards in mid-2023 as the strongest open chat model under 65 billion parameters at the time, before being eclipsed in autumn 2023 by Llama-2-based fine-tunes (including WizardLM's own V1.2 and 70B checkpoints).[^8] The chat models inherit their licensing constraint from their base: LLaMA-1 derivatives (V1.0, V1.1, 30B) are governed by Meta's original non-commercial LLaMA research license, while WizardLM-13B-V1.2 and WizardLM-70B-V1.0 inherit the Llama 2 Community License.[^9]
The chat models adopted the Vicuna prompt template (a USER: / ASSISTANT: turn-based wrapper terminated by </s>), reflecting Vicuna's prior establishment as the de facto reference open chat model.[^9]
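A small helper shows how such a prompt can be assembled. This is a sketch under assumptions: the system preamble is the commonly used Vicuna v1.1 wording, which this article does not quote, and `build_prompt` is a hypothetical function name rather than part of any released WizardLM code.

```python
from typing import List, Optional, Tuple

# Assumed system preamble (Vicuna v1.1 style); check the model card before relying on it.
SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_prompt(turns: List[Tuple[str, Optional[str]]], system: str = SYSTEM) -> str:
    """Format (user, assistant) turns; pass None as the last assistant reply to request generation."""
    prompt = system + " "
    for user, assistant in turns:
        prompt += f"USER: {user} ASSISTANT:"
        if assistant is not None:
            # Completed assistant turns are closed with the end-of-sequence marker.
            prompt += f" {assistant}</s>"
    return prompt


prompt = build_prompt([("Hi", "Hello."), ("Who are you?", None)])
```

Leaving the final assistant slot empty ends the string at "ASSISTANT:", which is the point from which the model is expected to continue.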
"WizardCoder: Empowering Code Large Language Models with Evol-Instruct," arXiv:2306.08568, was posted on 14 June 2023 by Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang.[^2] The paper applies Evol-Instruct to code instructions, using Code Alpaca's 20K examples as the seed and StarCoder-15B as the base model.[^2]
Code-specific evolution reuses the five in-depth operators of Evol-Instruct with prompt wording tailored to code (for example, "Add Constraints" becomes a directive to add a runtime, memory, or API restriction, and "Increase Reasoning Steps" instructs the rewriter to require explicit algorithmic justification before code).[^2] Three rounds of evolution produced approximately 78K (instruction, code-response) pairs. WizardCoder-15B was fine-tuned from StarCoder-15B on this dataset with batch size 512 for 200 fine-tuning steps.[^2]
The paper's headline result, on the HumanEval pass@1 metric, was 57.3 percent for WizardCoder-15B versus 35.0 percent for the StarCoder-15B base, a 22.3-point improvement; on MBPP pass@1 the model reported 51.8 percent versus 43.6 percent for StarCoder.[^2] The paper explicitly compared against contemporary proprietary code models, reporting WizardCoder-15B above Anthropic's Claude-Plus (53.0 percent HumanEval pass@1) and Google Bard (44.5 percent), while remaining below GPT-4's 67.0 percent on the same benchmark.[^2]
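For readers unfamiliar with the metric, pass@k is usually computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c samples that pass the unit tests, and estimate the probability that at least one of k randomly chosen samples is correct. The sketch below implements that general estimator; the sampling settings behind the figures quoted above are not stated in this article, and the function names are illustrative.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k)."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per benchmark problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)


# Hypothetical counts for three problems with 10 samples each.
score = benchmark_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1)
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass, so a benchmark-level pass@1 is the mean per-problem success rate.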
After the original 15B model, the team published a series of WizardCoder variants based on different code base models:[^8]
| Checkpoint | Base | HF release |
|---|---|---|
| WizardCoder-15B-V1.0 | StarCoder-15B | Jun 2023 |
| WizardCoder-Python-7B-V1.0 | Code Llama Python 7B | Aug 2023 |
| WizardCoder-Python-13B-V1.0 | Code Llama Python 13B | Aug 2023 |
| WizardCoder-Python-34B-V1.0 | Code Llama Python 34B | Aug 2023 |
| WizardCoder-33B-V1.1 | DeepSeek-Coder-33B-Base | Jan 2024 |
WizardCoder-33B-V1.1, trained on top of the DeepSeek-Coder-33B-Base model, was reported to reach 79.9 percent HumanEval pass@1, briefly making it the highest-scoring openly distributed code model on that benchmark when released.[^8] WizardCoder was accepted at ICLR 2024.[^7]
"WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct," arXiv:2308.09583, was posted on 18 August 2023 by Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, Yansong Tang, and Dongmei Zhang.[^3] It introduces a training pipeline called Reinforcement Learning from Evol-Instruct Feedback (RLEIF), which combines Evol-Instruct-style data synthesis with a two-headed reward model and Proximal Policy Optimization (PPO).[^3]
The first stage applies a modified Evol-Instruct to math word problems, with two directions of evolution: "upward" evolution increases difficulty (more constraints, more reasoning steps, harder numbers), and "downward" evolution rewrites instructions toward simpler variants to diversify the distribution.[^3] Roughly 96K evolved math problems were generated across eight iteration cycles.[^3]
RLEIF trains two reward models, an Instruction Reward Model (IRM) and a Process-supervised Reward Model (PRM).[^3]
In the PPO stage, the math policy is rewarded by a product of the IRM score (judging the quality of the input instruction context) and the PRM score (judging the quality of the model's own step-by-step solution), with the combined signal driving the PPO update.[^3] RLEIF is conceptually a process-supervised variant of Reinforcement Learning from Human Feedback (RLHF) in which both the supervision signal and the data are produced by AI rather than humans, placing it close to RLAIF in spirit while predating that label's broad adoption.
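The multiplicative combination can be illustrated with a toy function. This is a sketch under stated assumptions: scores are taken to lie in [0, 1], and the per-step PRM scores are collapsed to a single number by taking the minimum, which is one plausible aggregation and not something this article specifies.

```python
from typing import Sequence


def combined_reward(irm_score: float, prm_step_scores: Sequence[float]) -> float:
    """Multiply the instruction-quality score by an aggregate of per-step solution scores."""
    if not prm_step_scores:
        raise ValueError("need at least one step score")
    # Assumption: a solution is only as good as its weakest step.
    prm_score = min(prm_step_scores)
    return irm_score * prm_score


# Hypothetical scores: a good instruction (0.8) whose solution has one weak step (0.5).
reward = combined_reward(0.8, [0.9, 0.5, 1.0])
```

Because the two signals are multiplied rather than added, a low score on either one pulls the reward toward zero, so a flawless solution to a poorly formed instruction earns little credit, and vice versa.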
WizardMath was released initially as a Llama-2 fine-tune at three scales. Reported pass@1 numbers from the paper (and the accompanying project page) are:[^3][^10]
| Model | GSM8K pass@1 | MATH pass@1 |
|---|---|---|
| WizardMath-7B-V1.0 (Llama-2 7B) | 54.9% | 10.7% |
| WizardMath-13B-V1.0 (Llama-2 13B) | 63.9% | 14.0% |
| WizardMath-70B-V1.0 (Llama-2 70B) | 81.6% | 22.7% |
| ChatGPT (3.5) reference | 80.8% | 34.1% |
| Claude-2 reference | 88.0% | not reported |
| PaLM-2 540B reference | 80.7% | 34.3% |
| Minerva 540B reference | 58.8% | 33.6% |
| GPT-4 reference | 92.0% | 42.5% |
The 70B model thus narrowly exceeded ChatGPT (80.8 percent) and the much larger PaLM-2 (80.7 percent) on GSM8K grade-school math, while trailing every reference model with a reported score, GPT-4 included, on the harder MATH dataset.[^3] A later WizardMath-7B-V1.1 (December 2023), trained from Mistral 7B rather than Llama-2 7B, was reported on the model card to reach 83.2 percent GSM8K pass@1, surpassing the original 70B Llama-2-based model on that benchmark with one-tenth the parameter count.[^8] WizardMath was accepted at ICLR 2025 as an Oral.[^11]
On 15 April 2024, the WizardLM team announced WizardLM-2, presented on the project page wizardlm.github.io/WizardLM2/ as a three-model family.[^4]
| Model | Base | License (announced) | Headline claim |
|---|---|---|---|
| WizardLM-2 7B | Mistral-style 7B base | Apache 2.0 | "performance comparable with existing 10x larger" open models |
| WizardLM-2 70B | Llama 2 70B | Llama 2 Community | "top-tier reasoning" |
| WizardLM-2 8x22B | Mixtral 8x22B | Apache 2.0 | first open model with MT-Bench score over 9.00 |
The announcement attributed the gains to a synthetic data and post-training pipeline branded "AI Align AI" (AAA), which the team described as multiple state-of-the-art models co-teaching and self-teaching each other, combined with an "Evol Lab" (improved Evol-Instruct plus an Evol-Answer step), Stage-DPO progressive preference learning (DPO applied in successive curriculum stages), and the RLEIF objective inherited from WizardMath.[^4]
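Of these components, DPO has a standard published form that is easy to state. The sketch below computes the ordinary DPO loss for a single preference pair from sequence log-probabilities; how WizardLM-2 split training into stages is not described here, so nothing in the code should be read as the team's Stage-DPO recipe, and the default `beta` is an arbitrary illustrative value.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair: -log(sigmoid(margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy favours the chosen answer more than the reference does.
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls as the policy raises the chosen answer's likelihood relative to the rejected one.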
On blind pairwise human preference comparisons reported by the team, WizardLM-2 8x22B was placed slightly below GPT-4-1106-preview but above Command R+ and GPT-4-0314; WizardLM-2 70B was placed above GPT-4-0613, Mistral-Large, and Qwen1.5-72B-Chat; WizardLM-2 7B was placed at the level of Qwen1.5-32B-Chat.[^4] On the MT-Bench automated metric, WizardLM-2 8x22B reportedly became the first openly distributed model to break the 9.00 score barrier.[^12]
Within roughly a day of the announcement, the official WizardLM-2 weights and project page were taken down from Microsoft-controlled channels.[^5][^13] On 16 April 2024 the team's X (Twitter) account posted an apology stating that "we accidentally missed an item required in the model release process, toxicity testing"; the team said the omitted tests would be completed and the models re-released afterwards.[^5][^13] Reporting at the time noted that The Information had first highlighted the absence of toxicity testing, prompting the takedown.[^13] Because the weights had been mirrored to community-controlled Hugging Face repositories and to GitHub during the short public window, third-party copies of WizardLM-2 8x22B and 7B continued to circulate even after the official removal.[^14] As of the date of writing, Microsoft has not formally re-released WizardLM-2 through its own channels.
In May 2025, Can Xu and Qingfeng Sun, identified as the public leads of the WizardLM project, announced on X that the WizardLM team had left Microsoft and joined the Hunyuan organization at Tencent; their first post-move public deliverable was reported to be Hunyuan-TurboS 0416.[^6] According to the same source, neither Microsoft nor Tencent had commented officially on the transfer.[^6] The departure ended Microsoft's institutional sponsorship of the original WizardLM and Evol-Instruct effort.
License terms for the Wizard family are inherited from the underlying base models and are not uniform across the family:[^8][^9]
| Sub-family | Underlying base | License of fine-tune |
|---|---|---|
| WizardLM V1.0 / V1.1, 30B | LLaMA (v1) | LLaMA research, non-commercial |
| WizardLM-13B-V1.2, WizardLM-70B-V1.0 | Llama 2 | Llama 2 Community License |
| WizardCoder-15B | StarCoder | BigCode OpenRAIL-M (StarCoder license) |
| WizardCoder-Python | Code Llama | Llama 2 Community License |
| WizardCoder-33B-V1.1 | DeepSeek-Coder-33B | DeepSeek license (research-permissive) |
| WizardMath 7B / 13B / 70B V1.0 | Llama 2 | Llama 2 Community License |
| WizardMath-7B-V1.1 | Mistral 7B | Apache 2.0 (Mistral base) |
| WizardLM-2 7B, 8x22B | Mistral / Mixtral 8x22B | Apache 2.0 (announced) |
| WizardLM-2 70B | Llama 2 70B | Llama 2 Community (announced) |
The Evol-Instruct datasets themselves are released under CC BY-NC 4.0, with the project's code under Apache 2.0; the team's GitHub repository carries an explicit notice that the Evol-Instruct data is intended for academic use only because it was generated by querying OpenAI APIs, whose terms of service forbid using outputs to train competing products.[^8] This last clause has been the most frequently cited practical limitation of building production systems on top of Evol-Instruct corpora.
Evol-Instruct has had outsized influence on instruction-data synthesis. By 2024 the technique was widely cited as a reference method for synthetic data generation in instruction tuning alongside Stanford Alpaca's Self-Instruct.[^15] Microsoft's own follow-up "Automatic Instruction Evolving for Large Language Models" (Auto Evol-Instruct, June 2024) generalises Evol-Instruct by having a meta-LLM design the evolution prompts automatically rather than hand-writing the five in-depth operators, and reports improvements on MT-Bench, AlpacaEval, GSM8K, and HumanEval against the hand-designed baseline.[^15] Numerous third-party fine-tuning datasets and pipelines, including community Open-Hermes-style corpora and the long tail of code-specific supervised fine-tuning datasets, advertise an "Evol-Instruct" stage as part of their data pipeline.
Three broader contributions are commonly credited to the WizardLM line:
Several limitations have been documented in the literature and in community discussion:
| Approach | Source of new prompts | Source of new responses | Open Wizard parallel |
|---|---|---|---|
| Stanford Alpaca (Self-Instruct) | LLM bootstrapped from seed | Same LLM | seed of Evol-Instruct |
| Vicuna | ShareGPT user logs | ChatGPT (logged) | baseline chat comparator |
| Evol-Instruct (WizardLM) | LLM rewrite of seed prompts via fixed operators | Teacher LLM | core method |
| Auto Evol-Instruct | Meta-LLM-designed rewriting operators | Teacher LLM | follow-up by same group |
| OpenHermes / Nous-style mixes | Human-curated mixture of public sources | Various | community packaging that often includes Evol-Instruct |
| Constitutional AI | Fixed seed prompts | LLM critiqued against a principle list | parallel safety-focused synthetic-data approach |
Compared with Stanford Alpaca's Self-Instruct, Evol-Instruct's distinguishing feature is vertical rather than horizontal expansion: instead of using the teacher to brainstorm new instructions from a small seed, it uses the teacher to push existing instructions up the difficulty axis. Compared with Vicuna, which fine-tunes on real user logs from ShareGPT, Evol-Instruct trades realism for controllable difficulty and topical balance.