WizardLM

Updated Jun 7, 2026

WizardLM is a family of open-weights, instruction-tuned large language models, originally derived from LLaMA, and an associated data-synthesis methodology, both produced by a research group at Microsoft led by Can Xu. The defining contribution is Evol-Instruct, an algorithm that uses a strong teacher language model to iteratively rewrite an existing seed instruction dataset into more complex and more diverse instructions through a small fixed set of "in-depth" and "in-breadth" mutation operators.^[1] Fine-tuning open base models on the resulting Evol-Instruct corpora produced WizardLM (general chat), WizardCoder (code generation), and WizardMath (mathematical reasoning). Released between April 2023 and the end of 2023, each reported state-of-the-art results among open-source models on contemporaneous evaluations such as MT-Bench, AlpacaEval, HumanEval, and GSM8K.^[1]^[2]^[3] In April 2024 the same group released WizardLM-2 in three sizes (7B, 70B, 8x22B); the release was withdrawn within roughly a day after the team disclosed that mandatory toxicity testing had been skipped, although weights that had been mirrored elsewhere remained accessible.^[4]^[5] In May 2025, core members of the WizardLM team left Microsoft for the Hunyuan group at Tencent.^[6]

Origins and team

The original WizardLM paper, "WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions," was posted to arXiv on 24 April 2023 by Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang.^[1] Eight of the nine authors were affiliated with Microsoft (predominantly Microsoft's STCA/Beijing organization), with Jiazhan Feng listed as a Peking University collaborator.^[1] The paper was subsequently accepted at ICLR 2024.^[7] Can Xu, then a senior researcher at Microsoft, is the public face of the group, and the team's GitHub organization (nlpxucan) hosts the code, datasets, and project pages for all WizardLM, WizardCoder, and WizardMath releases.^[8]

The group's research arc through 2023 followed a single repeated pattern: take a publicly available base model (LLaMA, then StarCoder, then Llama-2), construct an Evol-Instruct dataset tailored to the target domain, fine-tune the base model, and publish the resulting checkpoints under names of the form "Wizard*". The same engineering substrate was reused for general dialogue (WizardLM), code (WizardCoder), and math (WizardMath), and the resulting models repeatedly held the top open-source position on their respective benchmarks at release time.^[1]^[2]^[3]

Evol-Instruct: the core algorithm

Evol-Instruct is the project's central technical contribution. It addresses a specific bottleneck in instruction tuning: human-authored or self-generated instruction datasets (for example, the 52K synthetic instructions used by Stanford Alpaca) tend to skew toward simple, surface-level requests and rarely include the long, constraint-heavy queries that exercise reasoning ability.^[1] Rather than ask human annotators to write harder prompts, Evol-Instruct asks a strong language model to rewrite easy prompts into harder ones and repeats the process over several rounds.

Seed data and evolution loop

The algorithm starts from a seed instruction dataset; in the original paper this was the 52K-instance Alpaca dataset.^[1] At each evolution epoch, every instruction in the working set is rewritten by querying ChatGPT with one of six fixed "evolution" prompts chosen with equal probability; the rewritten instruction is then sent back to ChatGPT to generate a fresh response, and the (instruction, response) pair is added to a candidate pool.^[1] An elimination step discards evolved instructions that (a) the rewriter judged unanswerable, (b) closely paraphrase the parent, or (c) collapse to an empty or trivial answer.^[1] After four evolution epochs the working set grew from 52K to roughly 250K (instruction, response) pairs; for fair comparison with Vicuna, a 70K subset was sampled for fine-tuning the released WizardLM-7B.^[1]
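The loop and elimination step described above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the authors' code: the teacher calls are passed in as plain functions (`rewrite` and `answer` are hypothetical stand-ins for ChatGPT queries), the word-overlap paraphrase check and the "unanswerable" marker are placeholder heuristics, and keeping a failed instruction in the working set for the next epoch is an assumption.

```python
import random
from typing import Callable, List, Optional, Tuple

Pair = Tuple[str, str]  # (instruction, response)


def is_near_copy(parent: str, child: str, threshold: float = 0.9) -> bool:
    """Crude paraphrase check (an assumption): Jaccard overlap of word sets."""
    a, b = set(parent.lower().split()), set(child.lower().split())
    if not a or not b:
        return True
    return len(a & b) / len(a | b) >= threshold


def keep(parent: str, child: str, response: str) -> bool:
    """Elimination step mirroring conditions (a)-(c) in the text."""
    if "unanswerable" in child.lower():   # (a) hypothetical marker for the rewriter's verdict
        return False
    if is_near_copy(parent, child):       # (b) closely paraphrases the parent
        return False
    if len(response.split()) < 3:         # (c) empty or trivial answer
        return False
    return True


def evolve(
    seed: List[str],
    rewrite: Callable[[str, str], str],   # (evolution prompt, instruction) -> new instruction
    answer: Callable[[str], str],         # instruction -> response
    prompts: List[str],
    epochs: int = 4,
    rng: Optional[random.Random] = None,
) -> List[Pair]:
    rng = rng or random.Random(0)
    pool: List[Pair] = []
    working = list(seed)
    for _ in range(epochs):
        next_working = []
        for parent in working:
            template = rng.choice(prompts)          # each prompt equally likely
            child = rewrite(template, parent)
            response = answer(child)
            if keep(parent, child, response):
                pool.append((child, response))
                next_working.append(child)
            else:
                next_working.append(parent)         # assumption: retry the parent next epoch
        working = next_working
    return pool
```

With a real teacher model, `rewrite` and `answer` would each wrap one API call per instruction, so four epochs over a 52K seed set implies several hundred thousand teacher queries.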

In-depth evolving operators

Five of the six rewriting prompts are "in-depth" operators, each designed to make a prompt harder without changing its topic:^[1]

Add Constraints. Inject an additional requirement, restriction, or stipulation into the prompt (for example, "and the answer must be expressed as a polynomial").
Deepening. Increase the depth or breadth of the inquiry, replacing surface questions with ones that demand more substantive analysis.
Concretizing. Replace generic concepts in the prompt with more specific instances ("an animal" becomes "a Galapagos tortoise").
Increase Reasoning Steps. Rewrite the prompt to explicitly demand multi-step reasoning, often by chaining sub-questions.
Complicating Input. Add or transform structured input formats such as XML, JSON, SQL, Python source, HTML, or shell commands.

In-breadth evolving operator

The sixth operator is the "in-breadth" mutation. Instead of rewriting the prompt, it asks the teacher model to produce a new prompt from the same domain that is rarer but of comparable length and difficulty.^[1] In-breadth evolution exists to combat topical collapse: applying only the five in-depth operators tends to push every prompt toward longer, more constrained variants of the same handful of starting topics, while in-breadth mutation broadens the topic distribution.
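As a rough illustration of how the six operators might be represented and sampled with equal probability, the sketch below stores one instruction template per operator. The template wording is paraphrased for illustration and is not the paper's exact prompt text; the `#Given Prompt#` delimiter and the function name `build_evolution_request` are likewise assumptions.

```python
import random
from typing import Dict, Tuple

# Paraphrased, hypothetical wording; not the exact templates from the paper.
IN_DEPTH: Dict[str, str] = {
    "add_constraints": "Rewrite the prompt so that it includes one more constraint or requirement.",
    "deepening": "Rewrite the prompt so that it asks for deeper analysis of the same topic.",
    "concretizing": "Rewrite the prompt, replacing general concepts with more specific ones.",
    "increase_reasoning_steps": "Rewrite the prompt so that it explicitly requires multi-step reasoning.",
    "complicating_input": "Rewrite the prompt so that it includes structured input such as JSON, SQL, or code.",
}
IN_BREADTH: Dict[str, str] = {
    "in_breadth": "Write a brand-new prompt in the same domain that is rarer "
                  "but of similar length and difficulty.",
}
OPERATORS: Dict[str, str] = {**IN_DEPTH, **IN_BREADTH}


def build_evolution_request(instruction: str, rng: random.Random) -> Tuple[str, str]:
    """Pick one of the six operators uniformly and build the teacher request."""
    name = rng.choice(sorted(OPERATORS))
    request = f"{OPERATORS[name]}\n\n#Given Prompt#:\n{instruction}"
    return name, request


if __name__ == "__main__":
    name, request = build_evolution_request("Name an animal.", random.Random(0))
    print(name)
    print(request)
```

Because five of the six entries are in-depth operators, uniform sampling sends roughly five out of six rewrites toward added difficulty and one out of six toward topical variety.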

Why the evolved data helps

The paper's central empirical claim is that Evol-Instruct shifts the difficulty distribution of the instruction corpus to the right. Using a GPT-4-as-judge difficulty score, the authors show that Alpaca instructions concentrate at low complexity, ShareGPT/Vicuna instructions span a wider but still mid-heavy range, and Evol-Instruct's four-epoch output has substantial mass at the high-difficulty tail.^[1] When LLaMA-7B is fine-tuned on the Evol-Instruct 70K subset, blind pairwise human comparison on a 218-prompt 29-skill test bed shows that annotators preferred WizardLM-7B over ChatGPT outputs on the highest-difficulty bucket (difficulty score >= 8), while ChatGPT retained an advantage on easy prompts.^[1] GPT-4-based automatic scoring on the same set placed WizardLM-7B at roughly 90 percent of ChatGPT's quality on 17 of 29 skill categories.^[1]

Training configuration of the original WizardLM-7B

The first released checkpoint, WizardLM-7B-V1.0, used LLaMA-7B as the base model and was trained for three epochs on the 70K Evol-Instruct subset using eight V100 GPUs with DeepSpeed ZeRO-3, learning rate 2 x 10^-5, batch size 8, and a maximum context length of 2048 tokens; total wall time was approximately 70 hours.^[1] The original model card and weights were released under the LLaMA research license, with the Evol-Instruct dataset itself published separately on Hugging Face (WizardLM_evol_instruct_70k).^[8]
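For orientation, the reported hyperparameters can be collected in a small configuration object. This is a sketch, not the team's training script: the class and field names are hypothetical, and the text does not say whether "batch size 8" is global or per GPU, so the helper exposes both readings and ignores gradient accumulation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class WizardLM7BConfig:
    """Values as reported in the text; field names are illustrative."""
    base_model: str = "LLaMA-7B"
    num_examples: int = 70_000
    epochs: int = 3
    num_gpus: int = 8
    learning_rate: float = 2e-5
    batch_size: int = 8
    max_seq_len: int = 2048
    zero_stage: int = 3


def optimizer_steps(cfg: WizardLM7BConfig, per_gpu: bool) -> int:
    """Total optimizer steps, treating batch_size as per-GPU or global (assumption)."""
    global_batch = cfg.batch_size * cfg.num_gpus if per_gpu else cfg.batch_size
    return math.ceil(cfg.num_examples / global_batch) * cfg.epochs


if __name__ == "__main__":
    cfg = WizardLM7BConfig()
    print(optimizer_steps(cfg, per_gpu=False))  # batch size 8 read as global
    print(optimizer_steps(cfg, per_gpu=True))   # batch size 8 read as per GPU
```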

Published Evol-Instruct datasets

Two Evol-Instruct corpora were released publicly through the team's Hugging Face organization, WizardLMTeam:^[8]

Dataset	Size	Seed	Notes
WizardLM_evol_instruct_70k	70K (instruction, response) pairs	Stanford Alpaca 52K	Subset used to train WizardLM-7B-V1.0
WizardLM_evol_instruct_V2_196k	196K pairs	Mixed (Alpaca + ShareGPT seeds)	Used for V1.1/V1.2 chat models

Code Alpaca's 20K instruction set was the seed for the WizardCoder evolution run, which produced roughly 78K (code prompt, code response) pairs after three rounds of evolution; that corpus was not released as a standalone dataset but underlies the WizardCoder checkpoints.^[2] The WizardMath release similarly built a math-specific Evol-Instruct corpus, plus a process-supervision corpus for reward modelling, neither of which was released as a separate file.^[3]

WizardLM chat models

After the initial WizardLM-7B paper, the group released a sequence of larger and revised chat checkpoints. The Hugging Face WizardLMTeam organization lists ten models in the Wizard line (excluding WizardLM-2, which was uploaded then removed from Microsoft channels), of which four are general-purpose chat models.^[8]

Model	Base model	Release	Reported MT-Bench	Reported AlpacaEval win rate
WizardLM-7B-V1.0	LLaMA 7B	Apr 2023	not released with paper	not released with paper
WizardLM-13B-V1.0	LLaMA 13B	Sep 2023 (HF update)	not centrally reported	not centrally reported
WizardLM-13B-V1.1	LLaMA 13B	Jul 2023	comparable to Llama-2-13B-Chat (~6.77 reported by the team)	not centrally reported
WizardLM-13B-V1.2	Llama 2 13B	Sep 2023	7.06	89.17%
WizardLM-30B-V1.0	LLaMA 30B	Jun 2023	not centrally reported	not centrally reported
WizardLM-70B-V1.0	Llama 2 70B	Aug 2023	not centrally reported	not centrally reported

WizardLM-13B-V1.2's MT-Bench score of 7.06 and AlpacaEval win rate of 89.17 percent are taken from the model card on Hugging Face; the same card lists a HumanEval pass@1 of 36.6 percent despite the model not being specifically code-tuned, illustrating positive transfer from the broader Evol-Instruct distribution.^[9] WizardLM-30B-V1.0 was widely cited on the AlpacaEval and Hugging Face Open LLM leaderboards in mid-2023 as the strongest open chat model under 65 billion parameters at the time, before being eclipsed in autumn 2023 by Llama-2-based fine-tunes (including WizardLM's own V1.2 and 70B checkpoints).^[8] The chat models inherit their licensing constraint from their base: LLaMA-1 derivatives (V1.0, V1.1, 30B) are governed by Meta's original non-commercial LLaMA research license, while WizardLM-13B-V1.2 and WizardLM-70B-V1.0 inherit the Llama 2 Community License.^[9]

The chat models adopted the Vicuna prompt template (a USER: / ASSISTANT: turn-based wrapper in which each assistant turn is terminated by </s>), reflecting Vicuna's status at the time as the de facto reference open chat model.^[9]
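A minimal sketch of building a prompt in this style is shown below. The function name `build_prompt` is hypothetical, the default system preamble is the commonly used Vicuna v1.1 wording and is an assumption here, and the exact whitespace around the separators should be checked against the relevant model card before use.

```python
from typing import List, Optional, Tuple

# Assumption: the widely used Vicuna v1.1 system preamble.
DEFAULT_SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_prompt(
    turns: List[Tuple[str, Optional[str]]],
    system: str = DEFAULT_SYSTEM,
    eos: str = "</s>",
) -> str:
    """Render (user, assistant) turns; pass None as the assistant reply to leave it open."""
    prompt = f"{system} " if system else ""
    for user_msg, assistant_msg in turns:
        prompt += f"USER: {user_msg} ASSISTANT:"
        if assistant_msg is not None:
            prompt += f" {assistant_msg}{eos}"   # completed assistant turns end with the EOS token
    return prompt


if __name__ == "__main__":
    print(build_prompt([("Hi", "Hello."), ("Who are you?", None)]))
```

Leaving the final assistant reply as `None` produces a prompt that ends in `ASSISTANT:`, which is the point at which the model is expected to continue.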

WizardCoder

"WizardCoder: Empowering Code Large Language Models with Evol-Instruct," arXiv:2306.08568, was posted on 14 June 2023 by Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang.^[2] The paper applies Evol-Instruct to code instructions, using Code Alpaca's 20K examples as the seed and StarCoder-15B as the base model.^[2]

Method

Code-specific evolution reuses the five in-depth operators of Evol-Instruct with prompt wording tailored to code (for example, "Add Constraints" becomes a directive to add a runtime, memory, or API restriction, and "Increase Reasoning Steps" instructs the rewriter to require explicit algorithmic justification before code).^[2] Three rounds of evolution produced approximately 78K (instruction, code-response) pairs. WizardCoder-15B was fine-tuned from StarCoder-15B on this dataset with batch size 512 for 200 fine-tuning steps.^[2]
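The reported batch size and step count imply how many passes over the evolved corpus the fine-tune made. The short calculation below works this out, assuming the batch size of 512 counts examples (not tokens) and taking the approximate 78K figure at face value.

```python
batch_size = 512        # examples per optimizer step, as reported
steps = 200             # fine-tuning steps, as reported
dataset_size = 78_000   # approximate number of evolved pairs

examples_seen = batch_size * steps
epochs_equivalent = examples_seen / dataset_size

print(f"examples seen: {examples_seen}")
print(f"approximate passes over the data: {epochs_equivalent:.2f}")
```

Under these assumptions the run covers 102,400 examples, or a little over 1.3 passes through the 78K-pair corpus.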

Reported benchmarks

The paper's headline result, on the HumanEval pass@1 metric, was 57.3 percent for WizardCoder-15B versus 35.0 percent for the StarCoder-15B base, a 22.3-point improvement; on MBPP pass@1 the model reported 51.8 percent versus 43.6 percent for StarCoder.^[2] The paper explicitly compared against contemporary proprietary code models, reporting WizardCoder-15B above Anthropic's Claude-Plus (53.0 percent HumanEval pass@1) and Google Bard (44.5 percent), while remaining below GPT-4's 67.0 percent on the same benchmark.^[2]
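For readers unfamiliar with the metric, pass@k is usually computed with the unbiased estimator introduced alongside HumanEval: for a problem with n sampled completions of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The sketch below implements that standard estimator; whether the figures above were obtained with sampling or with a single greedy completion per problem is not stated here, and the helper names are illustrative.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems given (n, c) per problem."""
    results = list(results)
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Two toy problems with one completion each: one solved, one not.
    print(benchmark_pass_at_k([(1, 1), (1, 0)], k=1))
```

With one completion per problem, pass@1 reduces to the plain fraction of problems solved.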

Subsequent code checkpoints

After the original 15B model, the team published a series of WizardCoder variants based on different code base models:^[8]

Checkpoint	Base	HF release
WizardCoder-15B-V1.0	StarCoder-15B	Jun 2023
WizardCoder-Python-7B-V1.0	Code Llama Python 7B	Aug 2023
WizardCoder-Python-13B-V1.0	Code Llama Python 13B	Aug 2023
WizardCoder-Python-34B-V1.0	Code Llama Python 34B	Aug 2023
WizardCoder-33B-V1.1	DeepSeek-Coder-33B-Base	Jan 2024

WizardCoder-33B-V1.1, trained on top of the DeepSeek-Coder-33B-Base model, was reported to reach 79.9 percent HumanEval pass@1, briefly making it the highest-scoring openly distributed code model on that benchmark when released.^[8] WizardCoder was accepted at ICLR 2024.^[7]

WizardMath

"WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct," arXiv:2308.09583, was posted on 18 August 2023 by Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, Yansong Tang, and Dongmei Zhang.^[3] It introduces a training pipeline called Reinforcement Learning from Evol-Instruct Feedback (RLEIF), which combines Evol-Instruct-style data synthesis with a two-headed reward model and Proximal Policy Optimization (PPO).^[3]

Math-specific Evol-Instruct

The first stage applies a modified Evol-Instruct to math word problems, with two directions of evolution: "upward" evolution increases difficulty (more constraints, more reasoning steps, harder numbers), and "downward" evolution rewrites instructions toward simpler variants to diversify the distribution.^[3] Roughly 96K evolved math problems were generated across eight iteration cycles.^[3]

IRM and PRM reward models

RLEIF trains two reward models:^[3]

The Instruction Reward Model (IRM) scores evolved instructions along three axes (Definition, Precision, Integrity), using ChatGPT-generated rankings of evolved prompts as training labels.
The Process Reward Model (PRM) evaluates the correctness of each step in a generated chain-of-thought solution rather than only the final answer; step labels are obtained from ChatGPT feedback.

Reinforcement learning

In the PPO stage, the math policy is rewarded by a product of the IRM score (judging the quality of the input instruction context) and the PRM score (judging the quality of the model's own step-by-step solution), with the combined signal driving the PPO update.^[3] RLEIF is conceptually a process-supervised variant of Reinforcement Learning from Human Feedback (RLHF) in which both the supervision signal and the data are produced by AI rather than humans, placing it close to RLAIF in spirit while predating that label's broad adoption.
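The multiplicative combination of the two reward signals can be illustrated with a few lines of code. This is a sketch only: the function name is hypothetical, scores are assumed to lie in [0, 1], and the way per-step PRM scores are collapsed into one number (the minimum by default, with the product as an alternative) is an assumption, not a detail taken from the text.

```python
from math import prod
from typing import Callable, Sequence


def combined_reward(
    irm_score: float,
    prm_step_scores: Sequence[float],
    aggregate: Callable[[Sequence[float]], float] = min,
) -> float:
    """Multiply the instruction reward by an aggregate of the per-step process rewards."""
    if not prm_step_scores:
        raise ValueError("at least one step score is required")
    answer_reward = aggregate(prm_step_scores)  # assumption: weakest step dominates by default
    return irm_score * answer_reward


if __name__ == "__main__":
    steps = [0.9, 0.5, 1.0]
    print(combined_reward(0.8, steps))                   # minimum-step aggregation
    print(combined_reward(0.8, steps, aggregate=prod))   # product-of-steps aggregation
```

A multiplicative form means a high-quality solution to a poorly formed instruction, or a flawed solution to a well-formed one, both receive a low reward.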

Reported benchmark results

WizardMath was released initially as a Llama-2 fine-tune at three scales. Reported pass@1 numbers from the paper (and the accompanying project page) are:^[3]^[10]

Model	GSM8K pass@1	MATH pass@1
WizardMath-7B-V1.0 (Llama-2 7B)	54.9%	10.7%
WizardMath-13B-V1.0 (Llama-2 13B)	63.9%	14.0%
WizardMath-70B-V1.0 (Llama-2 70B)	81.6%	22.7%
ChatGPT (3.5) reference	80.8%	34.1%
Claude-2 reference	88.0%	not reported
PaLM-2 540B reference	80.7%	34.3%
Minerva 540B reference	58.8%	33.6%
GPT-4 reference	92.0%	42.5%

The 70B model thus slightly exceeded ChatGPT (81.6 versus 80.8 percent) and the much larger PaLM-2 (80.7 percent) on GSM8K grade-school math, while remaining well below ChatGPT, PaLM-2, and GPT-4 on the harder MATH dataset (22.7 percent versus 34.1, 34.3, and 42.5 percent respectively).^[3] A later WizardMath-7B-V1.1 (December 2023), trained from Mistral 7B rather than Llama-2 7B, was reported on the model card to reach 83.2 percent GSM8K pass@1, surpassing the original 70B Llama-2-based model on that benchmark with one-tenth the parameter count.^[8] WizardMath was accepted at ICLR 2025 as an Oral.^[11]

WizardLM-2: release and withdrawal

On 15 April 2024, the WizardLM team announced WizardLM-2, presented on the project page wizardlm.github.io/WizardLM2/ as a three-model family.^[4]

Announced configuration

Model	Base	License (announced)	Headline claim
WizardLM-2 7B	Mistral-style 7B base	Apache 2.0	"performance comparable with existing 10x larger" open models
WizardLM-2 70B	Llama 2 70B	Llama 2 Community	"top-tier reasoning"
WizardLM-2 8x22B	Mixtral 8x22B	Apache 2.0	first open model with MT-Bench score over 9.00

The announcement attributed the gains to a synthetic data and post-training pipeline branded "AI Align AI" (AAA), which the team described as multiple state-of-the-art models co-teaching and self-teaching each other, combined with an "Evol Lab" (improved Evol-Instruct plus an Evol-Answer step), Stage-DPO progressive preference learning (DPO applied in successive curriculum stages), and the RLEIF objective inherited from WizardMath.^[4]

Performance claims

On blind pairwise human preference comparisons reported by the team, WizardLM-2 8x22B was placed slightly below GPT-4-1106-preview but above Command R+ and GPT-4-0314; WizardLM-2 70B was placed above GPT-4-0613, Mistral-Large, and Qwen1.5-72B-Chat; WizardLM-2 7B was placed at the level of Qwen1.5-32B-Chat.^[4] On the MT-Bench automated metric, WizardLM-2 8x22B reportedly became the first openly distributed model to break the 9.00 score barrier.^[12]

Withdrawal

Within roughly a day of the announcement, the official WizardLM-2 weights and project page were taken down from Microsoft-controlled channels.^[5]^[13] On 16 April 2024 the team's X (Twitter) account posted an apology stating that "we accidentally missed an item required in the model release process, toxicity testing"; the team said the omitted tests would be completed and the models re-released afterwards.^[5]^[13] Reporting at the time noted that The Information had first highlighted the absence of toxicity testing, prompting the takedown.^[13] Because the weights had been mirrored to community-controlled Hugging Face repositories and to GitHub during the short public window, third-party copies of WizardLM-2 8x22B and 7B continued to circulate even after the official removal.^[14] As of the date of writing, Microsoft has not formally re-released WizardLM-2 through its own channels.

Departure to Tencent

In May 2025, Can Xu and Qingfeng Sun, identified as the public leads of the WizardLM project, announced on X that the WizardLM team had left Microsoft and joined the Hunyuan organization at Tencent; their first public deliverable after the move was reported to be Hunyuan-TurboS 0416.^[6] According to the same report, neither Microsoft nor Tencent had commented officially on the move at the time.^[6] The departure ended Microsoft's institutional sponsorship of the original WizardLM and Evol-Instruct effort.

Licensing and access caveats

License terms for the Wizard family are inherited from the underlying base models and are not uniform across the family:^[8]^[9]

Sub-family	Underlying base	License of fine-tune
WizardLM V1.0 / V1.1, 30B	LLaMA (v1)	LLaMA research, non-commercial
WizardLM-13B-V1.2, WizardLM-70B-V1.0	Llama 2	Llama 2 Community License
WizardCoder-15B	StarCoder	BigCode OpenRAIL-M (StarCoder license)
WizardCoder-Python	Code Llama	Llama 2 Community License
WizardCoder-33B-V1.1	DeepSeek-Coder-33B	DeepSeek license (research-permissive)
WizardMath 7B / 13B / 70B V1.0	Llama 2	Llama 2 Community License
WizardMath-7B-V1.1	Mistral 7B	Apache 2.0 (Mistral base)
WizardLM-2 7B, 8x22B	Mistral / Mixtral 8x22B	Apache 2.0 (announced)
WizardLM-2 70B	Llama 2 70B	Llama 2 Community (announced)

The Evol-Instruct datasets themselves are released under CC BY-NC 4.0, with the project's code under Apache 2.0; the team's GitHub repository carries an explicit notice that the Evol-Instruct data is intended for academic use only because it was generated by querying OpenAI APIs, whose terms of service forbid using outputs to train competing products.^[8] This last clause has been the most frequently cited practical limitation of building production systems on top of Evol-Instruct corpora.
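Because the license depends on the base model, a deployment script may want to look the terms up per checkpoint. The sketch below transcribes part of the table above into a lookup; it covers only checkpoints named in this article, the helper names are hypothetical, and the model card remains the authoritative source for any given release.

```python
from typing import Dict, List, Tuple

# (base model, license of fine-tune), transcribed from the table above.
CHECKPOINT_LICENSES: Dict[str, Tuple[str, str]] = {
    "WizardLM-13B-V1.1": ("LLaMA (v1)", "LLaMA research, non-commercial"),
    "WizardLM-30B-V1.0": ("LLaMA (v1)", "LLaMA research, non-commercial"),
    "WizardLM-13B-V1.2": ("Llama 2", "Llama 2 Community License"),
    "WizardLM-70B-V1.0": ("Llama 2", "Llama 2 Community License"),
    "WizardCoder-15B-V1.0": ("StarCoder", "BigCode OpenRAIL-M"),
    "WizardCoder-Python-34B-V1.0": ("Code Llama", "Llama 2 Community License"),
    "WizardCoder-33B-V1.1": ("DeepSeek-Coder-33B", "DeepSeek license"),
    "WizardMath-70B-V1.0": ("Llama 2", "Llama 2 Community License"),
    "WizardMath-7B-V1.1": ("Mistral 7B", "Apache 2.0"),
}


def license_for(checkpoint: str) -> str:
    """Return the license string recorded for a checkpoint, or raise KeyError."""
    try:
        return CHECKPOINT_LICENSES[checkpoint][1]
    except KeyError:
        raise KeyError(f"no license recorded for {checkpoint!r}; check the model card") from None


def group_by_license() -> Dict[str, List[str]]:
    """Group checkpoint names by license string."""
    groups: Dict[str, List[str]] = {}
    for name, (_base, lic) in CHECKPOINT_LICENSES.items():
        groups.setdefault(lic, []).append(name)
    return groups


if __name__ == "__main__":
    for lic, names in group_by_license().items():
        print(f"{lic}: {', '.join(names)}")
```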

Significance and broader influence

Evol-Instruct has had a lasting influence on instruction-data synthesis. By 2024 the technique was widely cited as a reference method for synthetic data generation in instruction tuning, alongside the Self-Instruct method used to build Stanford Alpaca.^[15] Microsoft's own follow-up "Automatic Instruction Evolving for Large Language Models" (Auto Evol-Instruct, June 2024) generalises Evol-Instruct by having a meta-LLM design the evolution prompts automatically rather than relying on hand-written operators, and reports improvements on MT-Bench, AlpacaEval, GSM8K, and HumanEval against the hand-designed baseline.^[15] Numerous third-party fine-tuning datasets and pipelines, including community Open-Hermes-style corpora and the long tail of code-specific supervised fine-tuning datasets, advertise an "Evol-Instruct" stage as part of their data pipeline.

Three broader contributions are commonly credited to the WizardLM line:

A reusable difficulty-shifting recipe. The in-depth operators are simple natural-language prompts that practitioners can apply to any seed dataset using any sufficiently strong teacher model, making the method substrate-agnostic and easy to replicate.^[1]
Multi-domain validation. By instantiating the same recipe in code (WizardCoder) and math (WizardMath) and topping the respective open benchmarks each time, the team demonstrated that the difficulty-shift hypothesis is not specific to general chat.^[2]^[3]
Process-supervised RL from synthetic data. RLEIF was an early concrete instantiation of training a process reward model (PRM) on AI-generated step-level labels and using it to drive PPO, prefiguring later work that pushed PRMs as a central tool in reasoning-oriented post-training.^[3]

Limitations and criticisms

Several limitations have been documented in the literature and in community discussion:

Teacher contamination. Because Evol-Instruct uses ChatGPT (and later GPT-4) as the rewriter and answer generator, the resulting student model inherits the teacher's idiosyncrasies, style, refusals, and factual errors. The OpenAI terms-of-service clause restricting use of outputs for competing-product training also forces the published Evol-Instruct datasets to be released under non-commercial licenses.^[8]
Benchmark validity and judge bias. Several open-source community analyses in 2023 and 2024 observed that high MT-Bench and AlpacaEval scores for Evol-Instruct-style fine-tunes correlated only weakly with downstream user preference in production deployments, and noted that GPT-4-judged benchmarks systematically reward the verbose, structured outputs typical of Evol-Instruct-style models.^[16]
Operator brittleness. The five in-depth operators are hand-engineered and not provably exhaustive; Auto Evol-Instruct argues that learned evolution prompts outperform the hand-designed set on multiple benchmarks, suggesting that the original prompt suite was a strong starting point but not optimal.^[15]
The WizardLM-2 incident. The April 2024 withdrawal illustrated that, despite the team's strong evaluation infrastructure for capability benchmarks, mandatory safety steps in the Microsoft release pipeline (specifically toxicity testing) were not part of the team's standard pre-release checks.^[5]^[13] The incident was widely cited as a case study in why "release process" gating must be enforced procedurally rather than at the team's discretion. Because mirrored copies were preserved by the community, the takedown also illustrated the limited effectiveness of post-hoc retraction once weights have entered open repositories.^[14]
Reproducibility under license drift. Because each WizardLM checkpoint inherits the license of its base model, downstream users must individually verify whether their target use is permitted under the LLaMA, Llama 2, StarCoder, Code Llama, DeepSeek, Mistral, or Mixtral terms; the resulting matrix is non-trivial and has been a frequent source of confusion in adoption.^[8]

Comparison with adjacent approaches

Approach	Source of new prompts	Source of new responses	Relation to WizardLM
Stanford Alpaca (Self-Instruct)	LLM bootstrapped from seed	Same LLM	seed of Evol-Instruct
Vicuna	ShareGPT user logs	ChatGPT (logged)	baseline chat comparator
Evol-Instruct (WizardLM)	LLM rewrite of seed prompts via fixed operators	Teacher LLM	core method
Auto Evol-Instruct	Meta-LLM-designed rewriting operators	Teacher LLM	follow-up by same group
OpenHermes / Nous-style mixes	Human-curated mixture of public sources	Various	community packaging that often includes Evol-Instruct
Constitutional AI	Fixed seed prompts	LLM critiqued against a principle list	parallel safety-focused synthetic-data approach

Compared with the Self-Instruct approach behind Stanford Alpaca, Evol-Instruct's distinguishing feature is primarily vertical rather than horizontal expansion: instead of using the teacher to brainstorm new instructions from a small seed, it mainly uses the teacher to push existing instructions up the difficulty axis, with the single in-breadth operator supplying topical variety. Compared with Vicuna, which fine-tunes on real user logs from ShareGPT, Evol-Instruct trades realism for controllable difficulty and topical balance.

References

1. Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, Daxin Jiang, "WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions", arXiv, 2023-04-24. https://arxiv.org/abs/2304.12244. Accessed 2026-05-20.
2. Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, Daxin Jiang, "WizardCoder: Empowering Code Large Language Models with Evol-Instruct", arXiv, 2023-06-14. https://arxiv.org/abs/2306.08568. Accessed 2026-05-20.
3. Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, Yansong Tang, Dongmei Zhang, "WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct", arXiv, 2023-08-18. https://arxiv.org/abs/2308.09583. Accessed 2026-05-20.
4. WizardLM Team, "WizardLM-2", project page, wizardlm.github.io, 2024-04-15. https://wizardlm.github.io/WizardLM2/. Accessed 2026-05-20.
5. elblog.pl, "Microsoft Pulls AI Model WizardLM-2 for Lacking Toxicity Tests", 2024-04-17. https://elblog.pl/2024/04/17/microsoft-pulls-ai-model-wizardlm-2-for-lacking-toxicity-tests/. Accessed 2026-05-20.
6. Kyle Wiggers, "Tencent hires WizardLM team, a Microsoft AI group with an odd history", TechCrunch, 2025-05-13. https://techcrunch.com/2025/05/13/tencent-hires-wizardlm-team-a-microsoft-ai-group-with-an-odd-history/. Accessed 2026-05-20.
7. OpenReview, "WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions", ICLR 2024 conference paper, 2024. https://openreview.net/forum?id=CfXh93NDgH. Accessed 2026-05-20.
8. nlpxucan / WizardLMTeam, "WizardLM GitHub repository and HuggingFace organization", 2023 to 2024. https://github.com/nlpxucan/WizardLM and https://huggingface.co/WizardLMTeam. Accessed 2026-05-20.
9. WizardLMTeam, "WizardLM-13B-V1.2 model card", Hugging Face, 2023-09-09. https://huggingface.co/WizardLMTeam/WizardLM-13B-V1.2. Accessed 2026-05-20.
10. WizardLM Team, "WizardMath", project page, wizardlm.github.io, 2023. https://wizardlm.github.io/WizardMath/. Accessed 2026-05-20.
11. OpenReview, "WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct", ICLR 2025 Oral, 2025. https://openreview.net/forum?id=mMPMHWOdOy. Accessed 2026-05-20.
12. Asif Razzaq, "WizardLM-2: An Open Source AI Model that Claims to Outperform GPT-4 in the MT-Bench Benchmark", MarkTechPost, 2024-04-16. https://www.marktechpost.com/2024/04/16/wizardlm-2-an-open-source-ai-model-that-claims-to-outperform-gpt-4-in-the-mt-bench-benchmark/. Accessed 2026-05-20.
13. Markus Kasanmascheff, "The Brief Appearance and Disappearance of Microsoft's Latest AI Model WizardLM-2", WinBuzzer, 2024-04-25. https://winbuzzer.com/2024/04/25/the-brief-appearance-and-disappearance-of-microsofts-latest-ai-model-wizardlm-2-xcxwbn/. Accessed 2026-05-20.
14. CTOL Digital Solutions, "Microsoft Withdraws WizardLM-2 AI Model Over Missing Toxicity Testing", CTOL Digital, 2024-04. https://www.ctol.digital/news/microsoft-withdraws-wizardlm-2-ai-model/. Accessed 2026-05-20.
15. Weihao Zeng, Can Xu, Yingxiu Zhao, Jian-Guang Lou, Weizhu Chen, "Automatic Instruction Evolving for Large Language Models", arXiv, 2024-06-02. https://arxiv.org/abs/2406.00770. Accessed 2026-05-20.
16. Yann Dubois et al., "AlpacaEval and the Limits of LLM-Judge Benchmarks", AlpacaEval Leaderboard documentation, Tatsu Lab, Stanford, 2023 to 2024. https://tatsu-lab.github.io/alpaca_eval/. Accessed 2026-05-20.
