Qwen2-Math
Last reviewed: Jun 3, 2026
Sources: 5 citations
Review status: Source-backed
Revision: v1 · 1,723 words
Qwen2-Math is a series of mathematics-specialized large language models released by the Qwen team at Alibaba on 8 August 2024. The series comes in three open-weight sizes (1.5 billion, 7 billion, and 72 billion parameters), each shipped as a base model and an instruction-tuned (Instruct) variant. The models start from the general-purpose Qwen2 base checkpoints and are then further pretrained on a large mathematics-focused corpus, so they inherit Qwen2's architecture and tokenizer while specializing in arithmetic, algebra, and competition-level problem solving [1][2].
At release the flagship Qwen2-Math-72B-Instruct was positioned as the strongest math model then available, with the Qwen team reporting that it outperformed leading closed-source systems including OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, Google's Gemini 1.5 Pro, and the open Llama-3.1-405B-Instruct on standard mathematical reasoning benchmarks [1]. The series was short-lived as a flagship: about six weeks later the team released its successor, Qwen2.5-Math, which added the ability to write and run Python code while reasoning and extended language support from English-only to both Chinese and English [2][3].
The Qwen2-Math models descend directly from the Qwen2 generation announced in June 2024. Rather than train a math model from scratch, the team initialized Qwen2-Math-1.5B/7B/72B from the corresponding Qwen2 base checkpoints and continued pretraining them on mathematics data [1]. This is the same continued-pretraining recipe the team used for its code line, and it lets the math models share Qwen2's decoder-only transformer design and 151,646-token vocabulary while concentrating their additional training budget on mathematical text.
The work was documented after the fact in the Qwen2.5-Math technical report (arXiv:2409.12122), submitted on 18 September 2024, which describes both Qwen2-Math and Qwen2.5-Math as stages in a single self-improvement pipeline. In that pipeline the earlier Qwen2-Math-Instruct models are used to synthesize and filter additional training data for the later Qwen2.5-Math models, so the two releases are tightly coupled even though Qwen2-Math was announced first through a standalone blog post [1][4].
The series spans three dense sizes. Each is released as a base model intended for downstream fine-tuning and as an Instruct model aligned for direct use. The base models are continue-pretrained on the Qwen Math Corpus version 1, which the technical report describes as containing approximately 700 billion tokens of high-quality mathematical data [4].
| Size | Base checkpoint | Variants | Context length | License |
|---|---|---|---|---|
| 1.5B | Qwen2-1.5B | Base, Instruct | 4K | Apache 2.0 |
| 7B | Qwen2-7B | Base, Instruct | 4K | Tongyi Qianwen |
| 72B | Qwen2-72B | Base, Instruct | 4K | Tongyi Qianwen |
All three sizes are decoder-only transformers carried over from Qwen2, using rotary position embeddings, SwiGLU activations, RMSNorm, and grouped-query attention. The models target text-only mathematical reasoning and, in this first series, solve problems through natural-language chain-of-thought (CoT) alone, without any tool or code execution [1].
The base models are pretrained on the Qwen Math Corpus v1, a mathematics-specific dataset assembled from large-scale high-quality mathematical web text, books, code, exam questions, and additional mathematical data synthesized by Qwen2 itself [1][4]. The corpus is reported at roughly 700 billion tokens, which the successor series later expanded to more than one trillion tokens as Qwen Math Corpus v2 [4].
A central concern for a benchmark-heavy domain like mathematics is test-set contamination, where evaluation questions leak into the pretraining data and inflate scores. The Qwen team applied a decontamination step that removed training samples overlapping with widely used evaluation sets. It first removed exact matches, then flagged samples that share a 13-gram with a test item and treated them as contaminated only when the ratio of the longest common subsequence to the candidate also exceeded 0.6 [1][4]. The 13-gram stage catches near-duplicates that exact matching misses, while the ratio threshold keeps incidental overlap from being counted as contamination. The datasets screened out this way include GSM8K, MATH, Aqua, SAT Math, OlympiadBench, College Math, AIME 2024, and AMC 2023 [1].
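The filter described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Qwen team's implementation: the whitespace tokenization, the lowercasing, and the choice to divide the longest-common-subsequence length by the length of the training sample are all assumptions, and the function names are hypothetical.

```python
def ngrams(tokens: list[str], n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of all n-grams in a token list (empty if too short)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def is_contaminated(train_sample: str, test_sample: str,
                    n: int = 13, threshold: float = 0.6) -> bool:
    """Flag a training sample that overlaps a test item.

    Assumption: tokens are lowercased whitespace-separated words, and the
    ratio is LCS length divided by the training sample's token count.
    """
    train = train_sample.lower().split()
    test = test_sample.lower().split()
    if not train:
        return False
    if train == test:  # exact match
        return True
    if not ngrams(train, n) & ngrams(test, n):  # no shared 13-gram
        return False
    return lcs_length(train, test) / len(train) > threshold


# Hypothetical example data, invented for illustration.
test_item = ("a train leaves the station at 3 pm traveling at 60 miles per hour "
             "how far does it go in 2 hours")
near_duplicate = "Problem: " + test_item
long_document = test_item + " " + " ".join(f"filler{i}" for i in range(40))

print(is_contaminated(near_duplicate, test_item))
print(is_contaminated(long_document, test_item))
```

In this sketch the near-duplicate is flagged, while the long document that merely quotes the question is not, because the shared text makes up well under 60% of it.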
The Instruct models are produced from the base checkpoints through supervised fine-tuning (SFT) followed by reinforcement learning, both guided by a math-specific reward model that the team trained on top of Qwen2-Math-72B. The reward model supplies a dense reward, which scores the quality of intermediate reasoning, and the team combines it with a binary signal that simply indicates whether the model reached the correct final answer; the combined signal supervises both stages [1].
This reward model is used at two stages. First, it guides rejection sampling: the base model generates many candidate solutions, and the reward model selects high-quality, correct chains of reasoning to build the SFT dataset. Second, after SFT, the reward model supervises reinforcement learning using Group Relative Policy Optimization (GRPO), the same RL algorithm popularized by DeepSeekMath, in which a group of sampled responses is scored and each response's advantage is computed relative to the group mean and standard deviation [1][4]. The combination of rejection-sampled SFT data and GRPO is what produces the final Qwen2-Math-Instruct models.
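The group-relative advantage mentioned above can be illustrated with a short Python sketch. It covers only the normalization step, not the full GRPO policy update, and it makes assumptions the article does not specify: the population standard deviation is used, a small `eps` guards against division by zero, and the function name is hypothetical.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and standard deviation of its group.

    Assumption: population standard deviation, plus eps so that a group
    with identical rewards yields zero advantages instead of an error.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled solutions to the same problem.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, so the policy is pushed toward the better samples within each group without needing a separate value network.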
At inference time the team also reports decoding strategies that exploit the reward model. In addition to greedy decoding, the Instruct models can draw multiple samples and aggregate them by majority vote (Maj@N) or by selecting the highest-reward sample (RM@N). The blog reports that reward-model selection (RM@8) outperforms plain majority voting (Maj@8), and that this gap is largest on the smaller 1.5B and 7B models, where the base reasoning is weaker and the reward model adds the most value [1].
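The difference between the two aggregation strategies can be shown with a small Python sketch. The answers and reward scores below are invented for illustration, and the function names are hypothetical; in practice the scores would come from the reward model.

```python
from collections import Counter


def maj_at_n(answers: list[str]) -> str:
    """Maj@N: return the final answer that appears most often among N samples."""
    return Counter(answers).most_common(1)[0][0]


def rm_at_n(answers: list[str], scores: list[float]) -> str:
    """RM@N: return the answer of the sample with the highest reward score."""
    best = max(range(len(answers)), key=scores.__getitem__)
    return answers[best]


# Hypothetical final answers from 8 samples, with hypothetical reward scores.
answers = ["42", "41", "41", "42", "41", "40", "41", "39"]
scores = [0.91, 0.35, 0.40, 0.88, 0.30, 0.10, 0.45, 0.05]

print(maj_at_n(answers))
print(rm_at_n(answers, scores))
```

Here majority voting returns the most frequent answer, while reward-model selection returns a minority answer that the scorer rates highly. This is the situation in which RM@N can beat Maj@N: a weaker model produces the right answer only occasionally, but the reward model can still recognize it.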
The Qwen team evaluated Qwen2-Math on a standard battery of English mathematical reasoning benchmarks spanning grade-school word problems (GSM8K), competition-style problems (MATH), undergraduate material (College Math, Minerva Math), Chinese college-entrance problems rendered in English (GaoKao 2023 En), and olympiad problems (OlympiadBench). The table below lists chain-of-thought accuracy (in percent) for the three Instruct models as reported in the Qwen2.5-Math technical report, which evaluates the Qwen2-Math line alongside its successor [4].
| Benchmark | Qwen2-Math-1.5B-Instruct | Qwen2-Math-7B-Instruct | Qwen2-Math-72B-Instruct |
|---|---|---|---|
| GSM8K | 84.2 | 89.9 | 96.7 |
| MATH | 69.4 | 75.1 | 84.0 |
| College Math | 44.2 | 45.9 | 47.9 |
| Minerva Math | 29.4 | 34.6 | 40.1 |
| GaoKao 2023 En | 59.7 | 62.1 | 68.3 |
| OlympiadBench | 31.3 | 38.2 | 43.0 |
For competition mathematics the team reported how many problems each model solved on the 2024 American Invitational Mathematics Examination (AIME 2024) and the 2023 American Mathematics Competitions (AMC 2023). Using greedy decoding, Qwen2-Math-72B-Instruct solved 6 of 30 AIME 2024 problems and 24 of 40 AMC 2023 problems, with the counts rising further under sampling-based strategies such as Maj@64 and RM@256 [1][4].
The headline comparison at release was against the strongest general-purpose systems of the time. On the MATH benchmark, Qwen2-Math-72B-Instruct's 84.0 exceeded the score the team measured for GPT-4o (the 2024-08-06 version), reported at 81.1, and on GSM8K it reached 96.7 against GPT-4o's 92.9 [4]. On this basis the Qwen team described the 72B Instruct model as outperforming GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and Llama-3.1-405B-Instruct on mathematical reasoning, making it, at the time, the leading model on these math evaluations regardless of whether the competitor was open or closed [1]. As with all self-reported benchmark comparisons, the figures reflect the prompting and evaluation harness chosen by the model's authors rather than an independent assessment.
The first Qwen2-Math series carried two notable limitations that the team acknowledged directly. The models were built primarily for English: the blog post states that "this model mainly supports English" and promises that "we will release bilingual (English and Chinese) math models soon" [1]. They also reason in natural language alone, using chain-of-thought without any ability to call external tools, so a long arithmetic computation or a symbolic manipulation has to be carried out token by token rather than by running code. Both limitations were addressed in the successor series.
The three sizes are not released under a single license. Qwen2-Math-1.5B and its Instruct variant are published under the permissive Apache License 2.0, which allows commercial use, modification, and redistribution. The 7B and 72B models are released under the Tongyi Qianwen license, Alibaba's custom community license, which permits commercial use but attaches additional terms compared with Apache 2.0 [1][5]. The same per-size license split was carried forward to the Qwen2.5-Math series. All checkpoints are distributed on Hugging Face in base and Instruct variants, and the inference and evaluation code is published in the QwenLM GitHub repository [2][5].
Qwen2.5-Math was released on 19 September 2024, about six weeks after Qwen2-Math, and described itself as the world's leading open-source mathematical model series [2][3]. It kept the same three sizes (1.5B, 7B, 72B) but moved them onto the Qwen2.5 base checkpoints and added a separately released reward model, Qwen2.5-Math-RM-72B [3][4].
The two most important changes addressed Qwen2-Math's stated limitations. First, Qwen2.5-Math-Instruct supports both Chinese and English, where Qwen2-Math handled English only [3]. Second, and more significant for accuracy, it added Tool-Integrated Reasoning (TIR): the model can interleave natural-language reasoning with Python code that it writes and executes, so steps such as finding the roots of a quadratic equation or computing the eigenvalues of a matrix are handled by exact computation rather than by predicting digits [2][3]. The successor also embodies a self-improvement loop in which Qwen2-Math-Instruct is used to synthesize the additional pretraining data and the reward model is used to guide sampling during inference, while the pretraining corpus grew from roughly 700 billion tokens to over one trillion [4].
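The execution step behind tool-integrated reasoning can be sketched as follows. This is a toy illustration, not the Qwen2.5-Math inference code: the response string is hypothetical rather than real model output, the fenced-block convention and the helper name `run_code_blocks` are assumptions, and the code is executed without any sandboxing, which a real system would need.

```python
import contextlib
import io
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal nested fence
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def run_code_blocks(response: str) -> list[str]:
    """Execute each Python block found in a response and collect its printed output.

    Assumption: code is delimited by a fenced python block. WARNING: exec()
    here is unsandboxed and only suitable for trusted, illustrative input.
    """
    outputs = []
    for code in CODE_BLOCK.findall(response):
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
        outputs.append(buffer.getvalue().strip())
    return outputs


# Hypothetical response that mixes prose with a quadratic-root computation.
response = (
    "The roots of x^2 - 5x + 6 = 0 can be computed exactly.\n"
    + FENCE + "python\n"
    "import math\n"
    "a, b, c = 1, -5, 6\n"
    "d = math.isqrt(b * b - 4 * a * c)\n"
    "print(sorted([(-b - d) // (2 * a), (-b + d) // (2 * a)]))\n"
    + FENCE + "\n"
)

print(run_code_blocks(response))  # prints ['[2, 3]']
```

In a full loop the captured output would be appended to the conversation so the model can continue reasoning from the exact result instead of predicting the digits itself.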
These changes produced a large accuracy jump on hard benchmarks. Qwen2.5-Math-72B-Instruct scored 87.8 on MATH using tool-integrated reasoning, well above the 84.0 its Qwen2-Math predecessor reached with chain-of-thought alone, and the team reported its flagship outperforming GPT-4o and a math-specialized Gemini 1.5 Pro by a wide margin [2][3]. Within the broader Qwen family, Qwen2-Math is thus best understood as the first iteration of a math line that the team refined rapidly, with Qwen2.5-Math superseding it within weeks and later Qwen generations continuing the series.