GSM8K (Grade School Math 8K) is a benchmark dataset of 8,792 grade-school-level math word problems created by researchers at OpenAI to evaluate the multi-step mathematical reasoning capabilities of large language models. Introduced in the 2021 paper "Training Verifiers to Solve Math Word Problems" by Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman, GSM8K has become one of the most widely used benchmarks for measuring how well language models handle arithmetic reasoning tasks that require multiple sequential steps.
The benchmark played a central role in the development of chain-of-thought prompting and verification-based inference strategies. Its influence extends across hundreds of research papers, and it remains a standard evaluation metric reported in model technical reports and academic publications, even as frontier models have begun to saturate the benchmark.
Before GSM8K was introduced, existing math benchmarks for language models were limited in scope and difficulty calibration. Many datasets either contained single-step arithmetic problems that failed to test genuine reasoning, or featured competition-level mathematics that was too difficult to provide meaningful signal for most models. The OpenAI team recognized the need for a benchmark that occupied a middle ground: problems that were conceptually straightforward for humans but required multiple reasoning steps, making them genuinely challenging for language models.
The core insight behind GSM8K was that even the largest transformer models at the time struggled to solve problems that a bright middle school student could handle. This gap between human capability and model performance on basic multi-step math highlighted a fundamental limitation in how language models approached sequential reasoning tasks. By creating a carefully curated dataset of such problems, the researchers aimed to provide a clear, measurable target for improvement.
The timing of GSM8K's release coincided with growing interest in scaling laws and emergent abilities of large language models. Researchers needed benchmarks that could expose specific weaknesses in model reasoning, and grade school math provided an accessible yet challenging domain for this purpose.
GSM8K contains 8,792 problems in total, divided into a training set of 7,473 problems and a test set of 1,319 problems. All problems were written by human contributors and underwent rigorous quality control.
Each problem in GSM8K is a natural language word problem that requires between 2 and 8 steps to solve. Solutions involve only basic arithmetic operations: addition, subtraction, multiplication, and division. The problems are designed so that all intermediate calculations are manageable without a calculator (for example, multiplying 7 by 8 or adding 36 and 110). Every problem has a single integer as its final answer.
The problems cover a range of everyday scenarios involving money, quantities, time, distances, and other practical contexts. They are linguistically diverse, meaning they use varied sentence structures and vocabulary rather than following a single template.
A typical GSM8K problem looks like this:
Question: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
Answer: Natalia sold 48 / 2 = <<48/2=24>>24 clips in May. Natalia sold 48 + 24 = <<48+24=72>>72 clips altogether in April and May. #### 72
The answer format includes natural language explanations interleaved with calculator annotations (the <<expression=result>> notation), followed by the delimiter #### and the final numeric answer. This structured format allows automated scoring by simply parsing the number after the #### token.
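The scoring step described above can be sketched in a few lines of Python. The regular expression and helper name here are illustrative, not taken from the official evaluation code:

```python
import re

def extract_final_answer(answer_text: str) -> str:
    """Return the numeric answer following the '####' delimiter."""
    match = re.search(r"####\s*([\-0-9,\.]+)", answer_text)
    if match is None:
        raise ValueError("No '####' delimiter found")
    # Strip thousands separators so '1,000' and '1000' compare equal.
    return match.group(1).replace(",", "")

answer = ("Natalia sold 48 / 2 = <<48/2=24>>24 clips in May. "
          "Natalia sold 48 + 24 = <<48+24=72>>72 clips altogether. #### 72")
print(extract_final_answer(answer))  # → 72
```

A model response is then marked correct by comparing this extracted string against the gold answer parsed the same way.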
The dataset is stored as JSON Lines (.jsonl) files, with each line containing a dictionary with two keys: "question" and "answer". The dataset is available in two configurations:
| Configuration | Description | Train | Test |
|---|---|---|---|
| Main | Standard question-answer pairs | 7,473 | 1,319 |
| Socratic | Includes auto-generated Socratic subquestions before each solution step | 7,473 | 1,319 |
The Socratic variant includes additional guiding subquestions (such as "How many clips did Natalia sell in May?") prepended to each reasoning step. These subquestions were generated by a specialized fine-tuned model trained on approximately 800 examples.
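Given the JSON Lines layout described above, a record can be parsed with the standard library alone, and the calculator annotations stripped to recover the plain-language solution. The field names match the dataset; the helper name is illustrative:

```python
import json
import re

# One .jsonl line is one record with exactly these two keys.
line = json.dumps({
    "question": "Natalia sold clips to 48 of her friends in April, and then "
                "she sold half as many clips in May. How many clips did "
                "Natalia sell altogether in April and May?",
    "answer": "Natalia sold 48 / 2 = <<48/2=24>>24 clips in May. #### 24",
})

record = json.loads(line)

def strip_calculator_annotations(answer: str) -> str:
    """Remove the <<expression=result>> calculator markup, keeping the text."""
    return re.sub(r"<<[^>]*>>", "", answer)

print(strip_calculator_annotations(record["answer"]))
```

Stripping the annotations is a common preprocessing step when the solutions are used as chain-of-thought exemplars rather than as calculator-augmented training targets.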
The problems in GSM8K were created by Surge AI in partnership with OpenAI's reinforcement learning team. The creation process involved several deliberate steps to ensure quality and diversity.
Surge AI assembled a team of mathematically proficient writers, prioritizing contributors with math or STEM degrees. This background reduced calculation errors, improved writing speed, and enabled more diverse problem designs. All contributors had their initial five submissions peer-reviewed before being accepted onto the full team.
OpenAI established specific criteria for acceptable problems: each problem had to be solvable in 2 to 8 steps using only basic arithmetic, keep every intermediate calculation simple enough to perform without a calculator, and resolve to a single integer answer.
Quality assurance involved multiple layers, including the peer review of each new contributor's initial submissions and subsequent review of submitted problems and solutions for correctness and clarity.
Despite these measures, later analysis revealed that roughly 5% of the test set contained errors, ambiguities, or logical inconsistencies, a finding that eventually led to the creation of GSM8K-Platinum (discussed below).
The original GSM8K paper proposed a novel strategy for improving model performance that went beyond standard fine-tuning. Rather than simply training a model to generate correct solutions, the researchers introduced the concept of training a separate verifier model to evaluate the correctness of candidate solutions.
The verification procedure operates as follows:
1. A generator model is fine-tuned on the GSM8K training set to produce step-by-step solutions.
2. A separate verifier model is trained to predict whether a candidate solution is correct, using generator samples labeled by whether their final answers match the ground truth.
3. At test time, the generator samples many candidate solutions per problem (up to 100 in the paper).
4. The candidates are ranked by verifier score, and the final answer of the top-ranked candidate is returned.
The key distinction between the verifier approach and standard fine-tuning is that fine-tuning relies on generating a single solution (greedy or low-temperature), while verification leverages test-time compute by generating many solutions and selecting among them. This tradeoff between training-time and inference-time computation became a recurring theme in later AI research.
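The generate-then-verify loop can be sketched as follows, with stubs standing in for the fine-tuned generator and verifier models (all function names and sample data here are illustrative, not from the paper's code):

```python
import random

def generate_candidates(problem: str, n: int = 100) -> list[str]:
    """Stub for the fine-tuned generator: sample n candidate solutions."""
    random.seed(0)  # deterministic for the example
    return [f"... #### {random.choice([70, 72, 72, 96])}" for _ in range(n)]

def verifier_score(problem: str, solution: str) -> float:
    """Stub for the verifier: estimated probability the solution is correct."""
    return 0.9 if solution.endswith("72") else 0.1

def solve_with_verifier(problem: str, n: int = 100) -> str:
    candidates = generate_candidates(problem, n)
    # Rank all sampled solutions by verifier score; return the best one's answer.
    best = max(candidates, key=lambda s: verifier_score(problem, s))
    return best.split("####")[-1].strip()

print(solve_with_verifier("How many clips did Natalia sell?"))  # → 72
```

The essential point is that extra inference-time sampling, filtered by a learned scorer, substitutes for a larger model: the generator can be wrong most of the time as long as the verifier can pick out a correct candidate.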
The paper tested this approach using GPT-3 models of varying sizes and reported results primarily through figures. The following approximate results were extracted from the paper's plots:
| Model Configuration | GSM8K Test Accuracy |
|---|---|
| 6B parameter model, fine-tuned (single sample) | ~20% |
| 6B parameter model, final-answer-only (no intermediate steps) | ~5.2% |
| 175B parameter model, fine-tuned (single sample) | ~55% |
| 6B parameter model with verifier (100 candidates) | Slightly above 175B fine-tuned |
A striking finding was that, when trained on the full dataset, the 6B model with verification slightly outperformed the fine-tuned 175B model, delivering a boost roughly equivalent to a 30x increase in model size. The authors also noted that, extrapolating from their fine-tuning baseline, a model with approximately 10^16 parameters would be needed to reach an 80% solve rate using standard generation methods alone. This underscored the value of the verification approach as an alternative to simply scaling up model parameters.
The paper additionally found that verification scaled more effectively with additional training data compared to the fine-tuning baseline, and that token-level verification outperformed solution-level verification. The importance of intermediate steps was demonstrated clearly: when a 6B model was trained to output only final answers without showing its work, accuracy dropped from approximately 20% to just 5.2%.
GSM8K uses a straightforward evaluation protocol. A model's response is considered correct if and only if it produces the exact final numeric answer. The answer is extracted from the text following the #### delimiter or, in free-form generation, from the last number in the model's response.
Two metrics are commonly used:
- Solve rate (often reported as pass@1): the fraction of test problems for which a single greedily decoded or sampled solution yields the correct final answer.
- Majority voting (maj@k): k solutions are sampled per problem and the most frequent final answer is taken; the problem counts as solved if that answer is correct.
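Given per-problem lists of sampled final answers, these metrics reduce to a few lines of Python (a sketch; the function names are not from any official harness):

```python
from collections import Counter

def solve_rate(predictions: list[str], gold: list[str]) -> float:
    """pass@1: fraction of problems where the single prediction is exact."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def maj_at_k(samples: list[list[str]], gold: list[str]) -> float:
    """maj@k: take the most common answer among the k samples per problem."""
    votes = [Counter(s).most_common(1)[0][0] for s in samples]
    return solve_rate(votes, gold)

gold = ["72", "18"]
print(solve_rate(["72", "20"], gold))                            # → 0.5
print(maj_at_k([["72", "72", "70"], ["18", "18", "20"]], gold))  # → 1.0
```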
Performance on GSM8K varies substantially depending on the prompting strategy used:
| Prompting Strategy | Description | Typical Impact |
|---|---|---|
| Zero-shot | No examples provided | Lower accuracy; models may not format answers correctly |
| Few-shot | 4 to 8 example problems with solutions | Significant accuracy boost; standard evaluation setting |
| Chain-of-thought (CoT) | Few-shot with explicit step-by-step reasoning | Major improvement; became the standard approach after Wei et al. (2022) |
| Self-consistency | Sample multiple CoT paths and take majority vote | Further gains of 10-20 percentage points over single-sample CoT |
| Program-aided (PAL) | Model writes code to solve the problem | Reduces arithmetic errors; competitive with CoT |
The most common evaluation configuration in published benchmarks uses either 5-shot or 8-shot chain-of-thought prompting, or zero-shot prompting for instruction-tuned models that already produce step-by-step reasoning without explicit examples.
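A program-aided (PAL) approach has the model emit code instead of prose. For the Natalia problem shown earlier, a generated program might look like the following (an illustrative sketch of the style, not output from any particular model):

```python
# A PAL-style solution: the reasoning steps become variable assignments,
# and the Python interpreter performs the arithmetic exactly, which is
# why PAL reduces arithmetic slips relative to free-form CoT.
clips_april = 48
clips_may = clips_april // 2   # "half as many clips in May"
answer = clips_april + clips_may
print(answer)  # → 72
```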
GSM8K became deeply intertwined with the development of chain-of-thought (CoT) prompting, one of the most influential techniques in modern prompt engineering. In their 2022 paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," Jason Wei and colleagues at Google demonstrated that providing a few worked-out examples (exemplars) in the prompt could dramatically improve a model's ability to solve multi-step reasoning problems.
Using GSM8K as a primary evaluation benchmark, Wei et al. showed that prompting the PaLM 540B model with just eight chain-of-thought exemplars achieved 56.9% accuracy, surpassing the prior state of the art set by fine-tuned GPT-3 175B with a verifier in the original Cobbe et al. paper. Before CoT prompting, standard few-shot prompting with PaLM 540B achieved only about 18% on GSM8K. This was a landmark result because it demonstrated that reasoning capabilities could emerge from sufficiently large models through careful prompting alone, without any fine-tuning or specialized training.
Chain-of-thought prompting was found to be an emergent property of model scale. The performance gains appeared only in models with roughly 100 billion parameters or more; smaller models showed little to no improvement from CoT prompting.
Building on the CoT framework, Wang et al. (2023) introduced self-consistency, a decoding strategy that samples multiple diverse reasoning paths and selects the most common final answer through majority voting. Applied to PaLM 540B with chain-of-thought prompting, self-consistency produced a 17.9 percentage point improvement on GSM8K over standard CoT prompting, reaching approximately 74.4% accuracy. This technique demonstrated that there is substantial value in exploring multiple solution paths rather than relying on a single greedy decode.
Google's Minerva model (Lewkowycz et al., 2022), a 540B parameter model fine-tuned specifically on mathematical and scientific data, pushed the state of the art further. Minerva 540B achieved 78.5% accuracy on GSM8K with majority voting, improving upon the previous best of 74.4%. This result demonstrated that combining domain-specific fine-tuning with test-time techniques like self-consistency could yield additional gains.
GSM8K performance has improved dramatically since the benchmark was introduced, tracking the rapid progress of language model capabilities. The following table summarizes notable scores across different models and time periods.
| Model | Organization | Year | GSM8K Accuracy | Method |
|---|---|---|---|---|
| GPT-3 6B (fine-tuned) | OpenAI | 2021 | ~20% | Fine-tuning |
| GPT-3 175B (fine-tuned) | OpenAI | 2021 | ~55% | Fine-tuning |
| PaLM 540B | Google | 2022 | 56.9% | 8-shot CoT |
| PaLM 540B + Self-Consistency | Google | 2022 | 74.4% | CoT + majority voting |
| Minerva 540B | Google | 2022 | 78.5% | Majority voting |
| GPT-3.5 Turbo | OpenAI | 2023 | ~57% | 5-shot CoT |
| Claude 2 | Anthropic | 2023 | 88.0% | 0-shot CoT |
| GPT-4 | OpenAI | 2023 | 92.0% | 5-shot CoT |
| Claude 3 Haiku | Anthropic | 2024 | 88.9% | 0-shot |
| Gemini 1.5 Pro | Google | 2024 | 90.8% | 0-shot |
| Gemini 1.5 Flash | Google | 2024 | 86.2% | 0-shot |
| Claude 3 Opus | Anthropic | 2024 | 95.0% | 0-shot |
| Claude 3 Sonnet | Anthropic | 2024 | 92.3% | 0-shot |
| Claude 3.5 Sonnet | Anthropic | 2024 | 96.4% | 0-shot |
| Llama 3.1 405B Instruct | Meta | 2024 | 96.8% | 0-shot |
| Qwen 2.5 72B Instruct | Alibaba | 2024 | 95.8% | 0-shot |
| DeepSeek V2.5 | DeepSeek | 2024 | 95.1% | 0-shot |
| Mistral Large 2 | Mistral AI | 2024 | 93.0% | 0-shot |
| o1 | OpenAI | 2024 | 97.1% | Internal CoT |
| GPT-4.5 | OpenAI | 2025 | 97.0% | 0-shot |
| Kimi K2 Instruct | Moonshot AI | 2025 | 97.3% | 0-shot |
Several patterns are visible in this progression. First, performance jumped significantly between 2021 and 2023 as chain-of-thought prompting, instruction tuning, and reinforcement learning from human feedback were adopted. Second, by 2024, multiple models from different organizations converged above the 95% mark, signaling that the benchmark was approaching saturation for frontier models. Third, the gap between open-weight models (such as Llama 3.1 405B) and proprietary models narrowed considerably.
The history of GSM8K scores can be divided into several distinct phases:
2021: Baseline Era. GPT-3 models set the initial bar. Even the largest 175B parameter model could only solve roughly half the problems with standard fine-tuning. The verification approach showed promise but required generating many candidate solutions.
2022: Chain-of-Thought Revolution. PaLM 540B with CoT prompting achieved 56.9%, matching or exceeding fine-tuned models without any task-specific training. Self-consistency pushed results to 74.4%, and Minerva reached 78.5%. These results demonstrated that prompting techniques and specialized training data could dramatically improve math performance.
2023: GPT-4 Breakthrough. GPT-4 achieved 92.0% with 5-shot CoT, approaching human-level performance for the first time. This result suggested that general-purpose scaling combined with instruction tuning and RLHF could largely solve grade-school math.
2024-2025: Saturation. Multiple models from OpenAI, Anthropic, Meta, Alibaba, and others exceeded 95%. The benchmark could no longer meaningfully differentiate between frontier models. The research community shifted focus to harder benchmarks such as AIME and MATH.
As model scores on GSM8K have climbed above 95%, several criticisms of the benchmark have gained prominence.
By late 2024, nearly all frontier language models scored above 95% on GSM8K, making it difficult to distinguish between top-performing models on this benchmark alone. OpenAI, Anthropic, and Google have all shifted their primary benchmark reporting toward more challenging evaluations such as AIME, MATH, and FrontierMath. Some organizations no longer prominently report GSM8K scores for their latest models.
The saturation problem is compounded by the fact that the benchmark has a hard ceiling: even a perfect reasoner cannot score 100% on the original test set due to label noise (erroneous or ambiguous ground-truth answers). This means that scores above roughly 95% are difficult to interpret, as errors may reflect noise in the benchmark rather than failures in reasoning.
A significant concern is that some models may have been exposed to GSM8K test problems (or very similar problems) during training. The dataset has been publicly available on GitHub and Hugging Face since 2021, and its contents have likely been included in many web-scraped pre-training corpora.
Zhang et al. (2024) investigated this issue systematically in the paper "A Careful Examination of Large Language Model Performance on Grade School Arithmetic," published as a Spotlight paper at NeurIPS 2024. To test for contamination, the researchers commissioned GSM1K, a new dataset of 1,000 grade-school math problems created entirely through manual annotations (without any LLM-generated content) and designed to mirror the style and difficulty of GSM8K while being guaranteed not to appear in any model's training data.
Key findings included:
- Several model families, most notably Phi and Mistral, scored markedly lower on GSM1K than on GSM8K, with accuracy drops of up to roughly 13%, consistent with overfitting to the benchmark.
- Frontier models, including those in the GPT, Claude, and Gemini families, showed minimal signs of overfitting.
- A model's probability of generating examples from the GSM8K test set correlated positively with its GSM8K-to-GSM1K performance gap, suggesting partial memorization of the benchmark.
These findings illustrate an instance of Goodhart's law: when a benchmark becomes a widely tracked target, it risks losing its value as a genuine measure of capability. Models optimized (intentionally or not) for GSM8K may not generalize to novel math problems of equivalent difficulty.
In October 2024, Apple researchers released GSM-Symbolic (Mirzadeh et al., 2024), a variant of GSM8K that uses symbolic templates to generate new problem instances with different numerical values. Their research, published at ICLR 2025, evaluated over 20 open and closed models using 5,000 samples from 100 templates and found that:
- Accuracy varies noticeably across different instantiations of the same template, even when only the numerical values change.
- Performance degrades as the number of clauses in a question grows.
- Adding a single clause that appears relevant but does not affect the answer (the GSM-NoOp variant) causes substantial accuracy drops, as large as 65% for some models.
The Apple team argued that these results suggest current LLMs replicate reasoning patterns from their training data rather than performing genuine logical reasoning, and that existing benchmark scores may overstate models' true mathematical abilities.
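The template mechanism behind GSM-Symbolic can be illustrated with a small sketch: a symbolic problem whose numeric slots are resampled to produce fresh instances with programmatically known answers. The template and helper below are illustrative, not drawn from the actual dataset:

```python
import random

TEMPLATE = ("{name} sold clips to {x} of her friends in April, and then she "
            "sold half as many clips in May. How many clips did {name} sell "
            "altogether in April and May?")

def instantiate(seed: int) -> tuple[str, int]:
    """Sample numeric values for the template and compute the gold answer."""
    rng = random.Random(seed)
    x = rng.choice(range(10, 100, 2))  # even, so "half as many" stays integral
    question = TEMPLATE.format(name="Natalia", x=x)
    answer = x + x // 2  # April sales plus half as many in May
    return question, answer

question, answer = instantiate(seed=1)
print(question, "->", answer)
```

Because every instance's answer is computed from the sampled values, a model's accuracy can be compared across many instantiations of the same underlying problem, which is exactly how GSM-Symbolic exposes sensitivity to surface changes.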
GSM8K evaluates only the correctness of the final numeric answer, disregarding the reasoning process. A model can arrive at the correct answer through flawed logic or lucky cancellation of errors and still receive full credit. This limitation means the benchmark does not reliably assess whether a model truly understands mathematical reasoning or merely pattern-matches toward correct answers. The MR-GSM8K benchmark (Zhu et al., 2023) was created specifically to address this issue by requiring models to reason about reasoning steps themselves.
Approximately 5% of the original GSM8K test set contained errors, including ambiguous problem statements, logical inconsistencies, and mislabeled answers. This label noise means that even a perfect reasoner would be unable to score 100% on the original test set, and it introduces noise into model comparisons near the top of the leaderboard.
To address the label noise problem, researchers at MIT's Madry Lab created GSM8K-Platinum, a cleaned version of the GSM8K test set. The work was led by Edward Vendrow, Joshua Vendrow, Aleksander Madry, and Sara Beery, and was released on March 6, 2025.
The team ran multiple frontier LLMs on the GSM8K test set and flagged every question where any model's answer disagreed with the stated ground truth. This process identified 219 potentially problematic questions out of the 1,319 total test questions. Each flagged question was then manually inspected, resulting in:
| Action | Count |
|---|---|
| Questions removed (ambiguous or logically inconsistent) | 110 |
| Questions verified as correct | 99 |
| Questions with corrected answers | 10 |
No modifications were made to the question wording itself; only removals and answer corrections were applied. The resulting GSM8K-Platinum test set serves as a drop-in replacement for the original GSM8K test set and is available on Hugging Face (madrylab/gsm8k-platinum).
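The flagging step of the Platinum construction amounts to simple set logic: any question where at least one frontier model disagrees with the stated label is routed to manual inspection. The sketch below uses made-up model names and data purely for illustration:

```python
def flag_questions(labels: dict[str, str],
                   model_answers: dict[str, dict[str, str]]) -> set[str]:
    """Flag every question where any model's answer disagrees with the label."""
    flagged = set()
    for qid, gold in labels.items():
        for answers in model_answers.values():
            if answers.get(qid) != gold:
                flagged.add(qid)  # at least one disagreement: needs review
                break
    return flagged

labels = {"q1": "72", "q2": "18", "q3": "5"}
model_answers = {
    "model_a": {"q1": "72", "q2": "18", "q3": "6"},
    "model_b": {"q1": "72", "q2": "20", "q3": "5"},
}
print(sorted(flag_questions(labels, model_answers)))  # → ['q2', 'q3']
```

Note that flagging is deliberately over-inclusive: a disagreement may reflect a bad label or a genuine model error, which is why every flagged question was then inspected by hand.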
GSM8K-Platinum exposed meaningful performance gaps between models that appeared identical on the original benchmark. For example:
| Model | Errors on GSM8K | Errors on GSM8K-Platinum |
|---|---|---|
| Claude 3.7 Sonnet (extended thinking) | 45 | 2 |
| Llama 405B | 45 | 17 |
Both models made the same number of errors (45) on the original GSM8K test set. However, on the cleaned GSM8K-Platinum, Claude 3.7 Sonnet with extended thinking made only 2 genuine errors compared to Llama 405B's 17 errors. This demonstrated that the original benchmark's label noise had been masking real differences in model capability. The Madry Lab noted that Claude 3.7 Sonnet with extended thinking was released almost a year after Llama 405B and significantly outperforms it on other math benchmarks, yet this advantage was completely obscured in the original dataset. The finding supports the argument that the apparent "plateauing" of frontier models at around 95% accuracy on GSM8K was in large part caused by label noise rather than a genuine ceiling in model performance.
GSM8K occupies one position in a broader ecosystem of mathematical reasoning benchmarks. The following table compares several widely used alternatives.
| Benchmark | Introduced | Problems | Difficulty Level | Description |
|---|---|---|---|---|
| GSM8K | 2021 | 8,792 | Grade school | Multi-step arithmetic word problems requiring 2-8 steps |
| MATH | 2021 | 12,500 | Competition (high school) | Seven subjects including algebra, number theory, and geometry; competition-level difficulty |
| MGSM | 2022 | 250 per language | Grade school (multilingual) | 250 GSM8K problems translated into 10 typologically diverse languages: Bengali, Chinese, French, German, Japanese, Russian, Spanish, Swahili, Telugu, and Thai |
| MathVista | 2023 | 6,141 | Mixed | Evaluates mathematical reasoning in visual contexts across five task types |
| MR-GSM8K | 2023 | 1,319 | Grade school (meta-reasoning) | Requires models to reason about reasoning steps, not just solve problems |
| GSM1K | 2024 | 1,000 | Grade school | Contamination-free mirror of GSM8K created with manual annotations for overfitting detection |
| GSM-Symbolic | 2024 | Variable | Grade school (parameterized) | Symbolic templates with variable numbers testing robustness |
| AIME | Adapted 2024 | ~30/year | Olympiad (high school) | Problems from the American Invitational Mathematics Examination |
| FrontierMath | 2024 | Hundreds | Research-level | Original problems spanning most major branches of modern mathematics; created by Epoch AI with 60+ expert mathematicians |
| GSM8K-Platinum | 2025 | ~1,209 | Grade school | Cleaned version of GSM8K test set with label noise removed |
The MATH benchmark (Hendrycks et al., 2021) contains 12,500 problems drawn from American math competitions including the AMC and AIME. These problems require advanced skills in algebra, geometry, number theory, counting, and probability. While GSM8K problems can be solved with basic arithmetic, MATH problems demand creative problem-solving techniques. Both benchmarks were released in 2021, and they are frequently reported together as complementary measures of mathematical reasoning at different difficulty levels.
The American Invitational Mathematics Examination (AIME) has emerged as a preferred benchmark for evaluating frontier reasoning models on difficult mathematics. OpenAI's o3 model scored 96.7% on AIME 2024, demonstrating capabilities well beyond what GSM8K can measure. The shift from GSM8K to AIME-level benchmarks reflects the broader trend of models outgrowing elementary math evaluations.
Multilingual Grade School Math (MGSM) translates 250 GSM8K problems into 10 typologically diverse languages. MGSM evaluates whether mathematical reasoning capabilities transfer across languages or remain primarily English-centric, revealing significant capability gaps in lower-resource languages for many models.
GSM8K's influence on the field of AI research extends well beyond its role as a leaderboard.
GSM8K provided the testing ground for some of the most important prompting innovations of the 2020s. Chain-of-thought prompting, self-consistency, process reward models, and various verification strategies were all evaluated primarily on GSM8K. The benchmark's moderate difficulty made it ideal for this purpose: it was hard enough to show meaningful differences between methods but not so hard that most approaches scored near zero.
The verifier approach introduced in the original GSM8K paper laid the groundwork for later research on process reward models (PRMs) and outcome reward models (ORMs). OpenAI's subsequent work on process supervision, which involves training models to evaluate the correctness of each individual reasoning step rather than just the final answer, drew directly from the verification framework developed for GSM8K. This line of research has become central to how reasoning models like OpenAI's o1 and o3 are trained.
The original paper's finding that generating multiple candidate solutions and selecting the best one (test-time compute scaling) could substitute for massive increases in model size foreshadowed a broader trend in AI research. The concept of spending more computation at inference time, rather than solely at training time, has become a key design principle behind reasoning-focused models. OpenAI's o-series models represent the most prominent application of this principle, using extended internal chain-of-thought reasoning at inference time to improve accuracy on challenging problems.
GSM8K helped establish the convention of reporting benchmark scores in model release announcements and technical reports. Alongside MMLU and HumanEval, GSM8K became part of the standard trio of benchmarks that virtually every major language model was evaluated on from 2022 through 2024.
The issues that emerged with GSM8K, including data contamination, label noise, and saturation, have provided valuable lessons for the design of future benchmarks. The creation of GSM1K, GSM-Symbolic, and GSM8K-Platinum all represent direct responses to limitations identified in the original benchmark. These efforts have pushed the community toward practices such as using held-out problem sets, template-based problem generation, systematic error auditing, and designing benchmarks with higher difficulty ceilings.
GSM8K is freely available under the MIT License. The primary distribution channels are:
- The official GitHub repository (openai/grade-school-math), which hosts the raw data and the paper's reference code.
- The Hugging Face Hub (openai/gsm8k), which serves both the main and Socratic configurations.
The repository includes the training and test data in both the main and Socratic configurations, along with example model solutions from 6B and 175B parameter models and reference code for answer extraction and evaluation. The total dataset size is approximately 5.89 MB.