# GSM8K

> Source: https://aiwiki.ai/wiki/gsm8k
> Updated: 2026-06-20
> Categories: AI Benchmarks, Large Language Models, Machine Learning, Reasoning Models
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**GSM8K** (Grade School Math 8K) is a benchmark dataset of 8,792 grade-school-level math word problems created by researchers at [OpenAI](/wiki/openai) to evaluate the multi-step mathematical reasoning capabilities of [large language models](/wiki/large_language_model). It was introduced in the 2021 paper "Training Verifiers to Solve Math Word Problems" by Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, [Jerry Tworek](/wiki/jerry_tworek), Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman, which describes it as "a dataset of 8.5K high quality linguistically diverse grade school math word problems"; the round figure of 8.5K is still how the benchmark is often cited.[^1] GSM8K has become one of the most widely used benchmarks for measuring how well language models handle arithmetic reasoning tasks that require multiple sequential steps.[^1]

Each problem requires between 2 and 8 steps of elementary arithmetic (addition, subtraction, multiplication, and division), and the dataset is split into 7,473 training problems and 1,319 test problems.[^1] The benchmark played a central role in the development of [chain-of-thought prompting](/wiki/chain_of_thought) and verification-based inference strategies. Its influence extends across hundreds of research papers, and it remains a standard evaluation reported in model technical reports and academic publications, even though frontier models had effectively saturated it by 2025-2026.

## What is GSM8K used for?

GSM8K is used to measure a language model's ability to perform multi-step mathematical reasoning on problems that are conceptually simple for humans but require a chain of sequential calculations. It serves three main roles in practice: as a primary evaluation benchmark in model technical reports and academic papers, as a testbed for new prompting and inference techniques (chain-of-thought, self-consistency, program-aided reasoning, and verifier-based selection), and as a lightweight "smoke test" that confirms a model can follow a reasoning chain to a correct numeric answer. Because each problem has a single integer answer, scoring is fully automatic.

## Background and Motivation

Before GSM8K was introduced, existing math benchmarks for language models were limited in scope and difficulty calibration. Many datasets either contained single-step arithmetic problems that failed to test genuine reasoning, or featured competition-level mathematics that was too difficult to provide meaningful signal for most models. The OpenAI team recognized the need for a benchmark that occupied a middle ground: problems that were conceptually straightforward for humans but required multiple reasoning steps, making them genuinely challenging for language models. As the authors put it in the paper's abstract, "State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning."[^1]

The core insight behind GSM8K was that even the largest transformer models at the time struggled to solve problems that a bright middle school student could handle. This gap between human capability and model performance on basic multi-step math highlighted a fundamental limitation in how language models approached sequential reasoning tasks. By creating a carefully curated dataset of such problems, the researchers aimed to provide a clear, measurable target for improvement.

The timing of GSM8K's release coincided with growing interest in scaling laws and emergent abilities of large language models. Researchers needed benchmarks that could expose specific weaknesses in model reasoning, and grade school math provided an accessible yet challenging domain for this purpose.

## Dataset Composition

GSM8K contains 8,792 problems in total, divided into a training set of 7,473 problems and a test set of 1,319 problems. All problems were written by human contributors and underwent rigorous quality control.[^1]

### Problem Characteristics

Each problem in GSM8K is a natural language word problem that requires between 2 and 8 steps to solve. Solutions involve only basic arithmetic operations: addition, subtraction, multiplication, and division. The problems are designed so that all intermediate calculations are manageable without a calculator (for example, multiplying 7 by 8 or adding 36 and 110). Every problem has a single integer as its final answer.

The problems cover a range of everyday scenarios involving money, quantities, time, distances, and other practical contexts. They are linguistically diverse, meaning they use varied sentence structures and vocabulary rather than following a single template.

### Example Problem

A typical GSM8K problem looks like this:

> **Question:** Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
>
> **Answer:** Natalia sold 48 / 2 = <<48/2=24>>24 clips in May. Natalia sold 48 + 24 = <<48+24=72>>72 clips altogether in April and May. #### 72

The answer format includes natural language explanations interleaved with calculator annotations (the `<<expression=result>>` notation), followed by the delimiter `####` and the final numeric answer. This structured format allows automated scoring by simply parsing the number after the `####` token.
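
The format described above can be parsed with a few lines of Python. This is a minimal sketch rather than an official scoring script: the function name `parse_reference` and the regular expression are illustrative choices, and the code assumes the reference answer contains the `####` delimiter followed by an integer.

```python
import re

# Matches calculator annotations of the form <<expression=result>>.
ANNOTATION = re.compile(r"<<([^<>=]+)=([^<>]+)>>")

def parse_reference(answer: str):
    """Split a GSM8K reference answer into (calculator steps, final answer)."""
    # Everything before the last '####' is the worked solution.
    body, _, final = answer.rpartition("####")
    steps = ANNOTATION.findall(body)
    # Final answers are integers; drop thousands separators if present.
    final_value = int(final.strip().replace(",", ""))
    return steps, final_value

example = (
    "Natalia sold 48 / 2 = <<48/2=24>>24 clips in May. "
    "Natalia sold 48 + 24 = <<48+24=72>>72 clips altogether in April and May. "
    "#### 72"
)
steps, final_value = parse_reference(example)
print(steps)        # [('48/2', '24'), ('48+24', '72')]
print(final_value)  # 72
```

Counting the extracted annotations also gives a rough proxy for the number of calculation steps in a solution, although not every reasoning step necessarily carries an annotation.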

### Data Format

The dataset is stored as JSON Lines (`.jsonl`) files, with each line containing a dictionary with two keys: `"question"` and `"answer"`. The dataset is available in two configurations:

| Configuration | Description | Train | Test |
|---|---|---|---|
| Main | Standard question-answer pairs | 7,473 | 1,319 |
| Socratic | Includes auto-generated Socratic subquestions before each solution step | 7,473 | 1,319 |

The Socratic variant includes additional guiding subquestions (such as "How many clips did Natalia sell in May?") prepended to each reasoning step. These subquestions were generated by a specialized fine-tuned model trained on approximately 800 examples.
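
Reading a JSON Lines file of this shape needs only the standard library. The sketch below uses an in-memory string as a stand-in for an opened file; the helper name `read_jsonl` is hypothetical, and the record content is the example problem shown earlier in this article.

```python
import io
import json

def read_jsonl(handle):
    """Yield one {"question": ..., "answer": ...} dict per non-empty line."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)

# Stand-in for a real file handle, e.g. open(path, encoding="utf-8").
sample = io.StringIO(
    json.dumps({
        "question": "Natalia sold clips to 48 of her friends in April, and then "
                    "she sold half as many clips in May. How many clips did "
                    "Natalia sell altogether in April and May?",
        "answer": "Natalia sold 48 / 2 = <<48/2=24>>24 clips in May. "
                  "Natalia sold 48 + 24 = <<48+24=72>>72 clips altogether "
                  "in April and May. #### 72",
    }) + "\n"
)

records = list(read_jsonl(sample))
print(sorted(records[0].keys()))  # ['answer', 'question']
```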

## Dataset Creation Process

The problems in GSM8K were created by Surge AI in partnership with OpenAI's reinforcement learning team. The creation process involved several deliberate steps to ensure quality and diversity.[^13]

### Writer Selection

Surge AI assembled a team of mathematically proficient writers, prioritizing contributors with math or STEM degrees. This background reduced calculation errors, improved writing speed, and enabled more diverse problem designs. All contributors had their initial five submissions peer-reviewed before being accepted onto the full team.

### Guidelines

OpenAI established specific criteria for acceptable problems:

- Solutions must require 2 to 8 reasoning steps
- All intermediate calculations should be mentally manageable
- Final answers must be single integers
- Problems should show explicit calculation steps rather than just stating results
- Only elementary arithmetic operations are permitted
- No repetition of problem scenarios across the dataset

### Quality Control

Quality assurance involved multiple layers:

- **Duplicate prevention:** Sentence [embeddings](/wiki/embeddings) were computed for all problems, and pairs exceeding cosine similarity thresholds were flagged and eliminated.
- **Mathematical accuracy:** Two independent reviewers solved each problem to catch ambiguities and errors. Any discrepancies triggered a careful review of the problem wording to ensure only a single correct interpretation existed.
- **Diversity checks:** The team monitored the dataset for repetitive patterns and encouraged varied problem scenarios.
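
The duplicate-prevention step can be illustrated with a small sketch. The actual embedding model and similarity threshold used for GSM8K are not stated here, so this example substitutes simple bag-of-words count vectors for sentence embeddings and uses an arbitrary threshold of 0.8; both are assumptions made purely for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations

def embed(text):
    """Toy stand-in for a sentence embedding: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def flag_near_duplicates(problems, threshold=0.8):
    """Return index pairs whose similarity meets or exceeds the threshold."""
    vectors = [embed(p) for p in problems]
    return [
        (i, j)
        for i, j in combinations(range(len(problems)), 2)
        if cosine(vectors[i], vectors[j]) >= threshold
    ]

problems = [
    "Tom has 3 apples and buys 4 more apples. How many apples does Tom have?",
    "Tom has 5 apples and buys 4 more apples. How many apples does Tom have?",
    "A train travels 60 miles per hour for 2 hours. How far does it go?",
]
flagged = flag_near_duplicates(problems)
print(flagged)  # [(0, 1)]
```

A real pipeline would replace `embed` with a learned sentence-embedding model, which can catch paraphrases that share few surface tokens.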

The original paper reported that this process yielded an estimated error rate below 2 percent for the final-answer labels.[^1] Despite these measures, later independent analysis revealed that roughly 5% of the test set contained errors, ambiguities, or logical inconsistencies, a finding that eventually led to the creation of GSM8K-Platinum (discussed below).

## The Verifier Approach

The original GSM8K paper proposed a novel strategy for improving model performance that went beyond standard [fine-tuning](/wiki/fine_tuning). Rather than simply training a model to generate correct solutions, the researchers introduced the concept of training a separate **verifier** model to evaluate the correctness of candidate solutions. As the abstract states, "To increase performance, we propose training verifiers to judge the correctness of model completions," adding that "verification scales more effectively with increased data than a finetuning baseline."[^1]

### How Verification Works

The verification procedure operates as follows:

1. **Candidate generation:** Given a math problem, the primary model (the generator) produces a large number of candidate solutions (for example, 100 candidates per problem) by sampling at high temperature.
2. **Labeling:** Each candidate is labeled as correct or incorrect based on whether its final numeric answer matches the ground truth.
3. **Verifier training:** A separate transformer model is trained to estimate the probability that a given solution is correct. The verifier takes the concatenation of the problem and a candidate solution as input and is optimized with both a verification loss and the standard language modeling loss.
4. **Selection:** At test time, the model generates multiple candidate solutions, and the verifier selects the highest-ranked one.
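
The labeling and selection steps above can be sketched in a few lines. This is not the paper's implementation: the verifier here is a hypothetical lookup table of scores standing in for a trained model, and the candidate strings are hand-written rather than sampled from a generator.

```python
import re

def final_answer(solution):
    """Return the integer after '####', or None if it is missing."""
    match = re.search(r"####\s*(-?[\d,]+)", solution)
    return int(match.group(1).replace(",", "")) if match else None

def label_candidates(candidates, ground_truth):
    """Step 2: label each sampled solution by final-answer match."""
    return [(c, final_answer(c) == ground_truth) for c in candidates]

def select_best(problem, candidates, verifier):
    """Step 4: return the candidate the verifier scores highest."""
    return max(candidates, key=lambda c: verifier(problem, c))

problem = "Natalia sold clips to 48 friends in April and half as many in May. Total?"
candidates = [
    "48 + 48 = 96 #### 96",
    "48 / 2 = 24 and 48 + 24 = 72 #### 72",
    "I am not sure.",
]

# Hypothetical verifier: fixed scores in place of a trained model's
# estimated probability that a solution is correct.
toy_scores = {candidates[0]: 0.20, candidates[1]: 0.90, candidates[2]: 0.05}

def toy_verifier(problem, candidate):
    return toy_scores[candidate]

labels = [is_correct for _, is_correct in label_candidates(candidates, 72)]
best = select_best(problem, candidates, toy_verifier)
print(labels)  # [False, True, False]
print(best)    # 48 / 2 = 24 and 48 + 24 = 72 #### 72
```

The labels produced in step 2 are what a verifier would be trained on; at test time no ground truth is available, so only the verifier's scores drive the selection.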

The key distinction between the verifier approach and standard fine-tuning is that fine-tuning relies on generating a single solution (greedy or low-temperature), while verification leverages test-time compute by generating many solutions and selecting among them. This tradeoff between training-time and inference-time computation became a recurring theme in later AI research and a direct precursor to the [process reward model](/wiki/process_reward_model) (PRM) framework now used to train reasoning models.

### Results from the Original Paper

The paper tested this approach using [GPT-3](/wiki/gpt-3) models of varying sizes and reported results primarily through figures. The following approximate results were extracted from the paper's plots:

| Model Configuration | GSM8K Test Accuracy |
|---|---|
| 6B parameter model, fine-tuned (single sample) | ~20% |
| 6B parameter model, final-answer-only (no intermediate steps) | ~5.2% |
| 175B parameter model, fine-tuned (single sample) | ~33% |
| 6B parameter model with verifier (100 candidates) | Slightly above 175B fine-tuned |
| 175B parameter model with verifier (100 candidates) | ~55% |

A striking finding was that the 6B verification model slightly outperformed the fine-tuned 175B model on the full training set, delivering a boost roughly equivalent to a 30x increase in model size.[^1] The authors also noted that based on their fine-tuning baseline, a model with approximately 10^16 parameters would be needed to reach an 80% solve rate using standard generation methods alone. This underscored the value of the verification approach as an alternative to simply scaling up model parameters.

The paper additionally found that verification scaled more effectively with additional training data compared to the fine-tuning baseline, and that token-level verification outperformed solution-level verification. The importance of intermediate steps was demonstrated clearly: when a 6B model was trained to output only final answers without showing its work, accuracy dropped from approximately 20% to just 5.2%.

## Evaluation Methodology

GSM8K uses a straightforward evaluation protocol. A model's response is considered correct if and only if it produces the exact final numeric answer. The answer is extracted from the text following the `####` delimiter or, in free-form generation, from the last number in the model's response.
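
A minimal sketch of this extraction rule follows. Evaluation harnesses differ in details such as how they handle commas, units, or negative numbers, so the regular expression and helper names below are assumptions rather than a reference implementation.

```python
import re

# A number with optional sign, thousands separators, and decimal part.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_prediction(response):
    """Prefer the text after '####'; otherwise use the last number found."""
    if "####" in response:
        response = response.rsplit("####", 1)[1]
    numbers = NUMBER.findall(response)
    if not numbers:
        return None
    return float(numbers[-1].replace(",", ""))

def is_correct(response, reference):
    """Exact match between the extracted number and the reference answer."""
    prediction = extract_prediction(response)
    return prediction is not None and prediction == float(reference)

print(is_correct("She sold 48 + 24 = 72 clips.", 72))    # True
print(extract_prediction("The total is #### 1,000"))     # 1000.0
print(extract_prediction("I cannot solve this problem"))  # None
```

Falling back to the last number is a heuristic: a response that ends with an unrelated figure after stating the right answer would be marked wrong.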

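Three reporting conventions are common:

- **test@1** (or **pass@1**): Accuracy when only one solution is generated per problem (typically with greedy or low-temperature decoding)
- **test@N**: In the original paper, the percentage of problems solved correctly at least once when the model is allowed N separate guesses[^1]
- **maj@N**: Accuracy when N candidate solutions are sampled and a single final answer is chosen by majority voting (self-consistency); verifier-based selection among N samples is reported analogously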
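
Once answers have been extracted from N samples per problem, the simplest summary numbers are easy to compute. The sketch below uses made-up sample answers for three hypothetical problems; it shows single-sample accuracy and the share of problems solved by at least one sample, which is an upper bound on what any selection rule over those samples could achieve.

```python
def accuracy_at_1(samples, references):
    """Fraction of problems whose first sampled answer is correct."""
    hits = sum(s[0] == ref for s, ref in zip(samples, references))
    return hits / len(references)

def solved_at_least_once(samples, references):
    """Fraction of problems where any of the N samples is correct."""
    hits = sum(ref in s for s, ref in zip(samples, references))
    return hits / len(references)

# Illustrative data only: three problems, three sampled answers each.
samples = [[72, 72, 96], [10, 12, 12], [5, 6, 7]]
references = [72, 12, 8]

print(round(accuracy_at_1(samples, references), 3))         # 0.333
print(round(solved_at_least_once(samples, references), 3))  # 0.667
```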

### Prompting Strategies

Performance on GSM8K varies substantially depending on the prompting strategy used:

| Prompting Strategy | Description | Typical Impact |
|---|---|---|
| Zero-shot | No examples provided | Lower accuracy; models may not format answers correctly |
| Few-shot | 4 to 8 example problems with solutions | Significant accuracy boost; standard evaluation setting |
| [Chain-of-thought](/wiki/chain_of_thought) (CoT) | Few-shot with explicit step-by-step reasoning | Major improvement; became the standard approach after Wei et al. (2022) |
| Self-consistency | Sample multiple CoT paths and take majority vote | Further gains of 10-20 percentage points over single-sample CoT |
| Program-aided (PAL) | Model writes Python code; an interpreter executes it | Reduces arithmetic errors; competitive with CoT |
| Program of Thoughts (PoT) | Model expresses reasoning as a program executed externally | ~12% average improvement over CoT across numerical reasoning datasets |

The most common evaluation configuration in published benchmarks uses either 5-shot or 8-shot chain-of-thought prompting, or zero-shot prompting for instruction-tuned models that already produce step-by-step reasoning without explicit examples.
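
A few-shot chain-of-thought prompt is just worked examples concatenated ahead of the target question. The sketch below assumes a simple "Q: / A: ... The answer is N." layout; the exact wording, exemplar set, and separators vary between papers and evaluation harnesses, so treat the template as illustrative.

```python
def build_cot_prompt(exemplars, question):
    """Concatenate worked examples, then leave the final answer slot open."""
    parts = []
    for ex_question, ex_reasoning, ex_answer in exemplars:
        parts.append(
            f"Q: {ex_question}\nA: {ex_reasoning} The answer is {ex_answer}."
        )
    # The model is expected to continue after the trailing "A:".
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

exemplars = [
    (
        "Natalia sold clips to 48 of her friends in April, and then she sold "
        "half as many clips in May. How many clips did Natalia sell altogether?",
        "In May she sold 48 / 2 = 24 clips. Altogether she sold 48 + 24 = 72 clips.",
        72,
    ),
]
prompt = build_cot_prompt(exemplars, "A pen costs 3 dollars. How much do 7 pens cost?")
print(prompt)
```

An 8-shot evaluation would pass eight such exemplars; a zero-shot evaluation would pass an empty list and rely on the model to reason step by step on its own.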

## Impact on Chain-of-Thought Prompting

GSM8K became deeply intertwined with the development of chain-of-thought (CoT) prompting, one of the most influential techniques in modern [prompt engineering](/wiki/prompt_engineering). In their 2022 paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," Jason Wei and colleagues at [Google](/wiki/google) demonstrated that providing a few worked-out examples (exemplars) in the prompt could dramatically improve a model's ability to solve multi-step reasoning problems.[^2]

Using GSM8K as a primary evaluation benchmark, Wei et al. showed that prompting the [PaLM](/wiki/palm) 540B model with just eight chain-of-thought exemplars achieved 56.9% accuracy, narrowly surpassing the fine-tuned GPT-3 175B with a verifier from the original Cobbe et al. paper while being far simpler to apply.[^10] Before CoT prompting, standard few-shot prompting with PaLM 540B achieved only about 18% on GSM8K. This was a landmark result because it demonstrated that reasoning capabilities could emerge from sufficiently large models through careful prompting alone, without any fine-tuning or specialized training.

Chain-of-thought prompting was found to be an emergent property of model scale. The performance gains appeared only in models with roughly 100 billion parameters or more; smaller models showed little to no improvement from CoT prompting.

### Self-Consistency

Building on the CoT framework, Wang et al. (2023) introduced **self-consistency**, a decoding strategy that samples multiple diverse reasoning paths and selects the most common final answer through majority voting. Applied to PaLM 540B with chain-of-thought prompting, self-consistency produced a 17.9 percentage point improvement on GSM8K over standard CoT prompting, reaching approximately 74.4% accuracy.[^3] This technique demonstrated that there is substantial value in exploring multiple solution paths rather than relying on a single greedy decode.
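
The voting step of self-consistency reduces to counting final answers. The sketch below assumes the final answers have already been extracted from each sampled reasoning path, with `None` marking a path that produced no parsable answer; the sampled values are invented for illustration.

```python
from collections import Counter

def majority_answer(answers):
    """Return the most frequent non-missing answer, or None if there is none."""
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None
    # most_common breaks ties in favor of the answer seen first.
    return votes.most_common(1)[0][0]

# Six hypothetical reasoning paths for the same problem.
sampled_answers = [72, 96, 72, None, 72, 24]
print(majority_answer(sampled_answers))  # 72
```

Individual paths can be wrong in different ways, while correct paths tend to agree on the same number, which is why the vote often recovers the right answer even when a single greedy decode would not.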

### Minerva

Google's Minerva model (Lewkowycz et al., 2022), a 540B parameter model fine-tuned specifically on mathematical and scientific data, pushed the state of the art further. Minerva 540B achieved 78.5% accuracy on GSM8K with majority voting, improving upon the previous best of 74.4%.[^4] This result demonstrated that combining domain-specific fine-tuning with test-time techniques like self-consistency could yield additional gains.

### Program-Aided Approaches

Two closely related papers from late 2022 explored a different strategy: offloading arithmetic to an external interpreter. **PAL (Program-Aided Language Models)** by Gao, Madaan, Zhou, Alon, Liu, Yang, Callan, and Neubig (2022)[^14] and **Program of Thoughts (PoT)** by Chen, Ma, Wang, and Cohen (2022)[^15] both demonstrated that prompting a code-capable model (such as Codex) to emit a Python program and then executing that program could substantially reduce arithmetic errors. On GSM8K, PAL with Codex achieved roughly 72% accuracy, and PoT achieved comparable results. The PAL paper additionally introduced the **GSM-Hard** dataset, constructed by replacing the numbers in GSM8K with larger, less common values to stress-test arithmetic robustness; PAL outperformed standard CoT by roughly 40 absolute percentage points on GSM-Hard.
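
The program-aided idea can be sketched as follows. The string `generated_program` is a hand-written stand-in for what a code-capable model might emit for the Natalia problem, and `run_program` is a hypothetical helper; neither is taken from the PAL or PoT codebases.

```python
# Stand-in for model output: the reasoning is expressed as Python,
# and the arithmetic is left to the interpreter.
generated_program = '''
def solution():
    clips_april = 48
    clips_may = clips_april / 2
    return clips_april + clips_may
'''

def run_program(source):
    """Execute the program text and return the value of solution()."""
    namespace = {}
    # exec runs arbitrary code: real systems should sandbox model output.
    exec(source, namespace)
    return namespace["solution"]()

result = run_program(generated_program)
print(result)  # 72.0
```

Because the interpreter performs the calculation, swapping 48 for a much larger or less common number (as GSM-Hard does) changes nothing about the program's correctness, whereas a model doing the arithmetic in text may slip.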

## Historical Performance and Model Scores

GSM8K performance has improved dramatically since the benchmark was introduced, tracking the rapid progress of language model capabilities. The following table summarizes notable scores across different models and time periods.

### GSM8K Accuracy by Model

| Model | Organization | Year | GSM8K Accuracy | Method |
|---|---|---|---|---|
| GPT-3 6B (fine-tuned) | [OpenAI](/wiki/openai) | 2021 | ~20% | Fine-tuning |
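| GPT-3 175B (fine-tuned) | [OpenAI](/wiki/openai) | 2021 | ~33% | Fine-tuning |
| GPT-3 175B + verifier | [OpenAI](/wiki/openai) | 2021 | ~55% | Fine-tuning + verifier (100 candidates) |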
| [PaLM](/wiki/palm) 540B | [Google](/wiki/google) | 2022 | 56.9% | 8-shot CoT |
| PaLM 540B + Self-Consistency | [Google](/wiki/google) | 2022 | 74.4% | CoT + majority voting |
| Minerva 540B | [Google](/wiki/google) | 2022 | 78.5% | Majority voting |
| GPT-3.5 Turbo | [OpenAI](/wiki/openai) | 2023 | ~57% | 5-shot CoT |
| Claude 2 | [Anthropic](/wiki/anthropic) | 2023 | 88.0% | 0-shot CoT |
| [GPT-4](/wiki/gpt-4) | [OpenAI](/wiki/openai) | 2023 | 92.0% | 5-shot CoT |
| Gemini 1.5 Pro | [Google](/wiki/google) | 2024 | 90.8% | 0-shot |
| Gemini 1.5 Flash | [Google](/wiki/google) | 2024 | 86.2% | 0-shot |
| [Claude 3 Opus](/wiki/claude_3_opus) | [Anthropic](/wiki/anthropic) | 2024 | 95.0% | 0-shot |
| [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet) | [Anthropic](/wiki/anthropic) | 2024 | 96.4% | 0-shot |
| [GPT-4o](/wiki/gpt_4o) | [OpenAI](/wiki/openai) | 2024 | ~95.6% | 0-shot CoT |
| [Llama 3.1](/wiki/llama_3_1) 405B Instruct | [Meta](/wiki/meta) | 2024 | 96.8% | 0-shot |
| [Qwen](/wiki/qwen) 2.5 72B Instruct | Alibaba | 2024 | 95.8% | 0-shot |
| [DeepSeek V3](/wiki/deepseek_v3) | DeepSeek | 2024 | ~89.3% | 0-shot CoT |
| Mistral Large 2 | [Mistral AI](/wiki/mistral) | 2024 | 93.0% | 0-shot |
| [OpenAI o1](/wiki/o1) | [OpenAI](/wiki/openai) | 2024 | 96.4% | Internal CoT |
| [DeepSeek-R1](/wiki/deepseek_r1) | DeepSeek | 2025 | ~95.5% / 96.1% | RL-trained reasoning |
| [Claude 3.7 Sonnet](/wiki/claude_3_7_sonnet) | [Anthropic](/wiki/anthropic) | 2025 | 96.6% (45 errors) | Extended thinking |
| GPT-4.5 | [OpenAI](/wiki/openai) | 2025 | 97.0% | 0-shot |
| [Kimi K2](/wiki/kimi_k2) Instruct | [Moonshot AI](/wiki/moonshot_ai) | 2025 | 97.3% | 0-shot |
| [OpenAI o3](/wiki/o3) | [OpenAI](/wiki/openai) | 2025 | Near-saturated (rarely reported) | Reasoning |
| [Claude 4](/wiki/claude_4) | [Anthropic](/wiki/anthropic) | 2025 | Not separately reported | Reasoning |
| [GPT-5](/wiki/gpt-5) | [OpenAI](/wiki/openai) | 2025 | Not separately reported | Reasoning |

Several patterns are visible in this progression. First, performance jumped significantly between 2021 and 2023 as chain-of-thought prompting, instruction tuning, and [reinforcement learning from human feedback](/wiki/rlhf) (RLHF) were adopted. Second, by 2024, multiple models from different organizations converged above the 95% mark, signaling that the benchmark was approaching saturation for frontier models. Third, the gap between open-weight models (such as Llama 3.1 405B) and proprietary models narrowed considerably. Fourth, by 2025, major labs increasingly stopped reporting GSM8K in their flagship model release announcements because the scores no longer differentiated between top systems; AIME, MATH-500, and [FrontierMath](/wiki/frontiermath) replaced it as the primary math reporting benchmarks.

### Key Milestones

The history of GSM8K scores can be divided into several distinct phases:

**2021: Baseline Era.** GPT-3 models set the initial bar. Even the largest 175B parameter model could only solve roughly a third of the problems with standard fine-tuning. The verification approach raised this to about 55% but required generating many candidate solutions.

**2022: Chain-of-Thought Revolution.** PaLM 540B with CoT prompting achieved 56.9%, matching or exceeding fine-tuned models without any task-specific training. Self-consistency pushed results to 74.4%, and Minerva reached 78.5%. PAL and PoT introduced program-aided approaches. These results demonstrated that prompting techniques and specialized training data could dramatically improve math performance.

**2023: GPT-4 Breakthrough.** GPT-4 achieved 92.0% with 5-shot CoT, approaching human-level performance for the first time.[^5] This result suggested that general-purpose scaling combined with instruction tuning and RLHF could largely solve grade-school math.

**2024: Saturation and Open-Weight Parity.** Multiple models from OpenAI, Anthropic, Meta, and Alibaba reached or exceeded 95%. The release of OpenAI's o1 introduced the first widely deployed [reasoning model](/wiki/reasoning_models), demonstrating that test-time compute scaling via long internal chain-of-thought could deliver further gains on harder benchmarks.[^11]

**2025-2026: Effective Saturation.** DeepSeek-R1, Kimi K2, GPT-4.5, Claude 3.7 Sonnet, Claude 4, and GPT-5 all sit at or near the apparent ceiling of GSM8K.[^17] The research community has largely moved to [AIME 2024](/wiki/aime_2024), [AIME 2025](/wiki/aime_2025), MATH-500, [FrontierMath](/wiki/frontiermath), and other harder benchmarks. GSM8K is now most commonly used as a smoke test or as a component of broader [LLM benchmark](/wiki/benchmarks) timelines.

## Benchmark Saturation and Criticisms

As model scores on GSM8K have climbed above 95%, several criticisms of the benchmark have gained prominence.

### Is GSM8K still a useful benchmark?

By late 2024, nearly all frontier language models scored above 95% on GSM8K, making it difficult to distinguish between top-performing models on this benchmark alone. OpenAI, Anthropic, and Google have all shifted their primary benchmark reporting toward more challenging evaluations such as AIME, MATH-500, and FrontierMath. Some organizations no longer prominently report GSM8K scores for their latest models.

The saturation problem is compounded by the fact that the benchmark has a hard ceiling: even a perfect reasoner cannot score 100% on the original test set due to label noise (erroneous or ambiguous ground-truth answers). This means that scores above roughly 95% are difficult to interpret, as errors may reflect noise in the benchmark rather than failures in reasoning. As a result, GSM8K remains useful mainly as a fast sanity check on basic reasoning and answer formatting rather than as a way to rank top systems.

### Is GSM8K contaminated in training data?

A significant concern is that some models may have been exposed to GSM8K test problems (or very similar problems) during training. The dataset has been publicly available on GitHub and [Hugging Face](/wiki/hugging_face) since 2021, and its contents have likely been included in many web-scraped pre-training corpora.

Zhang et al. (2024) investigated this issue systematically in the paper "A Careful Examination of Large Language Model Performance on Grade School Arithmetic," published as a Spotlight paper at [NeurIPS](/wiki/neurips) 2024.[^6] To test for contamination, the researchers commissioned **GSM1K**, a new dataset of 1,000 grade-school math problems created entirely through manual annotations (without any LLM-generated content) and designed to mirror the style and difficulty of GSM8K while being guaranteed not to appear in any model's training data.

Key findings included:

- Accuracy drops of up to 13% were observed when comparing model performance on GSM8K versus GSM1K
- Several model families showed evidence of systematic overfitting across almost all model sizes
- A positive correlation (Spearman's r-squared = 0.36) was found between a model's probability of generating an example from GSM8K verbatim and its performance gap between GSM8K and GSM1K
- Microsoft's Phi-3 showed an almost 10% accuracy drop between GSM8K and GSM1K
- [Frontier models](/wiki/frontier_models) from OpenAI and Anthropic generally showed minimal signs of overfitting

These findings illustrate an instance of Goodhart's law: when a benchmark becomes a widely tracked target, it risks losing its value as a genuine measure of capability. Models optimized (intentionally or not) for GSM8K may not generalize to novel math problems of equivalent difficulty.

### GSM-Symbolic and GSM-NoOp

In October 2024, Apple researchers released **GSM-Symbolic** (Mirzadeh, Alizadeh, Shahrokhi, Tuzel, Bengio, and Farajtabar, 2024), a variant of GSM8K that uses symbolic templates to generate new problem instances with different numerical values and entity names.[^7] Their research, published at ICLR 2025, evaluated more than 20 open and closed models using 5,000 samples from 100 templates and found that:

- Performance declined for all tested models when only the numerical values in problems were changed, suggesting sensitivity to specific numbers rather than understanding of mathematical relationships.
- Variance across instantiations of the same template was non-trivial, indicating that a single GSM8K-style score may not reliably reflect underlying capability.
- Performance degraded significantly as the number of clauses in a problem increased. The authors observed that model "performance significantly deteriorates as the number of clauses in a question increases."[^7]

The same paper introduced **GSM-NoOp**, a variant in which a single seemingly-relevant but logically-irrelevant clause is appended to each problem. The authors report that "adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models," including OpenAI's o1-preview and o1-mini, despite the inserted text contributing nothing to the reasoning chain needed for the answer.[^7] The Apple team argued that these results suggest current LLMs replicate reasoning patterns from their training data rather than performing formal logical reasoning, hypothesizing that "current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data."[^7] On this view, existing benchmark scores may overstate models' true mathematical abilities. The GSM-Symbolic and GSM-NoOp datasets are publicly released at github.com/apple/ml-gsm-symbolic.
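
The templating idea behind both variants can be sketched with the example problem from this article. The name list, number range, and distractor sentence below are invented for illustration and are not drawn from the released datasets.

```python
import random

NAMES = ["Natalia", "Sofia", "Maya"]  # hypothetical name pool

def instantiate(rng, with_noop=False):
    """Fill the template with a random name and number; return (text, answer)."""
    name = rng.choice(NAMES)
    n = rng.choice(range(10, 100, 2))  # even, so that half of n is an integer
    statement = (
        f"{name} sold clips to {n} of her friends in April, "
        "and then she sold half as many clips in May."
    )
    # NoOp-style clause: sounds relevant but does not change the answer.
    distractor = (
        " Five of the clips sold in May were smaller than average."
        if with_noop else ""
    )
    question = f" How many clips did {name} sell altogether in April and May?"
    return statement + distractor + question, n + n // 2

plain_text, plain_answer = instantiate(random.Random(1))
noop_text, noop_answer = instantiate(random.Random(1), with_noop=True)
print(plain_text, plain_answer)
print(noop_text, noop_answer)
```

Both calls use the same seed, so they draw the same name and number: the only difference is the extra clause, and the correct answer is unchanged. A model that reasons over the underlying relationship should be indifferent to both the resampled values and the distractor.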

### GSM-Plus

A parallel robustness study by Li, Cui, Zhao, Kong, and Bi, published at ACL 2024, introduced **GSM-Plus**, which expands each of the 1,319 GSM8K test problems into eight perturbed variants, yielding 10,552 total questions.[^16] The perturbations span five categories: numerical variation, arithmetic variation (reversing or adding operations), problem understanding (rephrasing), distractor insertion, and critical thinking (questions missing necessary information). Evaluating 25 LLMs and 4 prompting techniques, the authors found that even on problems a model solves correctly in their original GSM8K form, performance frequently collapses when new statements are added or the question target is altered. GSM-Plus has become a complementary robustness benchmark alongside GSM-Symbolic.

### Final-Answer-Only Evaluation

GSM8K evaluates only the correctness of the final numeric answer, disregarding the reasoning process. A model can arrive at the correct answer through flawed logic or lucky cancellation of errors and still receive full credit. This limitation means the benchmark does not reliably assess whether a model truly understands mathematical reasoning or merely pattern-matches toward correct answers. The MR-GSM8K benchmark (Zhu et al., 2023)[^12] was created specifically to address this issue by requiring models to reason about reasoning steps themselves.

### Label Noise

Approximately 5% of the original GSM8K test set contained errors, including ambiguous problem statements, logical inconsistencies, and mislabeled answers. This label noise means that even a perfect reasoner would be unable to score 100% on the original test set, and it introduces noise into model comparisons near the top of the leaderboard.

## GSM8K-Platinum

To address the label noise problem, researchers at MIT's Madry Lab created **GSM8K-Platinum**, a cleaned version of the GSM8K test set. The work was led by Edward Vendrow, Joshua Vendrow, Aleksander Madry, and Sara Beery, and was released on March 6, 2025.[^8]

### How was GSM8K-Platinum cleaned?

The team ran multiple frontier LLMs on the GSM8K test set and flagged every question where any model's answer disagreed with the stated ground truth. According to the authors, "we then manually inspected the 219 flagged questions, of which 110 were removed, 99 were verified, and 10 had mislabeled answers that were corrected."[^8] This process is summarized below:

| Action | Count |
|---|---|
| Questions removed (ambiguous or logically inconsistent) | 110 |
| Questions verified as correct | 99 |
| Questions with corrected answers | 10 |

No modifications were made to the question wording itself; only removals and answer corrections were applied. The resulting GSM8K-Platinum test set serves as a drop-in replacement for the original GSM8K test set and is available on Hugging Face (`madrylab/gsm8k-platinum`).

### Revealed Performance Differences

GSM8K-Platinum exposed meaningful performance gaps between models that appeared identical on the original benchmark. For example:

| Model | Errors on GSM8K | Errors on GSM8K-Platinum |
|---|---|---|
| Claude 3.7 Sonnet (extended thinking) | 45 | 2 |
| Llama 405B | 45 | 17 |

Both models made the same number of errors (45) on the original GSM8K test set. However, on the cleaned GSM8K-Platinum, [Claude 3.7 Sonnet](/wiki/claude_3_7_sonnet) with extended thinking made only 2 genuine errors compared to [Llama 405B](/wiki/llama_3_1)'s 17 errors.[^8] This demonstrated that the original benchmark's label noise had been masking real differences in model capability; as the Madry Lab put it, "this performance difference was obscured in the original benchmark due to noise."[^8] The lab noted that Claude 3.7 Sonnet with extended thinking was released almost a year after Llama 405B and significantly outperforms it on other math benchmarks, yet this advantage was completely obscured in the original dataset. The finding supports the argument that the apparent "plateauing" of frontier models at around 95% accuracy on GSM8K was "in large part caused by label noise" rather than a genuine ceiling in model performance.[^8]
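
Converting the error counts above into accuracies makes the gap concrete. The sketch assumes the GSM8K-Platinum test set contains 1,319 − 110 = 1,209 questions, which follows from the removal count reported earlier; the percentages are derived here and are not quoted from the Madry Lab.

```python
ORIGINAL_SIZE = 1319
PLATINUM_SIZE = ORIGINAL_SIZE - 110  # assumed size after the 110 removals

def accuracy(errors, total):
    """Accuracy in percent given an error count and a test-set size."""
    return 100 * (1 - errors / total)

rows = {
    "Claude 3.7 Sonnet (extended thinking)": (45, 2),
    "Llama 405B": (45, 17),
}
for model, (errors_original, errors_platinum) in rows.items():
    print(
        model,
        round(accuracy(errors_original, ORIGINAL_SIZE), 1),
        round(accuracy(errors_platinum, PLATINUM_SIZE), 1),
    )
# Claude 3.7 Sonnet (extended thinking) 96.6 99.8
# Llama 405B 96.6 98.6
```

On the original test set the two models are indistinguishable at about 96.6%, while on the cleaned set they separate by more than a percentage point.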

## Related Benchmarks

GSM8K occupies one position in a broader ecosystem of mathematical reasoning benchmarks. The following table compares several widely used alternatives.

| Benchmark | Introduced | Problems | Difficulty Level | Description |
|---|---|---|---|---|
| GSM8K | 2021 | 8,792 | Grade school | Multi-step arithmetic word problems requiring 2-8 steps |
| [MATH](/wiki/math_benchmark) | 2021 | 12,500 | Competition (high school) | Seven subjects including algebra, number theory, and geometry; competition-level difficulty |
| GSM-Hard | 2022 | 1,319 | Grade school (larger numbers) | GSM8K with numbers replaced by larger values; introduced in the PAL paper |
| [MGSM](/wiki/mgsm) | 2022 | 250 per language | Grade school (multilingual) | 250 GSM8K problems translated into 10 languages: Bengali, Chinese, French, German, Japanese, Russian, Spanish, Swahili, Telugu, and Thai |
| [MathVista](/wiki/mathvista) | 2023 | 6,141 | Mixed | Evaluates mathematical reasoning in visual contexts across five task types |
| MR-GSM8K | 2023 | 1,319 | Grade school (meta-reasoning) | Requires models to reason about reasoning steps, not just solve problems |
| GSM-Plus | 2024 | 10,552 | Grade school (perturbed) | Each GSM8K test problem expanded into 8 adversarial variants across 5 perturbation types |
| GSM1K | 2024 | 1,000 | Grade school | Contamination-free mirror of GSM8K created with manual annotations for overfitting detection |
| GSM-Symbolic | 2024 | Variable (templated) | Grade school (parameterized) | Symbolic templates with variable numbers and entities testing robustness |
| GSM-NoOp | 2024 | Variable (templated) | Grade school (distractor) | GSM-Symbolic problems plus one irrelevant clause; designed to expose pattern-matching |
| [AIME](/wiki/aime) | Adapted 2024 | ~30/year | Olympiad (high school) | Problems from the American Invitational Mathematics Examination |
| [FrontierMath](/wiki/frontiermath) | 2024 | Hundreds | Research-level | Original problems spanning most major branches of modern mathematics; created by Epoch AI with 60+ expert mathematicians |
| GSM8K-Platinum | 2025 | ~1,209 | Grade school | Cleaned version of GSM8K test set with label noise removed |

### How does GSM8K differ from the MATH benchmark?

The [MATH benchmark](/wiki/math_benchmark) (Hendrycks et al., 2021)[^9] contains 12,500 problems drawn from American math competitions including the AMC and AIME. These problems require advanced skills in algebra, geometry, number theory, counting, and probability. While GSM8K problems can be solved with basic arithmetic, MATH problems demand creative problem-solving techniques. Both benchmarks were released in 2021, and they are frequently reported together as complementary measures of mathematical reasoning at different difficulty levels. A 500-problem subset (MATH-500) has become a standard frontier evaluation.

### AIME

The American Invitational Mathematics Examination ([AIME](/wiki/aime)) has emerged as a preferred benchmark for evaluating frontier reasoning models on difficult mathematics. OpenAI reported that its [o3](/wiki/o3) model scored 96.7% on [AIME 2024](/wiki/aime_2024) when using selection methods, demonstrating capabilities well beyond what GSM8K can measure. The shift from GSM8K to AIME-level benchmarks reflects the broader trend of models outgrowing elementary math evaluations.

### MGSM

[Multilingual Grade School Math (MGSM)](/wiki/mgsm) translates 250 GSM8K problems into 10 typologically diverse languages. MGSM evaluates whether mathematical reasoning capabilities transfer across languages or remain primarily English-centric, revealing significant capability gaps in lower-resource languages for many models.

## Significance and Legacy

GSM8K's influence on the field of AI research extends well beyond its role as a leaderboard.

### Advancing Chain-of-Thought Research

GSM8K provided the testing ground for some of the most important prompting innovations of the 2020s. Chain-of-thought prompting, self-consistency, program-aided reasoning, [process reward models](/wiki/process_reward_model), and various verification strategies were all evaluated primarily on GSM8K. The benchmark's moderate difficulty made it ideal for this purpose: it was hard enough to show meaningful differences between methods but not so hard that most approaches scored near zero.

### Verification and Process Supervision

The verifier approach introduced in the original GSM8K paper laid the groundwork for later research on [process reward models](/wiki/process_reward_model) (PRMs) and outcome reward models (ORMs). OpenAI's subsequent work on process supervision ("Let's Verify Step by Step," Lightman et al., 2023)[^18] trains models to evaluate the correctness of each individual reasoning step rather than only the final answer, and it drew directly on the verification framework developed for GSM8K. This line of research has become central to how [reasoning models](/wiki/reasoning_models) like [OpenAI o1](/wiki/o1), [o3](/wiki/o3), and [DeepSeek-R1](/wiki/deepseek_r1) are trained.

### Test-Time Compute Scaling

The original paper's finding that generating multiple candidate solutions and selecting the best one ([test-time compute](/wiki/test_time_compute) scaling) could substitute for massive increases in model size foreshadowed a broader trend in AI research. The concept of spending more computation at inference time, rather than solely at training time, has become a key design principle behind reasoning-focused models. OpenAI's [o-series models](/wiki/openai_o-series) represent the most prominent application of this principle, using extended internal chain-of-thought reasoning at inference time to improve accuracy on challenging problems.
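The idea of sampling several candidate solutions and keeping the one a verifier ranks highest can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `best_of_n`, `toy_generate`, and `toy_score` are hypothetical names, the generator is a random stand-in for a language model, and the scorer cheats by checking a known answer, which a trained verifier cannot do.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    generate: Callable[[str, random.Random], str],
    score: Callable[[str, str], float],
    problem: str,
    n: int,
    seed: int = 0,
) -> Tuple[str, float]:
    """Sample n candidate solutions and return the highest-scoring one."""
    rng = random.Random(seed)
    candidates: List[str] = [generate(problem, rng) for _ in range(n)]
    scored = [(score(problem, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def toy_generate(problem: str, rng: random.Random) -> str:
    # Hypothetical stand-in for sampling a model: usually wrong, sometimes right.
    return "17" if rng.random() < 0.3 else str(rng.randint(10, 25))


def toy_score(problem: str, candidate: str) -> float:
    # Hypothetical stand-in for a trained verifier's correctness estimate.
    return 1.0 if candidate == "17" else 0.0


if __name__ == "__main__":
    for n in (1, 4, 32):
        answer, confidence = best_of_n(toy_generate, toy_score, "3 * 4 + 5 = ?", n)
        print(f"n={n}: picked {answer} (score {confidence})")
```

Raising `n` spends more inference compute per problem and makes it more likely that at least one sampled candidate is correct; whether that candidate is actually selected depends on the quality of the scorer.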

### Standardization of Evaluation

GSM8K helped establish the convention of reporting benchmark scores in model release announcements and technical reports. Alongside [MMLU](/wiki/mmlu) and [HumanEval](/wiki/humaneval), GSM8K became part of the standard trio of benchmarks that virtually every major language model was evaluated on from 2022 through 2024. See [LLM Benchmarks Timeline](/wiki/llm_benchmarks_timeline) for the broader context.

### Lessons About Benchmark Design

The issues that emerged with GSM8K, including data contamination, label noise, and saturation, have provided valuable lessons for the design of future benchmarks. GSM1K, GSM-Plus, GSM-Symbolic, GSM-NoOp, and GSM8K-Platinum were all created in direct response to limitations identified in the original benchmark. These efforts have pushed the community toward practices such as using held-out problem sets, template-based problem generation, systematic error auditing, adversarial perturbations, and designing benchmarks with higher difficulty ceilings.
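As a rough illustration of template-based generation with an optional irrelevant clause, the sketch below fills a fixed word-problem template with random names and numbers and computes the answer from the same values. The template, the names, the number ranges, and the `make_variant` helper are all invented for this example and are not taken from GSM-Symbolic or GSM-NoOp.

```python
import random
from typing import Tuple

# Hypothetical template pieces, written only for this illustration.
SETUP = "{name} has {a} apples and buys {b} bags with {c} apples in each bag."
DISTRACTOR = "A few of the apples are slightly smaller than average."
QUESTION = "How many apples does {name} have now?"


def make_variant(rng: random.Random, add_distractor: bool = False) -> Tuple[str, int]:
    """Instantiate the template with fresh values and return (question, answer)."""
    name = rng.choice(["Ava", "Liam", "Noor", "Kenji"])
    a, b, c = rng.randint(2, 20), rng.randint(2, 9), rng.randint(2, 12)
    parts = [SETUP.format(name=name, a=a, b=b, c=c)]
    if add_distractor:
        # The extra clause does not change the arithmetic.
        parts.append(DISTRACTOR)
    parts.append(QUESTION.format(name=name))
    return " ".join(parts), a + b * c


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        question, answer = make_variant(rng, add_distractor=True)
        print(question, "->", answer)
```

Because the answer is derived from the sampled values, every variant is labeled correctly by construction, and a model's accuracy can be compared across many instantiations of the same underlying problem.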

## Accessing the Dataset

GSM8K is freely available under the MIT License. The primary distribution channels are:

- **GitHub:** [github.com/openai/grade-school-math](https://github.com/openai/grade-school-math) (archived; no further updates expected)
- **Hugging Face:** [huggingface.co/datasets/openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k)

The repository includes the training and test data in both the main and Socratic configurations, along with example model solutions from 6B and 175B parameter models and reference code for answer extraction and evaluation. The total dataset size is approximately 5.89 MB.
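Reference solutions in the dataset end with a line of the form `#### <final answer>`, which is what makes automatic scoring straightforward. The following is a minimal sketch of answer extraction and exact-match scoring; it is not the repository's reference code, the function names are hypothetical, and it assumes the model has been prompted to finish its output with the same `####` marker. The sample solution string is only an illustration of the answer format.

```python
import re
from typing import Optional

# Matches the final-answer marker, allowing a sign, thousands separators,
# and an optional decimal part.
FINAL_ANSWER_RE = re.compile(r"####\s*(-?[0-9][0-9,]*(?:\.[0-9]+)?)")


def extract_final_answer(solution: str) -> Optional[str]:
    """Return the number after '####' with commas removed, or None if absent."""
    match = FINAL_ANSWER_RE.search(solution)
    if match is None:
        return None
    return match.group(1).replace(",", "")


def is_correct(model_output: str, reference_solution: str) -> bool:
    """Exact-match comparison of the extracted final answers."""
    predicted = extract_final_answer(model_output)
    expected = extract_final_answer(reference_solution)
    return predicted is not None and predicted == expected


if __name__ == "__main__":
    reference = (
        "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
        "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
        "#### 72"
    )
    print(extract_final_answer(reference))
    print(is_correct("48 / 2 = 24, and 48 + 24 = 72.\n#### 72", reference))
```

With the Hugging Face `datasets` library installed, the splits themselves can be loaded with `load_dataset("openai/gsm8k", "main")`, using `"socratic"` in place of `"main"` for the Socratic configuration.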

## See Also

- [MMLU-ProX](/wiki/mmlu_prox)
- [Chain-of-Thought Prompting](/wiki/chain_of_thought)
- [MATH (benchmark)](/wiki/math_benchmark)
- [MMLU](/wiki/mmlu)
- [HumanEval](/wiki/humaneval)
- [Large Language Model](/wiki/large_language_model)
- [Test-time compute](/wiki/test_time_compute)
- [AIME](/wiki/aime)
- [FrontierMath](/wiki/frontiermath)
- [Reasoning models](/wiki/reasoning_models)
- [Process reward model](/wiki/process_reward_model)
- [Benchmarks](/wiki/benchmarks)

## References

[^1]: Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., & Schulman, J. (2021). "Training Verifiers to Solve Math Word Problems." arXiv:2110.14168. https://arxiv.org/abs/2110.14168

[^2]: Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS 2022. arXiv:2201.11903. https://arxiv.org/abs/2201.11903

[^3]: Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2023). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." ICLR 2023. arXiv:2203.11171. https://arxiv.org/abs/2203.11171

[^4]: Lewkowycz, A., Andreassen, A., Dohan, D., et al. (2022). "Solving Quantitative Reasoning Problems with Language Models." (Minerva) NeurIPS 2022. arXiv:2206.14858. https://arxiv.org/abs/2206.14858

[^5]: OpenAI. (2023). "GPT-4 Technical Report." arXiv:2303.08774. https://arxiv.org/abs/2303.08774

[^6]: Zhang, H., Da, J., et al. (2024). "A Careful Examination of Large Language Model Performance on Grade School Arithmetic." NeurIPS 2024 Datasets and Benchmarks Track (Spotlight). arXiv:2405.00332. https://arxiv.org/abs/2405.00332

[^7]: Mirzadeh, I., Alizadeh, K., Shahrokhi, H., Tuzel, O., Bengio, S., & Farajtabar, M. (2024). "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models." ICLR 2025. arXiv:2410.05229. https://arxiv.org/abs/2410.05229

[^8]: Vendrow, E., Vendrow, J., Madry, A., & Beery, S. (2025). "GSM8K-Platinum: Revealing Performance Gaps in Frontier LLMs." MIT Madry Lab. https://gradientscience.org/gsm8k-platinum/

[^9]: Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., & Steinhardt, J. (2021). "Measuring Mathematical Problem Solving With the MATH Dataset." NeurIPS 2021. arXiv:2103.03874. https://arxiv.org/abs/2103.03874

[^10]: Chowdhery, A., et al. (2022). "PaLM: Scaling Language Modeling with Pathways." arXiv:2204.02311. https://arxiv.org/abs/2204.02311

[^11]: OpenAI. (2024). "Learning to Reason with LLMs." https://openai.com/index/learning-to-reason-with-llms/

[^12]: Zeng, Z., et al. (2023). "MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation." arXiv:2312.17080. https://arxiv.org/abs/2312.17080

[^13]: Surge AI. "How Surge AI Built OpenAI's GSM8K Dataset of 8,500 Math Problems." https://surgehq.ai/blog/how-we-built-it-openais-gsm8k-dataset-of-8500-math-problems

[^14]: Gao, L., Madaan, A., Zhou, S., Alon, U., Liu, P., Yang, Y., Callan, J., & Neubig, G. (2022). "PAL: Program-aided Language Models." ICML 2023. arXiv:2211.10435. https://arxiv.org/abs/2211.10435

[^15]: Chen, W., Ma, X., Wang, X., & Cohen, W. W. (2022). "Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks." TMLR 2023. arXiv:2211.12588. https://arxiv.org/abs/2211.12588

[^16]: Li, Q., Cui, L., Zhao, X., Kong, L., & Bi, W. (2024). "GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers." ACL 2024. arXiv:2402.19255. https://arxiv.org/abs/2402.19255

[^17]: DeepSeek-AI. (2025). "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948. https://arxiv.org/abs/2501.12948

[^18]: Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., & Cobbe, K. (2023). "Let's Verify Step by Step." arXiv:2305.20050. https://arxiv.org/abs/2305.20050