NuminaMath
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,992 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,992 words
Add missing citations, update stale details, or suggest a clearer explanation.
NuminaMath is a family of open mathematical reasoning resources developed by Project Numina, a non profit founded in late 2023 to advance the role of artificial intelligence in mathematics. The project comprises a large publicly released competition mathematics corpus (the NuminaMath dataset), associated language models fine tuned from DeepSeekMath base weights (NuminaMath-7B-CoT, NuminaMath-7B-TIR, and 72B variants), and successive dataset revisions (NuminaMath 1.5 and the related NuminaMath-LEAN corpus). NuminaMath gained wide visibility in July 2024 when the NuminaMath-7B-TIR model, jointly developed by Project Numina and Hugging Face, won the first AI Mathematical Olympiad (AIMO) Progress Prize on Kaggle with a score of 29 out of 50 on the private leaderboard, scoring identically on both the public and private splits.[^1][^2]
The defining characteristic of the NuminaMath effort is that every layer of the stack was released openly under the Apache License 2.0: roughly 860,000 competition style problem solution pairs in the original dataset, the training and inference code on GitHub, the fine tuned 7B model weights on Hugging Face, and a technical report detailing the data construction pipeline.[^3][^4][^5] This fully open posture distinguished NuminaMath from the proprietary math reasoning datasets used internally by frontier laboratories such as DeepSeek, and it made the project a foundational resource for subsequent open source work on mathematical reasoning, tool integrated reasoning, and formal proof in Lean 4.
Project Numina (also referred to in some materials simply as Numina) was founded in late 2023 by Jia Li, Yann Fleureau, Guillaume Lample, Stanislas (Stan) Polu, and Hélène Evain.[^1][^6] The founders include researchers and practitioners associated with elite mathematics competitions and with frontier AI laboratories; Guillaume Lample and Stanislas Polu had previously worked on neural theorem proving and language model based mathematical reasoning, while Jia Li and Yann Fleureau drove the data engineering side. The stated mission of the organisation is to be "an open scientific collaboration fostering the development of human and artificial intelligence in the field of mathematics," and Numina operates as a non profit rather than a commercial entity.[^6]
Initial support came from Mistral AI in late 2023, with further backing from Hugging Face, Answer.AI, General Catalyst, and Beijing CMLR (the Center for Machine Learning Research at Peking University) in early 2024.[^1][^7] Hugging Face engineers Lewis Tunstall and Edward Beeching joined the effort in early 2024 in the run up to the AIMO competition, which became the catalyst for releasing the project's first model and dataset publicly.[^1] In December 2024 Project Numina received a three million euro research grant from XTX Markets, the algorithmic trading firm that also bankrolls the AIMO Prize fund. The grant earmarked funding for formalising up to 100,000 mathematical items in Lean 4 and, more ambitiously, working toward the public release of a database of up to one million formal mathematical problems and proofs.[^6][^8]
The original NuminaMath-CoT dataset was released on the Hugging Face Hub in July 2024 under the identifier AI-MO/NuminaMath-CoT. It contains 859,608 problem solution pairs (commonly summarised as "860K") split between an 859,000 row train partition and a 100 row test partition. Each example consists of a problem statement, a solution rewritten or generated in a unified Chain of Thought (CoT) format with the final numerical answer placed inside a \boxed{} LaTeX command, and a messages field encoding a two turn user assistant conversation suitable for instruction tuning.[^3][^5]
The dataset aggregates problems from nine source categories, which according to the dataset card break down as follows:[^3]
| Source | Samples |
|---|---|
| cn_k12 (Chinese K to 12 math exercises) | 276,591 |
| synthetic_math | 167,895 |
| orca_math | 153,334 |
| olympiads (international olympiad problems) | 150,581 |
| synthetic_amc | 62,111 |
| aops_forum (Art of Problem Solving forum) | 30,201 |
| math (MATH dataset subset) | 7,478 |
| gsm8k | 7,345 |
| amc_aime | 4,072 |
| Total | 859,608 |
The raw inputs to the pipeline were online exam paper PDFs and discussion forum threads in several languages, predominantly English and Chinese. The construction pipeline followed five stages: optical character recognition (OCR) to extract problems and solutions from PDF scans; segmentation into problem solution pairs; translation into English; realignment of free form solutions into a consistent Chain of Thought structure; and standardisation of the final answer in a \boxed{} wrapper.[^1][^3] Where original solutions were absent, malformed, or not amenable to consistent CoT formatting, GPT-4 (and later GPT-4o) was used to rewrite or synthesise candidate solutions that were then filtered against any available reference answers.[^5] The end result is a corpus in which every sample is in English, in a uniform CoT style, and in a single Parquet file format, with all data released under Apache License 2.0.[^3]
The authors credited on the dataset card are Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu.[^3] The corresponding technical report ("NuminaMath: The largest public dataset in AI4Maths with 860k pairs of competition math problems and solutions") is hosted alongside the training code in the project-numina/aimo-progress-prize GitHub repository.[^4]
A companion dataset, NuminaMath-TIR (AI-MO/NuminaMath-TIR), holds 72,540 problem solution pairs (sometimes rounded to "70K") in which the solutions interleave natural language rationales, executable Python programs, and the outputs of those programs. NuminaMath-TIR was constructed by selecting from NuminaMath-CoT those problems whose final answer is numerical (most often a non negative integer), and then using a GPT-4 driven pipeline to generate solutions in the format introduced by the ToRA (Tool integrated Reasoning Agent) paper from Microsoft Research. The pipeline executed the Python blocks at generation time and discarded any candidate whose executed final answer did not match the reference. To increase yield and consistency, the process was repeated three times per problem.[^1][^9]
The AI Mathematical Olympiad (AIMO) Prize is a USD 10 million prize fund established by XTX Markets in 2023 to incentivise progress on artificial mathematicians capable of competing at the International Mathematical Olympiad (IMO). The grand prize requires a publicly verifiable, open source system reaching gold medal performance at the IMO. To bridge the gap between current systems and the final prize, the AIMO programme runs intermediate Progress Prizes hosted on Kaggle, each with its own prize pool.[^10][^11]
The first AIMO Progress Prize ran on Kaggle in 2024. The competition drew 16,104 registrations across 1,161 teams from 81 countries, generating 1,831 submissions; 392 of the participants were first time Kaggle competitors, of whom 32 finished in the top 100.[^2][^12] The task involved solving 50 held out competition style mathematics problems, with strong submission constraints to encourage compact, openly reproducible systems: solutions had to run on Kaggle's two NVIDIA T4 GPU notebook within a tight time budget, and the top finishing teams were required to release their models, data, and code openly.[^1][^11]
Team Numina won the first AIMO Progress Prize with a model called NuminaMath-7B-TIR, scoring 29 out of 50 on the private test set (it also scored 29 out of 50 on the public split, which is unusual and reflects a robust and well calibrated solution).[^1][^2] The team's first prize award was USD 131,072 (the AIMO organisers awarded prize amounts in powers of two), drawn from the total Progress Prize pool of one million dollars.[^11][^12] The prize was presented at the International Mathematical Olympiad in Bath, United Kingdom on Saturday 20 July 2024, with the awards handed out by the Fields Medalist Terence Tao.[^2] The winning entry was a collaboration between Project Numina and Hugging Face; the Hugging Face engineering side of the partnership was led by Lewis Tunstall and Edward Beeching, with technical contributions from Albert Jiang, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Ziju Shen, Zihan Qin, and Project Numina founders Jia Li, Yann Fleureau, Guillaume Lample, and Stanislas Polu.[^1][^9]
NuminaMath-7B-CoT is the Stage 1 model in the two stage recipe. It is a full parameter fine tune of deepseek-ai/deepseek-math-7b-base on the NuminaMath-CoT dataset, released under the Apache License 2.0.[^13] The fine tune was performed with the TRL library's SFTTrainer, using a learning rate of 2.0e-5, a global batch size of 32 distributed across eight NVIDIA H100 GPUs on a single node, packing at 2,048 token block size, three epochs, a cosine learning rate schedule, gradient checkpointing, and DeepSpeed ZeRO 3 sharding.[^1][^13]
On the MATH benchmark, NuminaMath-7B-CoT reaches 56.3 percent zero shot greedy accuracy, materially above DeepSeekMath-7B-Instruct (46.8 percent) and DeepSeekMath-7B-RL (51.7 percent) measured under the same chain of thought protocol.[^1] The model is the open weights baseline for what a strong dataset alone, without tool use, can deliver on competition mathematics from a 7B base. The Project Numina collection on Hugging Face also contains a NuminaMath-72B-CoT variant fine tuned from a 73B parameter Qwen base for comparison, released in July 2024.[^14]
NuminaMath-7B-TIR is the Stage 2 model and the one that actually competed in the AIMO Progress Prize 1. It starts from deepseek-ai/deepseek-math-7b-base and was fine tuned with the same SFT recipe, this time on the NuminaMath-TIR dataset using a 1,024 token block size, a global batch size of 32, a learning rate of 2.0e-5 with a cosine schedule and 0.1 warmup ratio, and four epochs.[^9][^13] The full training run took roughly ten hours on eight NVIDIA H100 GPUs on a single node.[^1]
The model interleaves natural language reasoning and executable Python code in its output. At inference time the host program parses each generated Python block, executes it in an isolated sandbox, and appends the result (or the traceback) back into the prompt, prompting the model to continue. The cycle repeats until the model produces a final answer in a \boxed{} wrapper.[^1][^9] On standard benchmarks, the TIR model achieves 84.6 percent on GSM8K (zero shot), 68.1 percent on MATH (zero shot), 20 out of 40 on the 2023 American Mathematics Competitions (AMC) with majority voting over 64 samples, and 10 out of 30 on the 2024 American Invitational Mathematics Examination (AIME) with the same protocol.[^9] On the MATH benchmark specifically, the TIR variant comfortably outperforms DeepSeekMath-7B-Instruct (57.4 percent) and DeepSeekMath-7B-RL (58.8 percent) operating under tool integrated reasoning, demonstrating that the gains came from data quality, not from a stronger base model.[^1]
The AIMO winning solution combined a fine tuned base model, a high quality two stage training dataset, careful validation protocols, and an inference time procedure called Self Consistency with Tool Integrated Reasoning (SC-TIR).[^1][^4]
Two stage supervised fine tuning. The recipe trains a single base model in two passes. Stage 1 on NuminaMath-CoT teaches general competition math reasoning in natural language. Stage 2 on NuminaMath-TIR teaches the model to decompose a problem into a sequence of (rationale, python_program, program_output) triples and to terminate with a boxed final answer. The Stage 1 corpus is roughly an order of magnitude larger than the Stage 2 corpus, but Stage 2 is what unlocks the large jump in benchmark accuracy.[^1][^9]
SC-TIR inference. Because greedy decoding on a 7B model leaves significant accuracy on the table, the winning Kaggle notebook used a structured self consistency procedure. For each input problem the model was prompted N = 48 times in parallel with diverse sampling parameters; for each sample, generation halted at the end of a complete Python code block, the code was executed in a sandbox, and its output (or any error trace) was concatenated back into the prompt. This loop was repeated up to M = 4 times per sample, producing a batch of 48 by 4 candidate generations. Incomplete or unparseable outputs were pruned, and the final prediction was chosen by majority vote over the surviving candidates' boxed answers.[^1] To fit inside the Kaggle compute budget (two NVIDIA T4 GPUs without bfloat16 support and a tight wall clock), the model weights were post training quantised to 8 bit precision using AutoGPTQ. This yielded roughly a two times speed up on model load and inference with a negligible accuracy cost, and a separate AI-MO/NuminaMath-7B-TIR-GPTQ model card was published for the quantised variant.[^1]
Internal validation suite. To avoid overfitting to the public Kaggle leaderboard, the team built four bespoke validation sets, all curated from problems with integer or otherwise exactly verifiable answers: AMC 2022 and 2023 (83 problems), AIME 2022, 2023, and 2024 (90 problems), MATH Level 4 with integer outputs (754 problems), and MATH Level 5 with integer outputs (721 problems). Run to run variance on each set was tracked across five to ten seeds, giving the team a reliable signal of whether a change actually improved the model rather than just changed the noisy public leaderboard number.[^1]
Trajectory through the competition. Early in the contest, supervised fine tuning alone scored 8 out of 50 on the public leaderboard. Focusing on the MMOS (Mix of Minimal Optimal Sets) data raised the score to 16 out of 50 but plateaued because MMOS contained only single turn solutions. Adding Stage 2 TIR training delivered the next jump, and a brief experiment with Kahneman Tversky Optimization (KTO) raised the public leaderboard score to 27 out of 50. The team eventually dropped KTO and the explicit preference optimisation phase in favour of the simpler SC-TIR inference time procedure, which generalised better between the public and private splits. The final private leaderboard score was 29 out of 50.[^1]
What did not make it into the final solution. The Hugging Face blog post is unusually candid about negative results. Experiments tried and rejected include REINFORCE Leave One Out (RLOO) and PPO style reinforcement learning, model merging with DARE, TIES, and WARP, larger base models (InternLM 20B, CodeLlama 33B, Mixtral 8x7B), and infrastructure changes such as static KV caches and torch.compile. None reliably improved either the internal validation suite or the private leaderboard, reinforcing the lesson that gains were predominantly driven by data quality and a robust inference time procedure.[^1]
After the AIMO Progress Prize 1 win, Project Numina continued to iterate on both the dataset and the formal proof direction. The Hugging Face NuminaMath collection records the following timeline: NuminaMath-7B-CoT and NuminaMath-72B-CoT models on 19 July 2024; the NuminaMath-7B-TIR model on 14 August 2024 (alongside the quantised GPTQ variant); the NuminaMath-CoT and NuminaMath-TIR datasets refreshed on 25 November 2024; and the NuminaMath 1.5 dataset released on 29 January 2025.[^14]
NuminaMath 1.5 (AI-MO/NuminaMath-1.5) is positioned as the second iteration of the public dataset and contains 896,215 problems, again under Apache License 2.0.[^15] Compared with the original 860K release, the principal changes are:[^15]
answer field (the final numerical answer, the literal string proof if the problem is a proof problem, or the literal notfound if no reliable answer could be extracted), a problem_type field with one of nine subject labels (Algebra, Geometry, Number Theory, Combinatorics, Calculus, Inequalities, Logic and Puzzles, Other), and a question_type field distinguishing proofs from multiple choice problems from word problems with a numerical answer.olympiads partition was rebuilt from manual parsing and verification against official competition websites, replacing the original generic regex and language model based extraction; a new olympiads_ref subset of 3,638 problems collects reference materials from IMO, IMO shortlist, USAMO, and similar contests.cn_contest (29,944 Chinese contest problems), inequalities (7,314 problems), and number_theory (4,043 problems).synthetic_amc partition was dropped because ablation studies showed that including it degraded downstream model performance.A further release, NuminaMath-LEAN (AI-MO/NuminaMath-LEAN), is a corpus of roughly 100,000 competition problems formalised in lean 4 and compiled against mathlib version 4.15.0. It is derived from a difficult subset of NuminaMath 1.5 with explicit emphasis on IMO, USAMO, and similar olympiad problems, was produced in collaboration with the Kimi team, and underpins Kimina-Prover, a 72B parameter Lean 4 theorem proving model.[^16] NuminaMath-LEAN is the concrete deliverable of the XTX Markets grant announced in December 2024.[^6]
The Hugging Face collection does not list a NuminaMath 2 dataset; the public lineage runs from NuminaMath-CoT (the original "860K") through NuminaMath 1.5 (the 896K refresh) and into the specialised NuminaMath-LEAN corpus for formal proof, rather than a numbered 2.0 release. Project Numina has continued to be active in the second AIMO Progress Prize cycle, which closed in April 2025 with a prize pool of USD 262,144 for the winning team. The winner was NemoSkills (NVIDIA), with imagination-research (Tsinghua and Microsoft Research) second; the infrastructure developed around the NuminaMath line of work was a common reference for participating teams.[^17][^18]
NuminaMath sits within a small but rapidly growing ecosystem of mathematical reasoning corpora. The comparisons most often drawn in the technical literature and on the Hugging Face dataset cards are with MATH, GSM8K, MMOS, and OpenMathInstruct.[^1][^3][^19]
MATH. The MATH dataset, introduced by Hendrycks and colleagues in 2021, contains 12,500 competition style problems with full LaTeX solutions, organised by subject (algebra, counting and probability, geometry, intermediate algebra, number theory, prealgebra, precalculus) and by difficulty (Levels 1 to 5). MATH is used as the de facto evaluation benchmark for competition math in modern LLMs, and a 500 problem subset of MATH called MATH-500 is commonly used in faster evaluation harnesses. NuminaMath should not be confused with MATH: the two are different objects, the MATH dataset is used as one of nine source partitions inside NuminaMath-CoT, contributing 7,478 of the 860K problems.[^3]
GSM8K. GSM8K is a 8,500 problem dataset of grade school arithmetic word problems with chain of thought solutions, again from Hendrycks and collaborators. It is much smaller, much easier, and qualitatively different from NuminaMath, but it is also incorporated as a partition (7,345 problems) into NuminaMath-CoT so that downstream models retain strong performance on basic word problems.[^3]
MMOS. Mix of Minimal Optimal Sets is a smaller, single turn instruction tuning dataset used internally by competing teams in the AIMO Progress Prize 1. The NuminaMath team explicitly compared their approach against an MMOS only baseline during the competition; MMOS alone could only reach 16 out of 50 on the public leaderboard, and the move to NuminaMath-CoT plus NuminaMath-TIR was the decisive change.[^1]
OpenMathInstruct. NVIDIA's OpenMathInstruct line (OpenMathInstruct-1 in 2024 and OpenMathInstruct-2 in October 2024, the latter containing 14 million pairs generated with Llama 3.1 405B Instruct) is the closest direct competitor at scale. OpenMathInstruct-2 is roughly 16 times larger than NuminaMath-CoT and emphasises synthetic generation from a strong frontier model, whereas NuminaMath is grounded in human authored competition problems. The two corpora have complementary strengths: OpenMathInstruct-2 offers more variety and coverage of generic word problems, while NuminaMath offers higher density of olympiad style content and a fully transparent provenance chain back to human sources.[^19]
Internal datasets at frontier labs. A recurring observation in the NuminaMath technical report is that prior to NuminaMath there was no truly large competition mathematics dataset in the public domain. DeepSeek's DeepSeekMath model was trained on an internally curated corpus that has never been released, and similar internal datasets are widely assumed to exist at OpenAI, Anthropic, and Google DeepMind. NuminaMath was explicitly designed to close that gap for the open source community, and it has subsequently been used as a training source by hundreds of downstream models on the Hugging Face Hub.[^3][^4]
NuminaMath has had a disproportionate influence on the open source mathematical reasoning ecosystem. Several aspects of that influence are worth noting.
First, as a direct training source, NuminaMath-CoT has been used to fine tune more than 550 published models on the Hugging Face Hub, with NuminaMath-TIR feeding more than 800 additional models. The dataset cards record more than 57,000 downloads per month for the CoT dataset alone, making it one of the most actively used open mathematics corpora.[^3][^9]
Second, NuminaMath catalysed the template that subsequent AIMO Progress Prize entries followed: a strong open math base model fine tuned in two stages, combined with self consistency over sampled tool integrated reasoning traces at inference time. The winners of the second Progress Prize (NemoSkills) and the second placed team (imagination-research) both built on architectures that drew openly on the NuminaMath blueprint, and the Hugging Face winning post served as a de facto reference implementation for the community.[^17]
Third, NuminaMath set a community norm for full disclosure. The AIMO Prize foundation requires Progress Prize winners to publish their models openly, but the depth of NuminaMath's release, including hyperparameters, validation sets, ablation studies, and negative results, exceeded the minimum requirement and raised the bar for what an "open" winning entry looks like.[^1][^4]
Fourth, NuminaMath provided the substrate for the formal proof transition in the open community. The NuminaMath-LEAN corpus is among the largest collections of human annotated formal statements and proofs aligned with mathlib, and it underpins both Project Numina's own work and external systems such as Kimina-Prover. This dovetails with parallel open source efforts in automated theorem proving such as DeepSeek-Prover, goedel prover, and Hugging Face's open Lean 4 datasets, and it connects competition mathematics work to the broader programme of formalising mathematical knowledge in interactive theorem provers.[^16]
Finally, NuminaMath occupies a particular intellectual position in the debate about mathematical reasoning in AI. On one side stand large frontier laboratories with closed but extraordinarily capable systems: Google DeepMind's alphaproof and alphageometry / alphageometry 2 reached silver medal performance on the 2024 IMO, but did so with internal data and reinforcement learning infrastructure that is not publicly available. On the other side stand fully open systems trained on community curated data such as NuminaMath, which by 2025 had closed much of the gap on standard benchmarks while remaining fully reproducible. The continuing growth of NuminaMath, particularly its move into Lean 4 formalisation, is one reason that open source mathematical reasoning continued to be a credible competitor to closed systems through the 2024 to 2026 period.[^1][^16][^17]