| FrontierMath | |
|---|---|
| **Overview** | |
| Full name | FrontierMath |
| Abbreviation | |
| Description | Research-level mathematical reasoning benchmark with automatically verifiable answers |
| Release date | November 8, 2024 |
| Latest version | 1.1.4 (Tier 4, 2026) |
| Benchmark updated | Quarterly problem additions; ongoing grader corrections |
| Authors | Elliot Glazer, Tamay Besiroglu, Ege Erdil, and over 60 contributing mathematicians |
| Organization | Epoch AI |
| **Technical Details** | |
| Type | Mathematical reasoning benchmark |
| Modality | Text, with an interactive Python environment |
| Task format | Problems whose final answers are checked automatically (Python/SymPy objects) |
| Number of tasks | 300 (Tiers 1-3), 50 (Tier 4), 14 (Open Problems) |
| Total examples | 364 (all components combined) |
| Evaluation metric | Percentage of problems solved (automated verification) |
| Domains | Most major branches of modern mathematics (MSC2020 classification) |
| Languages | English |
| **Performance** | |
| Human performance | Hours to days of expert effort per problem |
| Baseline | <2% (all frontier models, November 2024) |
| SOTA score | ~50% (Tiers 1-3); ~38% (Tier 4) |
| SOTA model | GPT-5.4 Pro (OpenAI) |
| SOTA date | March 2026 |
| Saturated | No |
| **Resources** | |
| Website | epoch.ai/frontiermath |
| Paper | |
| Dataset | Mostly private; sample problems public |
| License | |
FrontierMath is an advanced mathematical reasoning benchmark created by Epoch AI in collaboration with over 60 expert mathematicians, including Fields Medalists Terence Tao, Timothy Gowers, and Richard Borcherds. First published on November 8, 2024, FrontierMath consists of hundreds of original, research-level mathematics problems designed to test the outer limits of artificial intelligence systems' mathematical capabilities. At launch, every frontier AI model scored below 2% on the benchmark. By March 2026, the best-performing model, OpenAI's GPT-5.4 Pro, solved roughly 50% of Tier 1-3 problems, a more than twentyfold improvement in under two years that still leaves half the benchmark unsolved[1].
The project also includes FrontierMath: Open Problems, a pilot collection of 14 genuinely unsolved mathematical problems whose solutions, if found, would advance the state of human mathematical knowledge[2].
By 2024, the most widely used mathematical benchmarks for AI had become saturated. Models routinely scored above 95% on GSM8K (grade-school math), above 90% on the MATH dataset (competition-level problems), and 70-90% on AIME-style questions[3]. These high scores made it difficult to distinguish between models or to measure genuine progress in mathematical reasoning.
Epoch AI, a nonprofit research organization focused on tracking AI progress, set out to build a benchmark that would remain challenging for years. The core idea was straightforward: recruit active research mathematicians to write problems drawn from their own fields, problems that require hours or days of expert effort and whose answers can be checked automatically by a computer program.
Elliot Glazer, the project's lead mathematician, holds a Ph.D. in mathematics from Harvard, where he studied set theory under Hugh Woodin. He was joined by Tamay Besiroglu, Epoch AI's associate director, and Ege Erdil as the three core contributors. The broader team eventually grew to include over 60 mathematicians from institutions such as MIT, Harvard, Princeton, Stanford, Cambridge, Oxford, the University of Leicester, King's College London, Cornell, UC Berkeley, and Bristol University, among others. Fourteen IMO gold medalists and three Fields Medal recipients participated in problem creation or review[4].
FrontierMath has expanded since its initial release into three distinct components, each targeting a different level of mathematical difficulty.
The original base set contains 300 problems spanning difficulty from advanced undergraduate to early postdoctoral level. This set forms the core benchmark used in most published evaluations. Problems are classified using the Mathematics Subject Classification (MSC2020) system and cover virtually every major branch of modern mathematics[4].
Released on July 1, 2025, Tier 4 adds 50 exceptionally difficult research-level problems to the benchmark. These problems were largely designed or refined during a symposium attended by leading mathematicians, where each problem was tested and approved by a panel of experts. Of the 50 Tier 4 problems, 2 are public and 48 are private. Even the strongest AI systems as of mid-2025, including OpenAI's o4-mini, Anthropic's Claude Opus 4, and Google's Gemini 2.5 Pro, achieve only single-digit success rates on Tier 4[5].
On January 27, 2026, Epoch AI launched FrontierMath: Open Problems, a pilot benchmark of 14 genuinely unsolved mathematical research problems. Unlike the main benchmark, where each problem has a known solution created by an expert, these are problems that professional mathematicians have attempted and failed to solve. The pilot set tilts toward combinatorics and number theory, the areas that yielded the most problems amenable to automatic verification[2].
Each open problem includes a difficulty estimate from its contributor. Estimated solving times range from one to four weeks at the low end to three to ten years at the high end. The number of serious human attempts per problem ranges from two or three mathematicians to over fifty. Significance ratings span from "moderately interesting results" to "major breakthroughs"[2].
Two problems added to the benchmark in February 2026 illustrate the scope: finding a Hadamard matrix of order 668 (the smallest order for which none is known) and proving that certain "small" Diophantine equations have infinitely many solutions[6].
Every FrontierMath problem must satisfy four requirements before it enters the benchmark[4]:
| Requirement | Description | Purpose |
|---|---|---|
| Originality | Problems build on existing ideas in novel, non-obvious ways through clever adaptations or innovative combinations | Prevents data contamination from training sets |
| Automated verifiability | Solutions must be computable and expressible as Python objects or SymPy structures (integers, symbolic expressions, matrices, sets) | Allows scalable, objective evaluation |
| Guessproofness | Less than 1% probability of arriving at the correct answer without performing the required mathematical work | Ensures models cannot succeed through random guessing or superficial heuristics |
| Computational tractability | Solution verification scripts must run in under one minute on standard hardware | Keeps evaluation practical |
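As an illustration of the automated-verifiability requirement in the table above, the sketch below shows the kinds of answer objects the benchmark accepts as final answers. The specific values are illustrative only; the integer is the published answer to one of the public sample problems, and the other objects are invented examples.

```python
# Illustrative answer objects of the kinds the benchmark accepts
# (exact integers, SymPy expressions, and concrete mathematical structures).
import sympy as sp

answer_integer = 3677073                                 # large exact integer (public sample answer)
answer_symbolic = sp.Rational(1, 2) * (sp.sqrt(5) - 1)   # exact symbolic expression
answer_matrix = sp.Matrix([[0, -1], [1, 0]])              # a concrete matrix object

# Each of these can be serialized and compared programmatically,
# which is what makes scalable automated grading possible.
print(answer_integer, answer_symbolic, answer_matrix)
```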
Each problem is rated along three dimensions by its creator and at least one peer reviewer[4]:
| Dimension | Scale | Description |
|---|---|---|
| Background knowledge | 1-5 | 1 = high school level; 2 = early undergraduate; 3 = late undergraduate; 4 = graduate; 5 = research level |
| Creativity | Hours (unbounded) | Time an expert in the relevant field would need to identify the key solution ideas |
| Execution | Hours (unbounded) | Time to compute the final answer once the key ideas are identified, including writing any necessary code |
The authors note that these ratings provide rough guidance rather than definitive claims, since problems can become easier once a specific technique is known, and multiple solution paths of varying difficulty may exist[4].
The benchmark spans most major branches of modern mathematics. The distribution of problems by MSC2020 primary classification is as follows[4]:
| MSC Code | Field | Share of problems | Involvement in multi-domain problems |
|---|---|---|---|
| 11 | Number theory | 17.8% | 44% of all problems involve number theory |
| 05 | Combinatorics | 15.8% | 39% of all problems involve combinatorics |
| 20 | Group theory | 8.9% | 22% of all problems involve group theory |
| 60 | Probability theory | 5.1% | - |
| 15 | Linear algebra | 4.8% | - |
| 14 | Algebraic geometry | 4.8% | - |
| 33 | Special functions | 4.8% | - |
| 55 | Algebraic topology | 3.1% | - |
| 12 | Field theory | 2.4% | - |
| 30 | Complex analysis | 2.4% | - |
| 68 | Computer science | 2.4% | - |
| 18 | Category theory | 2.4% | - |
| 57 | Manifolds and cell complexes | 2.1% | - |
| 13 | Commutative algebra | 2.1% | - |
| Other | 17 additional fields | 21.1% | Includes PDEs, differential geometry, harmonic analysis, statistical mechanics, and more |
Notably, 13% of problems combine number theory and combinatorics, 9% combine combinatorics and group theory, and 8% combine number theory and group theory. Over 200 distinct solution techniques are represented across the benchmark, and even the most common techniques (generating functions, recurrences, special functions) each appear in fewer than 5% of problems[4].
The process for creating and reviewing FrontierMath problems involves multiple stages[4]:
| Stage | Process | Quality control |
|---|---|---|
| Problem design | Expert mathematicians create original problems in their research areas | Must satisfy all four core requirements |
| Solution development | Authors write a solution script in Python that computes the answer | Script must terminate in under one minute |
| Verification design | Develop automated checking methods using exact matching, SymPy evaluation, or computational verification | Ensure answers are unambiguous |
| Blind peer review | At least one domain expert mathematician reviews each problem without knowledge of the solution approach | Reviewers assess correctness, ambiguity, guessproofness, and difficulty ratings |
| Second-round review | A random subset of 25 problems receives an additional blind review | Provides error rate estimates |
| Error correction | Problems flagged during review are revised or removed | Estimated error rate: roughly 10% (1 incorrect answer found in 25 reviewed problems) |
| Final validation | Complete verification testing on all accepted problems | Confirms automated checking works reliably |
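A minimal sketch of what the solution-development stage in the table above implies in practice, assuming a hypothetical `solve()` function standing in for an author's actual computation; the one-minute runtime budget is the only constraint taken from the source.

```python
# Hypothetical shape of an author's solution script: it must compute the
# answer deterministically and finish within the one-minute budget.
import time

def solve():
    # placeholder for the author's actual computation
    # (e.g. a finite search or a closed-form evaluation)
    return 3677073

start = time.perf_counter()
answer = solve()
elapsed = time.perf_counter() - start
assert elapsed < 60, f"solution script exceeded the one-minute budget ({elapsed:.1f}s)"
print(answer)
```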
Because the value of the benchmark depends on problems being unknown to AI training pipelines, Epoch AI employs several security measures, chief among them keeping the problem set private and distributing it in encrypted form[4].
Each problem undergoes a guessproofness check to confirm that the answer space is large enough (typically exceeding 10^6 possibilities) and that no obvious patterns would allow a model to stumble on the correct answer. Problems typically require large, non-obvious numerical answers or complex mathematical objects as solutions. The target is a less than 1% success rate for random or heuristic guessing[4].
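A back-of-the-envelope version of that guessproofness target, where the candidate-count parameter is an assumption used for illustration rather than part of Epoch AI's actual procedure:

```python
# Rough guessproofness check: with an answer space above 10**6 possibilities
# and no structural shortcut, even a generous heuristic guesser stays
# well under the 1% success target.
def guess_success_probability(answer_space_size, heuristic_guesses=1_000):
    """Upper-bound the chance of hitting the answer by sampling candidates."""
    return min(1.0, heuristic_guesses / answer_space_size)

assert guess_success_probability(10**6) < 0.01   # 0.1% < 1% target
```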
Models are evaluated in an interactive Python environment. The evaluation framework gives each model access to the following capabilities[4]:
| Capability | Description |
|---|---|
| Code execution | Write and run Python code to perform calculations |
| Library access | Use standard mathematical libraries (SymPy, NumPy, SciPy, etc.) |
| Iterative problem solving | Multiple attempts are allowed within the token budget |
| Result verification | Models can check intermediate results before final submission |
For Tier 4 evaluations, models receive a 1,000,000-token hard limit with a 660,000-token warning threshold. The model submits a Python function that returns its answer after reasoning and code execution[5].
When a model submits its answer, verification proceeds automatically[4]:
| Method | Description | Example |
|---|---|---|
| Exact integer matching | Compare submitted integer to known answer | "The answer is 3677073" |
| SymPy symbolic evaluation | Check if the difference between submitted and known expressions simplifies to zero | Polynomial equality |
| Computational object verification | Verify properties of submitted mathematical structures | Check that a submitted matrix satisfies required group properties |
| Numerical tolerance | For floating-point answers, check within a specified tolerance | Approximation results |
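A minimal sketch of the verification routes listed in the table above. The dispatch function, its argument names, and the tolerance value are assumptions for illustration, not Epoch AI's actual code; the integer test uses the published sample answer.

```python
# Simplified dispatch over the verification methods described above.
import math
import sympy as sp

def verify(submitted, known, kind, rel_tol=1e-9):
    if kind == "integer":
        return int(submitted) == int(known)            # exact integer matching
    if kind == "symbolic":
        # check that the difference between the expressions simplifies to zero
        return sp.simplify(sp.sympify(submitted) - sp.sympify(known)) == 0
    if kind == "numeric":
        return math.isclose(float(submitted), float(known), rel_tol=rel_tol)
    raise ValueError(f"unknown verification kind: {kind}")

assert verify(3677073, 3677073, "integer")                        # sample answer
assert verify("(sqrt(5) - 1)/2", "sqrt(5)/2 - 1/2", "symbolic")   # same value, different form
```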
The model's code must include a specific marker comment (# This is the final answer), save the result using Python's pickle module, and be fully self-contained[4].
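A hedged sketch of what a conforming final submission might look like under those rules; the function name and output filename are illustrative assumptions.

```python
# Self-contained submission: compute the answer, mark it, and pickle the result.
import pickle

def final_answer():
    # reasoning and supporting computation would normally happen here
    return 3677073

# This is the final answer
with open("answer.pkl", "wb") as f:
    pickle.dump(final_answer(), f)
```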
The following table shows how model performance has evolved since the benchmark's release[1][7][8][9][10]:
| Model | Organization | Score | Date | Notes |
|---|---|---|---|---|
| GPT-5.4 Pro | OpenAI | ~50% | March 2026 | Current SOTA; also scored 38% on Tier 4 |
| GPT-5.2 (Thinking) | OpenAI | 40.3% | Late 2025 | First model above 40% |
| GPT-5.1 | OpenAI | 26.7% | 2025 | Multiple variants at same score |
| GPT-5 | OpenAI | 26.3% | 2025 | - |
| GPT-5 mini | OpenAI | 22.1% | 2025 | - |
| o3 (public release) | OpenAI | ~10% (Epoch AI), 25.2% (OpenAI internal) | April 2025 / December 2024 | Score discrepancy became controversial (see below) |
| Grok 4 | xAI | ~14% | 2025 | - |
| Gemini 2.5 Pro | Google DeepMind | ~11% | 2025 | - |
| o3-mini | OpenAI | 8.9-9.2% | 2025 | Medium reasoning setting |
| Claude Opus 4.1 | Anthropic | ~7% | 2025 | Epoch AI evaluation |
| o1 | OpenAI | 5.5% | 2025 | - |
| DeepSeek R1 | DeepSeek | 5.2% | 2025 | Open-source leader at the time |
| Gemini 2.0 Flash Thinking | Google DeepMind | 2.6% | 2025 | Experimental version |
| Claude 3.5 Sonnet | Anthropic | <2% | November 2024 | Initial evaluation |
| GPT-4o | OpenAI | <2% | November 2024 | Initial evaluation |
| o1-preview | OpenAI | <2% | November 2024 | Initial evaluation |
| Gemini 1.5 Pro | Google DeepMind | <2% | November 2024 | Initial evaluation |
| Grok 2 Beta | xAI | <2% | November 2024 | Initial evaluation |
Tier 4 scores are reported separately due to the significantly higher difficulty[5]:
| Model | Score | Notes |
|---|---|---|
| GPT-5.4 Pro | ~38% | March 2026 |
| GPT-5.2 Pro | Highest (pre-correction) | Benefited from grader corrections in v1.1.4 |
| Gemini 3 Pro | 19% (+/- 6%) | 3 of 48 samples failed due to API errors |
| Grok 4 | 2% (+/- 2%) | 8 of 48 samples had API errors |
| DeepSeek V3.2 (Thinking) | ~2% | Only Chinese-origin model to score above zero on Tier 4 |
In the original November 2024 evaluation, the paper's authors documented several behavioral patterns across the six tested models[4].
Four prominent mathematicians were interviewed for the FrontierMath paper: Terence Tao (2006 Fields Medalist), Timothy Gowers (1998 Fields Medalist), Richard Borcherds (1998 Fields Medalist), and Evan Chen (IMO coach and benchmark co-author). Their comments offer a window into how professional mathematicians view the benchmark's difficulty and significance[4].
Tao contributed several problems to the benchmark and reviewed others. He described the problems as "extremely challenging" and predicted the benchmark would "resist AIs for several years at least." On the scarcity of relevant training data, Tao observed that for many FrontierMath problems, the relevant material is "almost nonexistent... you're talking like a dozen papers with relevant things"[4].
Tao suggested that human experts working alongside AI systems could tackle FrontierMath problems within about three years, noting that guiding current AI to correct solutions takes "about five times as much effort" as solving the problems directly. He expected this ratio to improve and eventually drop below 1 for certain problems within a few years, given sufficient tooling and capability improvements[4].
On practical considerations, Tao remarked that if AI tools require "three days of compute off of all of Google to solve each problem... that's less of a useful tool"[4].
Gowers reported that "all of the problems I looked at were not really in my area and all looked like things I had no idea how to solve." He emphasized that the problems "appear to be at a different level of difficulty from IMO problems," requiring familiarity with "the tricks of the trade of some particular branch of maths," a kind of domain knowledge that is hard to acquire without substantial, specialized training data[4].
Gowers also offered a practical vision for AI in mathematics, suggesting that AI systems could help with "slightly boring bits of doing research where you, for example, make some conjecture that would be useful, but you're not quite sure if it's true... it could be a very, very nice time saving device"[4].
Borcherds was described in the paper as "the most bullish" among the interviewees about AI's potential in mathematics. He did note, however, that the benchmark problems "aren't quite the same as coming up with original proofs," drawing a distinction between solving a problem with a known answer and generating new mathematical knowledge[4].
Evan Chen, a well-known mathematics educator and IMO coach who also co-authored the FrontierMath paper, published a separate blog post analyzing the benchmark's design philosophy. He noted that FrontierMath inverts two of the three desirable properties of traditional competition problems (like those at the IMO or Putnam exam). While FrontierMath retains the requirement for creative insight, it deliberately abandons the simplicity requirement and assumes the solver has "access to a Python console and a lot of reference text." Chen praised the authors for being "pretty ruthless about rejecting problems for which they felt it was possible to guess the answer" through engineer's induction[11].
Chen identified a key advantage of FrontierMath's design: its ability to use "easily verifiable solutions" through code implementation, similar to the International Olympiad in Informatics or Project Euler. This contrasts with pencil-and-paper competitions where human coordinators must evaluate proofs[11].
On December 20, 2024, OpenAI announced its o3 reasoning model and reported a 25.2% score on FrontierMath, a dramatic leap from the previous best of under 2%. This result was highlighted during the o3 launch event as evidence of a breakthrough in mathematical reasoning[7].
On April 18, 2025, Epoch AI published its own independent evaluation of the publicly released o3 model, reporting a score of approximately 10%, significantly below OpenAI's claim. Epoch AI identified several factors that could explain the discrepancy[8]:
| Factor | OpenAI's testing (December 2024) | Epoch AI's testing (April 2025) |
|---|---|---|
| Model version | Pre-release internal version | Public release version, "tuned for chat/product use" |
| Compute resources | "Aggressive test-time compute" | Standard compute tiers |
| Problem set | 180 problems (frontiermath-2024-11-26) | 290 problems (frontiermath-2025-02-28) |
| Scaffolding | Internal advanced scaffold | Public API scaffold |
Epoch AI noted: "The difference between our results and OpenAI's might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time computing, or because those results were run on a different subset of FrontierMath"[8].
The o3 announcement also triggered scrutiny of the financial relationship between OpenAI and Epoch AI. On the same day o3 was announced (December 20, 2024), Epoch AI disclosed that OpenAI had funded the creation of FrontierMath. Several problems quickly emerged: the funding had not been disclosed until after the benchmark had been used to showcase o3, many contributing mathematicians had not been told of OpenAI's involvement, and OpenAI had been given access to a large portion of the problems and their solutions[12][13]:
The controversy drew criticism from multiple outlets. Fortune described it as "manipulative and disgraceful." TechCrunch reported that the benchmarking organization was "criticized for waiting to disclose funding from OpenAI." The incident raised broader questions about independence in AI benchmarking and the risks of conflicts of interest when AI companies fund the benchmarks used to evaluate their own models[12][13].
Epoch AI is primarily funded by Open Philanthropy, and the OpenAI funding for FrontierMath was a separate, project-specific arrangement[12].
On March 24, 2026, Epoch AI confirmed that GPT-5.4 Pro had produced a verified solution to a genuinely open mathematical problem on FrontierMath: a Ramsey-style problem on hypergraphs that had remained unsolved since it was posed by mathematicians Will Brian and Paul Larson in a 2019 paper. The solution was first elicited by researchers Kevin Barreto and Liam Price using GPT-5.4 Pro. Problem contributor Will Brian confirmed the solution's correctness, and a write-up is being prepared for publication[14].
This marked the first time an AI model produced a novel solution to an open problem on the FrontierMath benchmark. After the initial solve, several other frontier models also solved the same problem: Claude Opus 4.6 (max), Gemini 3.1 Pro, and GPT-5.4 (xhigh). The fact that multiple models could solve it suggests the problem sat at the boundary of current frontier model capabilities[14].
The Ramsey hypergraph result is part of a wider trend. Since Christmas 2025, 15 open mathematical problems have moved from unsolved to solved, with 11 of them (73%) credited to AI involvement. However, Epoch AI also noted that when GPT-5.4 Pro was evaluated on the full set of FrontierMath Open Problems, it "did not solve any problems" other than the Ramsey one, and its novel observations on one other problem were "of a form that the author had anticipated and characterized as relatively uninteresting"[14].
While most problems remain private to prevent contamination, the original paper includes five public sample problems at varying difficulty levels[4]:
| Problem | Author | Difficulty | Field (MSC) | Key techniques | Creativity (hours) | Execution (hours) | Answer |
|---|---|---|---|---|---|---|---|
| Testing Artin's primitive root conjecture | O. Jarviniemi | Research level | Number theory (11) | Frobenius elements, Artin symbols | 4 | 15 | 3,677,073 |
| Find degree 19 polynomial | A. Kite | Research level | Algebraic geometry (14), Group theory (20), Number theory (11) | Monodromy, branch loci | 3 | 4 | 1,876,572,071,974,094,803,391,179 |
| Prime field continuous extensions | D. Chicharro | Graduate level | Number theory (11) | p-adic analysis, recurrences | 3 | 3 | 9,811 |
| Coxeter group problem | P. Enugandla | Graduate level | Group theory (20) | Coxeter groups, characters | 2 | 3 | (not disclosed in sample) |
| Algebraic geometry/number theory problem | A. Gunning | Undergraduate level | Algebraic geometry (14), Number theory (11) | Hasse-Weil bound | 2 | 2 | (not disclosed in sample) |
These samples illustrate several features of the benchmark: answers are large, non-obvious integers (making them guessproof); problems span multiple mathematical fields; and even the "easiest" problem requires two hours of creative work from an expert.
| Benchmark | AI performance (approximate best) | Typical problem level | Typical solving time (human) | Primary limitation |
|---|---|---|---|---|
| GSM8K | >95% | Grade school | Minutes | Saturated since 2024 |
| MATH | >90% | High school/competition | 30 minutes | Saturated; data contamination risk |
| AIME | 70-90% | Competition mathematics | Hours | Approaching saturation |
| MMLU (math subset) | >85% | Mixed undergraduate | Varies | Not math-specific |
| FrontierMath (Tiers 1-3) | ~50% | Undergraduate to postdoc | Hours to days | Still challenging; roughly half unsolved |
| FrontierMath (Tier 4) | ~38% | Research level | Days to weeks | Very few models score above single digits |
| FrontierMath (Open Problems) | 1 problem solved | Unsolved research | Weeks to years | Virtually all problems remain unsolved |
| Feature | FrontierMath | Typical math benchmarks |
|---|---|---|
| Problem source | Original, unpublished, created by active researchers | Often drawn from textbooks, competitions, or publicly available problem sets |
| Answer verification | Fully automated via Python/SymPy | Often requires human grading or proof checking |
| Data contamination risk | Minimal (private problem set, encrypted distribution) | High (problems publicly available, may appear in training data) |
| Difficulty range | Undergraduate through active research | Typically grade school through undergraduate |
| Time investment per problem | Hours to days for experts | Minutes to hours |
| Multi-domain integration | 44% of problems involve multiple mathematical fields | Most problems stay within a single topic |
| Name | Fields Medal year | Role |
|---|---|---|
| Terence Tao | 2006 | Problem creation, review, and interview |
| Timothy Gowers | 1998 | Problem review and interview |
| Richard Borcherds | 1998 | Problem review and interview |
| Name | Role | Background |
|---|---|---|
| Elliot Glazer | Lead mathematician | Ph.D. in mathematics from Harvard (set theory under Hugh Woodin) |
| Tamay Besiroglu | Associate director, Epoch AI | Previously at MIT Future Tech Lab; led strategy for Metaculus |
| Ege Erdil | Core contributor | Epoch AI researcher |
| Evan Chen | Co-author and contributor | IMO coach, mathematics educator |
Over 60 mathematicians from leading institutions contributed, including researchers from MIT, Harvard, Princeton, Stanford, Cambridge, Oxford, Cornell, UC Berkeley, King's College London, the University of Leicester, the University of Siegen, ICMC USP (Brazil), and Bristol University, among others.
Models interact with a Python environment where they can write and execute code, test hypotheses, and submit answers. A simplified conceptual overview of the evaluation framework:
```python
# Conceptual evaluation framework (simplified). PythonEnvironment and the
# model.* methods stand in for Epoch AI's internal sandbox and model interface.
class FrontierMathEvaluator:
    def evaluate_model(self, model, problem, max_attempts=10):
        environment = PythonEnvironment()  # sandboxed interactive Python session
        for attempt in range(max_attempts):
            # The model writes code conditioned on the problem and the session state so far
            code = model.generate_code(problem, environment.state)
            result = environment.execute(code)  # run the code and capture the result
            # The model may inspect intermediate results before committing to an answer
            if model.verify_answer(result, problem):
                # Automated check against the known answer (exact, symbolic, or numeric)
                return self.check_solution(result, problem.answer)
        return False
```
| Access level | Description | How to obtain |
|---|---|---|
| Public samples | Small set of example problems with full solutions | Free access via epoch.ai/frontiermath |
| Open Problems verifiers | Solution verifiers for the 14 open problems | Partnership with Epoch AI (math@epoch.ai); uniform access fee |
| Research evaluation | Full benchmark evaluation on the private set | Contact math_evals@epoch.ai |
| Commercial evaluation | Model testing service | Partnership with Epoch AI |
| Problem contribution | Submit new problems for inclusion | Expert mathematician credentials required |
FrontierMath's development has been supported by OpenAI, which funded the benchmark's creation, and by Epoch AI's primary funder, Open Philanthropy[12]. Ongoing and planned development work includes:
| Initiative | Description | Status |
|---|---|---|
| Problem expansion | Adding new problems to Tiers 1-4 | Ongoing; quarterly updates |
| Domain coverage | Expanding to additional mathematical fields | 2025-2026 |
| Tier 4 updates | Bug fixes and grader corrections (version bumped to 1.1.4 in 2026) | Ongoing |
| Open Problems growth | Expanding beyond the 14-problem pilot set | Planning stage |
| Verification improvements | Refining automated checking methods | Continuous |
FrontierMath has had a measurable effect on the AI research community since its release.
The trajectory from under 2% (November 2024) to roughly 50% (March 2026) on Tiers 1-3 is one of the fastest rates of improvement on any major AI benchmark. Yet the benchmark remains far from saturated. Tier 4 scores remain below 40% for the best model and in single digits for most, and virtually all Open Problems remain unsolved[1][5][14].
| Limitation | Description | Mitigation |
|---|---|---|
| Limited public access | Most problems are private to preserve benchmark integrity | Necessary trade-off; sample problems are publicly available |
| Narrow scope | Only tests mathematical problem-solving; does not assess proof writing, mathematical intuition, or pedagogical ability | Complements other benchmarks |
| English only | All problems are written in English | Future multilingual expansion is planned |
| Computational bias | Problems must have automatically verifiable answers, excluding proof-based and open-ended mathematical reasoning | Acknowledged limitation of the automated verification approach |
| Estimated error rate | Roughly 10% of problems may contain errors based on review sampling | Ongoing review and correction process |
Several criticisms have been raised since the benchmark's launch: