| FrontierMath | |
|---|---|
| **Overview** | |
| Full name | FrontierMath |
| Abbreviation | |
| Description | Research-level mathematical reasoning benchmark with automatically verifiable answers |
| Release date | November 8, 2024 |
| Latest version | 1.1.4 (Tier 4, 2026) |
| Benchmark updated | Quarterly problem additions; ongoing grader corrections |
| Authors | Elliot Glazer, Tamay Besiroglu, Ege Erdil, and over 60 contributing mathematicians |
| Organization | Epoch AI |
| **Technical Details** | |
| Type | Mathematical reasoning benchmark |
| Modality | Text, with an interactive Python environment |
| Task format | Problems whose final answers are checked automatically (Python/SymPy objects) |
| Number of tasks | 300 (Tiers 1-3), 50 (Tier 4), 14 (Open Problems) |
| Total examples | 364 (all components combined) |
| Evaluation metric | Percentage of problems solved (automated verification) |
| Domains | Most major branches of modern mathematics (MSC2020 classification) |
| Languages | English |
| **Performance** | |
| Human performance | Hours to days of expert effort per problem |
| Baseline | <2% (all frontier models, November 2024) |
| SOTA score | ~50% (Tiers 1-3); ~38% (Tier 4) |
| SOTA model | GPT-5.4 Pro (OpenAI) |
| SOTA date | March 2026 |
| Saturated | No |
| **Resources** | |
| Website | epoch.ai/frontiermath |
| Paper | |
| Dataset | Mostly private; sample problems public |
| License | |
FrontierMath is an advanced mathematical reasoning benchmark created by Epoch AI in collaboration with over 60 expert mathematicians, including Fields Medalists Terence Tao, Timothy Gowers, and Richard Borcherds. First published on November 8, 2024, FrontierMath consists of hundreds of original, research-level mathematics problems designed to test the outer limits of artificial intelligence systems' mathematical capabilities. At launch, every frontier AI model scored below 2% on the benchmark. By March 2026, the best-performing model, OpenAI's GPT-5.4 Pro, solved roughly 50% of Tier 1-3 problems, a more than twentyfold improvement in under two years that still leaves half the benchmark unsolved[1].
The project also includes FrontierMath: Open Problems, a pilot collection of 14 genuinely unsolved mathematical problems whose solutions, if found, would advance the state of human mathematical knowledge[2].
By 2024, the most widely used mathematical benchmarks for AI had become saturated. Models routinely scored above 95% on GSM8K (grade-school math), above 90% on the MATH dataset (competition-level problems), and 70-90% on AIME-style questions[3]. These high scores made it difficult to distinguish between models or to measure genuine progress in mathematical reasoning.
Epoch AI, a nonprofit research organization focused on tracking AI progress, set out to build a benchmark that would remain challenging for years. The core idea was straightforward: recruit active research mathematicians to write problems drawn from their own fields, problems that require hours or days of expert effort and whose answers can be checked automatically by a computer program.
Elliot Glazer, the project's lead mathematician, holds a Ph.D. in mathematics from Harvard, where he studied set theory under Hugh Woodin. He was joined by Tamay Besiroglu, Epoch AI's associate director, and Ege Erdil as the three core contributors. The broader team eventually grew to include over 60 mathematicians from institutions such as MIT, Harvard, Princeton, Stanford, Cambridge, Oxford, the University of Leicester, King's College London, Cornell, UC Berkeley, and Bristol University, among others. Fourteen IMO gold medalists and three Fields Medal recipients participated in problem creation or review[4].
FrontierMath has expanded since its initial release into three distinct components, each targeting a different level of mathematical difficulty.
The original base set contains 300 problems spanning difficulty from advanced undergraduate to early postdoctoral level. This set forms the core benchmark used in most published evaluations. Problems are classified using the Mathematics Subject Classification (MSC2020) system and cover virtually every major branch of modern mathematics[4].
Released on July 1, 2025, Tier 4 adds 50 exceptionally difficult research-level problems to the benchmark. These problems were largely designed or refined during a symposium attended by leading mathematicians, where each problem was tested and approved by a panel of experts. Of the 50 Tier 4 problems, 2 are public and 48 are private. Even the strongest AI systems as of mid-2025, including OpenAI's o4-mini, Anthropic's Claude Opus 4, and Google's Gemini 2.5 Pro, achieve only single-digit success rates on Tier 4[5].
On January 27, 2026, Epoch AI launched FrontierMath: Open Problems, a pilot benchmark of 14 genuinely unsolved mathematical research problems. Unlike the main benchmark, where each problem has a known solution created by an expert, these are problems that professional mathematicians have attempted and failed to solve. The pilot set tilts toward combinatorics and number theory, the areas that yielded the most problems amenable to automatic verification[2].
Each open problem includes a difficulty estimate from its contributor. Estimated solving times range from one to four weeks at the low end to three to ten years at the high end. The number of serious human attempts per problem ranges from two or three mathematicians to over fifty. Significance ratings span from "moderately interesting results" to "major breakthroughs"[2].
Two problems added to the benchmark in February 2026 illustrate the scope: finding a Hadamard matrix of order 668 (the smallest order for which none is known) and proving that certain "small" Diophantine equations have infinitely many solutions[6].
Every FrontierMath problem must satisfy four requirements before it enters the benchmark[4]:
| Requirement | Description | Purpose |
|---|---|---|
| Originality | Problems build on existing ideas in novel, non-obvious ways through clever adaptations or innovative combinations | Prevents data contamination from training sets |
| Automated verifiability | Solutions must be computable and expressible as Python objects or SymPy structures (integers, symbolic expressions, matrices, sets) | Allows scalable, objective evaluation |
| Guessproofness | Less than 1% probability of arriving at the correct answer without performing the required mathematical work | Ensures models cannot succeed through random guessing or superficial heuristics |
| Computational tractability | Solution verification scripts must run in under one minute on standard hardware | Keeps evaluation practical |
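As an illustration of the automated-verifiability requirement in the table above, the sketch below shows the kinds of answer objects the benchmark accepts as final answers. The specific values are illustrative only; the integer is the published answer to one of the public sample problems, and the other objects are invented examples.

```python
# Illustrative answer objects of the kinds the benchmark accepts
# (exact integers, SymPy expressions, and concrete mathematical structures).
import sympy as sp

answer_integer = 3677073                                 # large exact integer (public sample answer)
answer_symbolic = sp.Rational(1, 2) * (sp.sqrt(5) - 1)   # exact symbolic expression
answer_matrix = sp.Matrix([[0, -1], [1, 0]])              # a concrete matrix object

# Each of these can be serialized and compared programmatically,
# which is what makes scalable automated grading possible.
print(answer_integer, answer_symbolic, answer_matrix)
```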
Each problem is rated along three dimensions by its creator and at least one peer reviewer[4]:
| Dimension | Scale | Description |
|---|---|---|
| Background knowledge | 1-5 | 1 = high school level; 2 = early undergraduate; 3 = late undergraduate; 4 = graduate; 5 = research level |
| Creativity | Hours (unbounded) | Time an expert in the relevant field would need to identify the key solution ideas |
| Execution | Hours (unbounded) | Time to compute the final answer once the key ideas are identified, including writing any necessary code |
The authors note that these ratings provide rough guidance rather than definitive claims, since problems can become easier once a specific technique is known, and multiple solution paths of varying difficulty may exist[4].
The benchmark spans most major branches of modern mathematics. The distribution of problems by MSC2020 primary classification is as follows[4]:
| MSC Code | Field | Share of problems | Involvement in multi-domain problems |
|---|---|---|---|
| 11 | Number theory | 17.8% | 44% of all problems involve number theory |
| 05 | Combinatorics | 15.8% | 39% of all problems involve combinatorics |
| 20 | Group theory | 8.9% | 22% of all problems involve group theory |
| 60 | Probability theory | 5.1% | - |
| 15 | Linear algebra | 4.8% | - |
| 14 | Algebraic geometry | 4.8% | - |
| 33 | Special functions | 4.8% | - |
| 55 | Algebraic topology | 3.1% | - |
| 12 | Field theory | 2.4% | - |
| 30 | Complex analysis | 2.4% | - |
| 68 | Computer science | 2.4% | - |
| 18 | Category theory | 2.4% | - |
| 57 | Manifolds and cell complexes | 2.1% | - |
| 13 | Commutative algebra | 2.1% | - |
| Other | 17 additional fields | 21.1% | Includes PDEs, differential geometry, harmonic analysis, statistical mechanics, and more |
Notably, 13% of problems combine number theory and combinatorics, 9% combine combinatorics and group theory, and 8% combine number theory and group theory. Over 200 distinct solution techniques are represented across the benchmark, and even the most common techniques (generating functions, recurrences, special functions) each appear in fewer than 5% of problems[4].
The process for creating and reviewing FrontierMath problems involves multiple stages[4]:
| Stage | Process | Quality control |
|---|---|---|
| Problem design | Expert mathematicians create original problems in their research areas | Must satisfy all four core requirements |
| Solution development | Authors write a solution script in Python that computes the answer | Script must terminate in under one minute |
| Verification design | Develop automated checking methods using exact matching, SymPy evaluation, or computational verification | Ensure answers are unambiguous |
| Blind peer review | At least one domain expert mathematician reviews each problem without knowledge of the solution approach | Reviewers assess correctness, ambiguity, guessproofness, and difficulty ratings |
| Second-round review | A random subset of 25 problems receives an additional blind review | Provides error rate estimates |
| Error correction | Problems flagged during review are revised or removed | Estimated error rate: roughly 10% (1 incorrect answer found in 25 reviewed problems) |
| Final validation | Complete verification testing on all accepted problems | Confirms automated checking works reliably |
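A minimal sketch of what the solution-development stage in the table above implies in practice, assuming a hypothetical `solve()` function standing in for an author's actual computation; the one-minute runtime budget is the only constraint taken from the source.

```python
# Hypothetical shape of an author's solution script: it must compute the
# answer deterministically and finish within the one-minute budget.
import time

def solve():
    # placeholder for the author's actual computation
    # (e.g. a finite search or a closed-form evaluation)
    return 3677073

start = time.perf_counter()
answer = solve()
elapsed = time.perf_counter() - start
assert elapsed < 60, f"solution script exceeded the one-minute budget ({elapsed:.1f}s)"
print(answer)
```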
Because the value of the benchmark depends on problems being unknown to AI training pipelines, Epoch AI employs several security measures, chief among them keeping the problem set private and distributing it in encrypted form[4].
Each problem undergoes a guessproofness check to confirm that the answer space is large enough (typically exceeding 10^6 possibilities) and that no obvious patterns would allow a model to stumble on the correct answer. Problems typically require large, non-obvious numerical answers or complex mathematical objects as solutions. The target is a less than 1% success rate for random or heuristic guessing[4].
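A back-of-the-envelope version of that guessproofness target, where the candidate-count parameter is an assumption used for illustration rather than part of Epoch AI's actual procedure:

```python
# Rough guessproofness check: with an answer space above 10**6 possibilities
# and no structural shortcut, even a generous heuristic guesser stays
# well under the 1% success target.
def guess_success_probability(answer_space_size, heuristic_guesses=1_000):
    """Upper-bound the chance of hitting the answer by sampling candidates."""
    return min(1.0, heuristic_guesses / answer_space_size)

assert guess_success_probability(10**6) < 0.01   # 0.1% < 1% target
```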
Models are evaluated in an interactive Python environment. The evaluation framework gives each model access to the following capabilities[4]:
| Capability | Description |
|---|---|
| Code execution | Write and run Python code to perform calculations |
| Library access | Use standard mathematical libraries (SymPy, NumPy, SciPy, etc.) |
| Iterative problem solving | Multiple attempts are allowed within the token budget |
| Result verification | Models can check intermediate results before final submission |
For Tier 4 evaluations, models receive a 1,000,000-token hard limit with a 660,000-token warning threshold. The model submits a Python function that returns its answer after reasoning and code execution[5].
When a model submits its answer, verification proceeds automatically[4]:
| Method | Description | Example |
|---|---|---|
| Exact integer matching | Compare submitted integer to known answer | "The answer is 3677073" |
| SymPy symbolic evaluation | Check if the difference between submitted and known expressions simplifies to zero | Polynomial equality |
| Computational object verification | Verify properties of submitted mathematical structures | Check that a submitted matrix satisfies required group properties |
| Numerical tolerance | For floating-point answers, check within a specified tolerance | Approximation results |
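A minimal sketch of the verification routes listed in the table above. The dispatch function, its argument names, and the tolerance value are assumptions for illustration, not Epoch AI's actual code; the integer test uses the published sample answer.

```python
# Simplified dispatch over the verification methods described above.
import math
import sympy as sp

def verify(submitted, known, kind, rel_tol=1e-9):
    if kind == "integer":
        return int(submitted) == int(known)            # exact integer matching
    if kind == "symbolic":
        # check that the difference between the expressions simplifies to zero
        return sp.simplify(sp.sympify(submitted) - sp.sympify(known)) == 0
    if kind == "numeric":
        return math.isclose(float(submitted), float(known), rel_tol=rel_tol)
    raise ValueError(f"unknown verification kind: {kind}")

assert verify(3677073, 3677073, "integer")                        # sample answer
assert verify("(sqrt(5) - 1)/2", "sqrt(5)/2 - 1/2", "symbolic")   # same value, different form
```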
The model's code must include a specific marker comment (# This is the final answer), save the result using Python's pickle module, and be fully self-contained[4].
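A hedged sketch of what a conforming final submission might look like under those rules; the function name and output filename are illustrative assumptions.

```python
# Self-contained submission: compute the answer, mark it, and pickle the result.
import pickle

def final_answer():
    # reasoning and supporting computation would normally happen here
    return 3677073

# This is the final answer
with open("answer.pkl", "wb") as f:
    pickle.dump(final_answer(), f)
```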
The following table shows how model performance has evolved since the benchmark's release[1][7][8][9][10]:
| Model | Organization | Score | Date | Notes |
|---|---|---|---|---|
| GPT-5.4 Pro | OpenAI | ~50% | March 2026 | Current SOTA; also scored 38% on Tier 4 |
| GPT-5.2 (Thinking) | OpenAI | 40.3% | Late 2025 | First model above 40% |
| GPT-5.1 | OpenAI | 26.7% | 2025 | Multiple variants at same score |
| GPT-5 | OpenAI | 26.3% | 2025 | - |
| GPT-5 mini | OpenAI | 22.1% | 2025 | - |
| o3 (public release) | OpenAI | ~10% (Epoch AI), 25.2% (OpenAI internal) | April 2025 / December 2024 | Score discrepancy became controversial (see below) |
| Grok 4 | xAI | ~14% | 2025 | - |
| Gemini 2.5 Pro | Google DeepMind | ~11% | 2025 | - |
| o3-mini | OpenAI | 8.9-9.2% | 2025 | Medium reasoning setting |
| Claude Opus 4.1 | Anthropic | ~7% | 2025 | Epoch AI evaluation |
| o1 | OpenAI | 5.5% | 2025 | - |
| DeepSeek R1 | DeepSeek | 5.2% | 2025 | Open-source leader at the time |
| Gemini 2.0 Flash Thinking | Google DeepMind | 2.6% | 2025 | Experimental version |
| Claude 3.5 Sonnet | Anthropic | <2% | November 2024 | Initial evaluation |
| GPT-4o | OpenAI | <2% | November 2024 | Initial evaluation |
| o1-preview | OpenAI | <2% | November 2024 | Initial evaluation |
| Gemini 1.5 Pro | Google DeepMind | <2% | November 2024 | Initial evaluation |
| Grok 2 Beta | xAI | <2% | November 2024 | Initial evaluation |
Tier 4 scores are reported separately due to the significantly higher difficulty[5]:
| Model | Score | Notes |
|---|---|---|
| GPT-5.4 Pro | ~38% | March 2026 |
| GPT-5.2 Pro | Highest (pre-correction) | Benefited from grader corrections in v1.1.4 |
| Gemini 3 Pro | 19% (+/- 6%) | 3 of 48 samples failed due to API errors |
| Grok 4 | 2% (+/- 2%) | 8 of 48 samples had API errors |
| DeepSeek V3.2 (Thinking) | ~2% | Only Chinese-origin model to score above zero on Tier 4 |
In the original November 2024 evaluation, the paper's authors documented several behavioral patterns across the six tested models[4].
Four prominent mathematicians were interviewed for the FrontierMath paper: Terence Tao (2006 Fields Medalist), Timothy Gowers (1998 Fields Medalist), Richard Borcherds (1998 Fields Medalist), and Evan Chen (IMO coach and benchmark co-author). Their comments offer a window into how professional mathematicians view the benchmark's difficulty and significance[4].
Tao contributed several problems to the benchmark and reviewed others. He described the problems as "extremely challenging" and predicted the benchmark would "resist AIs for several years at least." On the scarcity of relevant training data, Tao observed that for many FrontierMath problems, the relevant material is "almost nonexistent... you're talking like a dozen papers with relevant things"[4].
Tao suggested that human experts working alongside AI systems could tackle FrontierMath problems within about three years, noting that guiding current AI to correct solutions takes "about five times as much effort" as solving the problems directly. He expected this ratio to improve and eventually drop below 1 for certain problems within a few years, given sufficient tooling and capability improvements[4].
On practical considerations, Tao remarked that if AI tools require "three days of compute off of all of Google to solve each problem... that's less of a useful tool"[4].
Gowers reported that "all of the problems I looked at were not really in my area and all looked like things I had no idea how to solve." He emphasized that the problems "appear to be at a different level of difficulty from IMO problems," requiring familiarity with "the tricks of the trade of some particular branch of maths," a kind of domain knowledge that is hard to acquire without substantial, specialized training data[4].
Gowers also offered a practical vision for AI in mathematics, suggesting that AI systems could help with "slightly boring bits of doing research where you, for example, make some conjecture that would be useful, but you're not quite sure if it's true... it could be a very, very nice time saving device"[4].
Borcherds was described in the paper as "the most bullish" among the interviewees about AI's potential in mathematics. He did note, however, that the benchmark problems "aren't quite the same as coming up with original proofs," drawing a distinction between solving a problem with a known answer and generating new mathematical knowledge[4].
Evan Chen, a well-known mathematics educator and IMO coach who also co-authored the FrontierMath paper, published a separate blog post analyzing the benchmark's design philosophy. He noted that FrontierMath inverts two of the three desirable properties of traditional competition problems (like those at the IMO or Putnam exam). While FrontierMath retains the requirement for creative insight, it deliberately abandons the simplicity requirement and assumes the solver has "access to a Python console and a lot of reference text." Chen praised the authors for being "pretty ruthless about rejecting problems for which they felt it was possible to guess the answer" through engineer's induction[11].
Chen identified a key advantage of FrontierMath's design: its ability to use "easily verifiable solutions" through code implementation, similar to the International Olympiad in Informatics or Project Euler. This contrasts with pencil-and-paper competitions where human coordinators must evaluate proofs[11].
On December 20, 2024, OpenAI announced its o3 reasoning model and reported a 25.2% score on FrontierMath, a dramatic leap from the previous best of under 2%. This result was highlighted during the o3 launch event as evidence of a breakthrough in mathematical reasoning[7].
On April 18, 2025, Epoch AI published its own independent evaluation of the publicly released o3 model, reporting a score of approximately 10%, significantly below OpenAI's claim. Epoch AI identified several factors that could explain the discrepancy[8]:
| Factor | OpenAI's testing (December 2024) | Epoch AI's testing (April 2025) |
|---|---|---|
| Model version | Pre-release internal version | Public release version, "tuned for chat/product use" |
| Compute resources | "Aggressive test-time compute" | Standard compute tiers |
| Problem set | 180 problems (frontiermath-2024-11-26) | 290 problems (frontiermath-2025-02-28) |
| Scaffolding | Internal advanced scaffold | Public API scaffold |
Epoch AI noted: "The difference between our results and OpenAI's might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time computing, or because those results were run on a different subset of FrontierMath"[8].
The o3 announcement also triggered scrutiny of the financial relationship between OpenAI and Epoch AI. On the same day o3 was announced (December 20, 2024), Epoch AI disclosed that OpenAI had funded the creation of FrontierMath. Several problems quickly emerged: the funding had not been disclosed until after the benchmark had been used to showcase o3, many contributing mathematicians had not been told of OpenAI's involvement, and OpenAI had been given access to a large portion of the problems and their solutions[12][13]:
The controversy drew criticism from multiple outlets. Fortune described it as "manipulative and disgraceful." TechCrunch reported that the benchmarking organization was "criticized for waiting to disclose funding from OpenAI." The incident raised broader questions about independence in AI benchmarking and the risks of conflicts of interest when AI companies fund the benchmarks used to evaluate their own models[12][13].
Epoch AI is primarily funded by Open Philanthropy, and the OpenAI funding for FrontierMath was a separate, project-specific arrangement[12].
On March 24, 2026, Epoch AI confirmed that GPT-5.4 Pro had produced a verified solution to a genuinely open mathematical problem on FrontierMath: a Ramsey-style problem on hypergraphs that had remained unsolved since it was posed by mathematicians Will Brian and Paul Larson in a 2019 paper. The solution was first elicited by researchers Kevin Barreto and Liam Price using GPT-5.4 Pro. Problem contributor Will Brian confirmed the solution's correctness, and a write-up is being prepared for publication[14].
This marked the first time an AI model produced a novel solution to an open problem on the FrontierMath benchmark. After the initial solve, several other frontier models also solved the same problem: Claude Opus 4.6 (max), Gemini 3.1 Pro, and GPT-5.4 (xhigh). The fact that multiple models could solve it suggests the problem sat at the boundary of current frontier model capabilities[14].
The Ramsey hypergraph result is part of a wider trend. Since Christmas 2025, 15 open mathematical problems have moved from unsolved to solved, with 11 of them (73%) credited to AI involvement. However, Epoch AI also noted that when GPT-5.4 Pro was evaluated on the full set of FrontierMath Open Problems, it "did not solve any problems" other than the Ramsey one, and its novel observations on one other problem were "of a form that the author had anticipated and characterized as relatively uninteresting"[14].
While most problems remain private to prevent contamination, the original paper includes five public sample problems at varying difficulty levels[4]:
| Problem | Author | Difficulty | Field (MSC) | Key techniques | Creativity (hours) | Execution (hours) | Answer |
|---|---|---|---|---|---|---|---|
| Testing Artin's primitive root conjecture | O. Jarviniemi | Research level | Number theory (11) | Frobenius elements, Artin symbols | 4 | 15 | 3,677,073 |
| Find degree 19 polynomial | A. Kite | Research level | Algebraic geometry (14), Group theory (20), Number theory (11) | Monodromy, branch loci | 3 | 4 | 1,876,572,071,974,094,803,391,179 |
| Prime field continuous extensions | D. Chicharro | Graduate level | Number theory (11) | p-adic analysis, recurrences | 3 | 3 | 9,811 |
| Coxeter group problem | P. Enugandla | Graduate level | Group theory (20) | Coxeter groups, characters | 2 | 3 | (not disclosed in sample) |
| Algebraic geometry/number theory problem | A. Gunning | Undergraduate level | Algebraic geometry (14), Number theory (11) | Hasse-Weil bound | 2 | 2 | (not disclosed in sample) |
These samples illustrate several features of the benchmark: answers are large, non-obvious integers (making them guessproof); problems span multiple mathematical fields; and even the "easiest" problem requires two hours of creative work from an expert.
| Benchmark | AI performance (approximate best) | Typical problem level | Typical solving time (human) | Primary limitation |
|---|---|---|---|---|
| GSM8K | >95% | Grade school | Minutes | Saturated since 2024 |
| MATH | >90% | High school/competition | 30 minutes | Saturated; data contamination risk |
| AIME | 70-90% | Competition mathematics | Hours | Approaching saturation |
| MMLU (math subset) | >85% | Mixed undergraduate | Varies | Not math-specific |
| FrontierMath (Tiers 1-3) | ~50% | Undergraduate to postdoc | Hours to days | Still challenging; roughly half unsolved |
| FrontierMath (Tier 4) | ~38% | Research level | Days to weeks | Very few models score above single digits |
| FrontierMath (Open Problems) | 1 problem solved | Unsolved research | Weeks to years | Virtually all problems remain unsolved |
| Feature | FrontierMath | Typical math benchmarks |
|---|---|---|
| Problem source | Original, unpublished, created by active researchers | Often drawn from textbooks, competitions, or publicly available problem sets |
| Answer verification | Fully automated via Python/SymPy | Often requires human grading or proof checking |
| Data contamination risk | Minimal (private problem set, encrypted distribution) | High (problems publicly available, may appear in training data) |
| Difficulty range | Undergraduate through active research | Typically grade school through undergraduate |
| Time investment per problem | Hours to days for experts | Minutes to hours |
| Multi-domain integration | 44% of problems involve multiple mathematical fields | Most problems stay within a single topic |
| Name | Fields Medal year | Role |
|---|---|---|
| Terence Tao | 2006 | Problem creation, review, and interview |
| Timothy Gowers | 1998 | Problem review and interview |
| Richard Borcherds | 1998 | Problem review and interview |
| Name | Role | Background |
|---|---|---|
| Elliot Glazer | Lead mathematician | Ph.D. in mathematics from Harvard (set theory under Hugh Woodin) |
| Tamay Besiroglu | Associate director, Epoch AI | Previously at MIT Future Tech Lab; led strategy for Metaculus |
| Ege Erdil | Core contributor | Epoch AI researcher |
| Evan Chen | Co-author and contributor | IMO coach, mathematics educator |
Over 60 mathematicians from leading institutions contributed, including researchers from MIT, Harvard, Princeton, Stanford, Cambridge, Oxford, Cornell, UC Berkeley, King's College London, the University of Leicester, the University of Siegen, ICMC USP (Brazil), and Bristol University, among others.
Models interact with a Python environment where they can write and execute code, test hypotheses, and submit answers. A simplified conceptual overview of the evaluation framework:
```python
# Conceptual evaluation framework (simplified). PythonEnvironment and the
# model.* methods stand in for Epoch AI's internal sandbox and model interface.
class FrontierMathEvaluator:
    def evaluate_model(self, model, problem, max_attempts=10):
        environment = PythonEnvironment()  # sandboxed interactive Python session
        for attempt in range(max_attempts):
            # The model writes code conditioned on the problem and the session state so far
            code = model.generate_code(problem, environment.state)
            result = environment.execute(code)  # run the code and capture the result
            # The model may inspect intermediate results before committing to an answer
            if model.verify_answer(result, problem):
                # Automated check against the known answer (exact, symbolic, or numeric)
                return self.check_solution(result, problem.answer)
        return False
```
| Access level | Description | How to obtain |
|---|---|---|
| Public samples | Small set of example problems with full solutions | Free access via epoch.ai/frontiermath |
| Open Problems verifiers | Solution verifiers for the 14 open problems | Partnership with Epoch AI (math@epoch.ai); uniform access fee |
| Research evaluation | Full benchmark evaluation on the private set | Contact math_evals@epoch.ai |
| Commercial evaluation | Model testing service | Partnership with Epoch AI |
| Problem contribution | Submit new problems for inclusion | Expert mathematician credentials required |
FrontierMath's development has been supported by OpenAI, which funded the benchmark's creation, and by Epoch AI's primary funder, Open Philanthropy[12]. Ongoing and planned development work includes:
| Initiative | Description | Status |
|---|---|---|
| Problem expansion | Adding new problems to Tiers 1-4 | Ongoing; quarterly updates |
| Domain coverage | Expanding to additional mathematical fields | 2025-2026 |
| Tier 4 updates | Bug fixes and grader corrections (version bumped to 1.1.4 in 2026) | Ongoing |
| Open Problems growth | Expanding beyond the 14-problem pilot set | Planning stage |
| Verification improvements | Refining automated checking methods | Continuous |
FrontierMath has had a measurable effect on the AI research community since its release.
The trajectory from under 2% (November 2024) to roughly 50% (March 2026) on Tiers 1-3 is one of the fastest rates of improvement on any major AI benchmark. Yet the benchmark remains far from saturated. Tier 4 scores remain below 40% for the best model and in single digits for most, and virtually all Open Problems remain unsolved[1][5][14].
| Limitation | Description | Mitigation |
|---|---|---|
| Limited public access | Most problems are private to preserve benchmark integrity | Necessary trade-off; sample problems are publicly available |
| Narrow scope | Only tests mathematical problem-solving; does not assess proof writing, mathematical intuition, or pedagogical ability | Complements other benchmarks |
| English only | All problems are written in English | Future multilingual expansion is planned |
| Computational bias | Problems must have automatically verifiable answers, excluding proof-based and open-ended mathematical reasoning | Acknowledged limitation of the automated verification approach |
| Estimated error rate | Roughly 10% of problems may contain errors based on review sampling | Ongoing review and correction process |
Several criticisms have been raised since the benchmark's launch: