FormulaOne

AI Benchmarks Model Evaluation

8 min read

Updated Jun 8, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 8, 2026

Fact-checked

In review queue

Sources

1 citation

Revision

v1 · 1,508 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Overview

FormulaOne is an AI benchmark designed to measure the depth of algorithmic reasoning in frontier large language models, introduced in the July 2025 paper "FormulaOne: Measuring the Depth of Algorithmic Reasoning Beyond Competitive Programming" ^[1]. The benchmark sits at the intersection of graph theory, formal logic, and algorithm design. It asks a model to produce a correct and efficient dynamic programming algorithm for graph problems that are framed in Monadic Second-Order (MSO) logic and posed over "tree-like" graphs of bounded treewidth. Crucially, these problems are provably tractable, yet they remain extraordinarily difficult to solve in practice ^[1].

The headline result is stark: state-of-the-art reasoning models, including OpenAI o3, solve well under 1 percent of the hard problems, with one frontier system scoring exactly 0 percent, even though the task domain (algorithmic coding) is one in which these models otherwise excel ^[1]. The authors argue that this exposes a large gap between competitive programming skill, where models such as o3 now rate among elite human competitors, and the genuine algorithmic depth required for research-grade problems.

The paper was authored by a team that includes Gal Beniamini, Yuval Dor, Alon Vinnikov, Shir Granot Peled, Or Weinstein, Or Sharir, Noam Wies, Tomer Nussbaum, Ido Ben Shaul, Tomer Zekharya, Yoav Levine, Shai Shalev-Shwartz, and Amnon Shashua, listed under the group name "AAI" ^[1].

What FormulaOne tests

FormulaOne is explicitly positioned as a test of deep "in-distribution" reasoning rather than out-of-distribution generalization. The authors note that algorithmic coding is the "bread and butter" of modern reasoning models' training, so frontier systems, "by rights," should perform well ^[1]. The benchmark instead probes whether a model can carry out the long, intertwined chain of mathematical reasoning that a strong human expert would need.

The motivation comes from a perceived mismatch in existing evaluations. The paper notes that OpenAI o3 reportedly attained a rating of 2,724 on Codeforces and ranked roughly 175th among all human competitors, and that top models have been rapidly approaching the performance ceiling on platforms like Codeforces, CodeElo, and LiveCodeBench ^[1]. The authors contend that the skills honed for these timed contests do not capture the reasoning needed for large-scale, real-world research problems such as optimizing global supply chains, managing power grids, or designing resilient network infrastructure, all of which they describe as orders of magnitude harder than typical competitive programming ^[1].

Each FormulaOne task gives the model a graph, a tree decomposition of that graph, and a weight function, and asks it to write code that computes a Weighted Model Counting (WMC) objective: the sum of the weights of all vertex (or edge) subsets satisfying a given graph property, reported modulo the prime 10^9 + 7 ^[1]. The defining property of each problem is expressed as a logical formula, and the intended solution is a dynamic programming algorithm that runs over the supplied tree decomposition. The example problems in the paper include classical notions such as independent sets and dominating sets, as well as more unusual constraints such as "the induced subgraph contains no cycle of length four" ^[1].

Structure and background

Two datasets

FormulaOne is released as two datasets ^[1]:

Dataset	Size	Role
FormulaOne	120 problems	Core benchmark of challenging dynamic-programming problems
FormulaOne-Warmup	100 problems	Auxiliary set of simpler problems from the same distribution

The main FormulaOne set consists of 120 challenging dynamic programming problems that the authors describe as testing creativity, sophistication, and expert-level reasoning. FormulaOne-Warmup contains 100 simpler problems drawn from the same family, provided to support research and evaluation in this demanding setting ^[1]. The full corpus is released together with an evaluation framework, and the authors emphasize that this released set represents only a "fraction," or "the tip of the iceberg," of what their generation method can produce ^[1].

MSO logic, treewidth, and Courcelle's theorem

The theoretical foundation of FormulaOne is an algorithmic meta-theorem due to Bruno Courcelle (1990). Informally, Courcelle's theorem states that for every sufficiently "tree-like" graph, any property definable in Monadic Second-Order logic can be decided by an algorithm whose running time is linear in the size of the graph ^[1]. MSO logic is a formal language for stating properties of graphs in which one may quantify over individual vertices and edges and, going beyond first-order logic, over sets of vertices and sets of edges (the "second-order," "monadic" part). Each FormulaOne problem is defined by an MSO formula with a single free variable that stands for the input set to be counted ^[1].

"Tree-like" is made precise through the notion of treewidth, a measure of how close a graph is to being a tree. A tree decomposition organizes a graph's vertices into overlapping "bags" arranged in a tree, and bounded treewidth means every bag stays small. The supplied tree decomposition is the auxiliary structure that lets an efficient dynamic programming algorithm sweep over the graph one bag at a time. The paper notes that most of its formulas live in a subset of MSO known as MSO1, and that the problems are guaranteed to admit efficient solutions precisely because Courcelle's theorem applies ^[1].

This combination is what makes the benchmark unusual. Every problem, from the simplest to the most complex, is generated from the same family (MSO logic on graphs) and is theoretically guaranteed to have an efficient solution. The difficulty lies entirely in discovering that solution: designing the correct dynamic programming state, handling the combinatorial cases, and implementing it precisely. The authors report that exhibiting a full solution to even one such problem requires no fewer than 15 interdependent, highly complex reasoning steps, intertwined in non-obvious ways, and they conjecture that this reasoning depth is the main reason frontier models "flat-line" on the benchmark ^[1].

The semi-automatic, MSO-driven generation process is also presented as a path toward an essentially unbounded source of high-depth algorithmic problems for Reinforcement Learning with Verifiable Rewards (RLVR), since each generated problem has automatically checkable answers ^[1].

Results

The paper evaluates leading frontier reasoning models and reports that they "fail entirely" on FormulaOne. According to the paper, OpenAI o3 solves less than 1 percent of the questions even when given 10 attempts and explanatory few-shot examples ^[1]. The summary chart of frontier-model performance on the FormulaOne dataset reports the following success rates ^[1]:

Model	Setting	Success rate on FormulaOne
o3 (High)	10 attempts	0.8 percent
o3-Pro (High)	1 attempt	0.8 percent
Gemini 2.5 Pro	10 attempts	0.8 percent
Grok 4 Heavy	1 attempt	0 percent

These figures, all at or below roughly 0.8 percent, indicate that the strongest reasoning systems available at the time of writing essentially could not solve the hard FormulaOne problems, despite being given multiple attempts and worked examples ^[1]. The authors contrast this with the same models' strong, near-saturating performance on competitive-programming benchmarks, underscoring that the failure is specific to research-grade algorithmic depth rather than coding ability in general ^[1].

Significance

FormulaOne is framed by its authors as a "new ARC" for mathematical and algorithmic reasoning, by analogy with the Abstraction and Reasoning Corpus ^[1]. Where ARC tests fluid intelligence on tasks that are deliberately out-of-distribution, FormulaOne takes the complementary stance of probing deep interpolation: it shows that even inside a familiar domain, frontier models cannot yet perform the sustained multi-step reasoning required to solve the problems. The authors argue that this makes the near-total failure especially notable, and that it suggests competitive-programming benchmarks may overstate the reasoning ability of current models ^[1].

The benchmark also carries unusual theoretical weight. Many FormulaOne problems are connected to central conjectures in fine-grained complexity, in particular the Strong Exponential Time Hypothesis (SETH), which informally asserts that the brute-force algorithm for Boolean satisfiability is essentially optimal ^[1]. The paper notes that the best-known running times for a large portion of the dataset are believed to be optimal under SETH, parameterized by the treewidth of the input graph. A consequence the authors highlight is that if an AI system genuinely discovered a faster algorithm for one of these hard tasks, it would not merely solve a puzzle: it would effectively refute a central hypothesis in theoretical computer science ^[1].

Finally, the authors present FormulaOne as a foundation for next-generation training environments. Because the MSO-based generator can produce a virtually unbounded stream of problems that combine deep mathematical difficulty with guaranteed, automatically verifiable correctness, they position it as well-suited to building RLVR environments aimed at advancing algorithmic reasoning, in contrast to existing benchmarks that tend to be either conceptually shallow or fixed and limited in size ^[1].

References

Beniamini, G., Dor, Y., Vinnikov, A., Granot Peled, S., Weinstein, O., Sharir, O., Wies, N., Nussbaum, T., Ben Shaul, I., Zekharya, T., Levine, Y., Shalev-Shwartz, S., and Shashua, A. "FormulaOne: Measuring the Depth of Algorithmic Reasoning Beyond Competitive Programming." arXiv preprint arXiv:2507.13337, July 17, 2025. https://arxiv.org/abs/2507.13337 ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

Suggest edit

What links here

SWE-bench Verified Terminal-Bench

Overview

What FormulaOne tests

Structure and background

Two datasets

MSO logic, treewidth, and Courcelle's theorem

Results

Significance

References

Improve this article

Related Articles

Benchmark (AI)

MATH

SWE-bench Verified

WebArena

Agent evaluation

Terminal-Bench

What links here

Related Articles

Benchmark (AI)

MATH

SWE-bench Verified

WebArena

Agent evaluation

Terminal-Bench

What links here