OlympiadBench

Updated Jun 8, 2026

OlympiadBench is an AI benchmark of 8,476 Olympiad-level mathematics and physics problems, designed to test the advanced scientific reasoning of large language models and large multimodal models. Introduced in February 2024 by researchers at Tsinghua University, Beihang University, and Wisdom Way AI Lab, the benchmark is bilingual (English and Chinese), spans both text-only and multimodal problems, and provides expert step-by-step solution annotations for every item. It was published at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024) under the title "OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems" ^[1]^[2].

OlympiadBench was created in response to a recurring problem in AI evaluation: as frontier models began saturating widely used math benchmarks such as GSM8K and MATH, those datasets became less able to distinguish strong models from weaker ones. By drawing on the hardest tier of human competition problems, OlympiadBench aimed to leave enough headroom to stay challenging well after release. At launch, the best system tested, GPT-4V, reached only 17.97 percent overall, far below the performance of human Olympiad medalists ^[1]^[2].

Overview

The benchmark is built around problems from elite scientific competitions and high-stakes exams, including the International Mathematical Olympiad (IMO), the International Physics Olympiad (IPhO), the American Regions Mathematics League (ARML), and the mathematics and physics sections of the Chinese College Entrance Examination, commonly known as the Gaokao. These sources span a range of difficulty, from competition-level problems aimed at the strongest students to exam-level problems faced by a much broader population ^[1].

OlympiadBench differs from most contemporaneous math benchmarks in several ways. It is bilingual rather than English-only, it includes physics alongside mathematics, a majority of its problems are multimodal and require interpreting diagrams or figures, and it includes theorem-proving problems in addition to problems with open-ended numerical or symbolic answers. Each problem carries an expert annotation describing the full step-by-step solution, which supports fine-grained error analysis of model reasoning rather than only scoring final answers ^[1]^[2].

What OlympiadBench tests

OlympiadBench targets capabilities that simpler arithmetic or word-problem benchmarks do not exercise. Because many of its problems are drawn from Olympiad competitions, solving them typically requires multi-step deductive reasoning, the application of advanced theorems, and creative problem-solving rather than pattern matching on familiar templates.

Several axes make the benchmark hard:

Multimodality. About 57 percent of the problems include images such as geometric figures, physics force diagrams, function plots, or circuit schematics, so a model must jointly reason over text and visual content. This is why the benchmark is positioned as a test for large multimodal models, not only text-based language models ^[1].
Physics reasoning. Physics problems require applying physical laws, tracking units, and building a quantitative model of a described system. In the paper, physics consistently proved harder for models than mathematics ^[1].
Theorem proving. A subset of problems asks the model to produce a proof rather than a final answer, which the authors evaluate separately. GPT-4V answered only 6 of 81 Chinese competition-level theorem-proving questions correctly, illustrating how far short models fell on rigorous deductive tasks ^[1].
Bilingual evaluation. The benchmark contains both English-language and Chinese-language problems, and the authors report that the Chinese problems generally posed greater difficulty for the models tested ^[1].

The authors also use the step-by-step annotations to perform a manual error analysis of GPT-4V, identifying recurrent failure modes that include hallucinations, knowledge omissions, and logical fallacies ^[1].

Structure and dataset

OlympiadBench contains 8,476 problems in its released form. The dataset is organized along several dimensions: subject (mathematics or physics), answer type (open-ended or theorem-proving), language (English or Chinese), and the presence or absence of an accompanying image. Mathematics problems make up the larger share of the dataset, and a majority of all problems are in Chinese, reflecting the heavy use of Chinese competition and Gaokao sources ^[1].

The table below summarizes the principal compositional figures reported for the benchmark. Some breakdown counts come from the dataset construction tables in the paper and describe the broader problem pool before final filtering, so they should be read as approximate proportions rather than exact partitions of the 8,476 released items ^[1].

Attribute	Detail
Total problems (released)	8,476
Subjects	Mathematics and physics
Languages	English and Chinese (bilingual)
Problems with images	About 57 percent (multimodal)
Answer types	Open-ended (about 81 percent) and theorem-proving (about 19 percent)
Difficulty tiers	Competition-level and college-entrance-exam level
Annotations	Expert step-by-step solutions for every problem
Example sources	IMO, IPhO, ARML, Chinese Gaokao

The expert solution annotations are a defining feature. Rather than supplying only a final answer key, OlympiadBench records the intended reasoning chain, which the authors use both to grade open-ended answers and to diagnose where a model's reasoning diverges from a correct path. The dataset and evaluation code are released through the OpenBMB project on GitHub ^[1].
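
The four organizing dimensions and the solution annotation can be pictured as fields on a single problem record. The sketch below is illustrative only: the class, field names, and toy records are hypothetical and do not reflect the schema of the released files. It shows how a subset such as the text-only problems could be selected under that assumed layout.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Problem:
    """One benchmark item. All field names are hypothetical, not the released schema."""
    subject: str                      # "math" or "physics"
    answer_type: str                  # "open_ended" or "theorem_proving"
    language: str                     # "en" or "zh"
    question: str
    solution_steps: List[str]         # expert step-by-step annotation
    image_path: Optional[str] = None  # None marks a text-only problem

    @property
    def is_multimodal(self) -> bool:
        return self.image_path is not None


def select(problems, *, subject=None, answer_type=None, language=None, multimodal=None):
    """Return the problems matching every dimension that is not None."""
    selected = []
    for p in problems:
        if subject is not None and p.subject != subject:
            continue
        if answer_type is not None and p.answer_type != answer_type:
            continue
        if language is not None and p.language != language:
            continue
        if multimodal is not None and p.is_multimodal != multimodal:
            continue
        selected.append(p)
    return selected


# Toy placeholder records, invented for illustration; not taken from the dataset.
SAMPLE = [
    Problem("math", "open_ended", "en", "Toy question A", ["step 1", "step 2"]),
    Problem("physics", "open_ended", "zh", "Toy question B", ["step 1"],
            image_path="figures/b.png"),
    Problem("math", "theorem_proving", "zh", "Toy question C", ["step 1"]),
]

text_only = select(SAMPLE, multimodal=False)
print([p.question for p in text_only])
```

Keeping the image as an optional field means a text-only subset is just a filter on the same records, rather than a separate dataset.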

Results

At release, OlympiadBench sharply separated strong and weak systems, and even the best model performed poorly in absolute terms. GPT-4V achieved an average score of 17.97 percent across the benchmark, with markedly lower performance on physics (10.74 percent) than on mathematics. The authors note that GPT-4V's average accuracy was more than five times that of the best open-source multimodal model they tested, Yi-VL-34B, underscoring how far open systems trailed proprietary ones on this task at the time ^[1]^[2].
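
The "more than five times" comparison implies an upper bound on the open model's score even without its exact figure. The short calculation below uses only the two GPT-4V numbers quoted above; the function and constant names are made up for this illustration.

```python
GPT4V_AVERAGE = 17.97  # percent, overall average quoted in the text
GPT4V_PHYSICS = 10.74  # percent, physics score quoted in the text
MIN_RATIO = 5          # "more than five times" the best open multimodal model


def upper_bound_for_open_model(closed_score: float, min_ratio: float) -> float:
    """If closed_score > min_ratio * open_score, then open_score < closed_score / min_ratio."""
    return closed_score / min_ratio


bound = upper_bound_for_open_model(GPT4V_AVERAGE, MIN_RATIO)
gap = GPT4V_AVERAGE - GPT4V_PHYSICS

print(f"Best open multimodal model averaged below {bound:.2f} percent")
print(f"Physics trailed the overall average by {gap:.2f} percentage points")
```

This is a bound derived from the stated ratio, not a reported score for Yi-VL-34B.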

Because not all problems require images, the paper also evaluates text-only language models on the text-only subset. These results indicate that strong math-specialized and general-purpose language models could outperform multimodal models on the portions of the benchmark that do not require visual reasoning, while still scoring far below human experts. The figures below reflect representative results reported in the paper; exact per-model numbers vary by subset and split ^[1].

Model	Type	Approx. performance
GPT-4V	Multimodal	17.97 percent average overall; 10.74 percent physics
Yi-VL-34B	Open multimodal	Less than one-fifth of GPT-4V's average
DeepSeekMath-RL	Text-only, math-specialized	Competitive on the text-only math subset
GPT-4	Text-only	Higher than multimodal models on the text-only subset

The headline takeaway is that no evaluated model approached competence on Olympiad-level scientific reasoning, and that multimodal, physics, and non-English problems were the most difficult categories. The authors frame this gap as evidence that the benchmark provides substantial headroom for measuring progress toward more general AI capabilities ^[1]^[2].
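
A per-category breakdown of this kind can be computed by grouping graded results along one dimension at a time. The sketch below assumes a simple per-problem record with hypothetical keys and uses invented toy outcomes; it is not the authors' evaluation code and the numbers are not benchmark results.

```python
from collections import defaultdict


def accuracy_by(results, key):
    """Percent of correct answers for each value of `key` across graded results."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in results:
        totals[r[key]] += 1
        correct[r[key]] += int(r["correct"])
    return {k: 100.0 * correct[k] / totals[k] for k in totals}


# Invented toy outcomes for illustration only; keys are hypothetical.
TOY_RESULTS = [
    {"subject": "math", "language": "en", "has_image": False, "correct": True},
    {"subject": "math", "language": "zh", "has_image": True, "correct": False},
    {"subject": "physics", "language": "en", "has_image": True, "correct": False},
    {"subject": "physics", "language": "zh", "has_image": False, "correct": True},
    {"subject": "math", "language": "en", "has_image": False, "correct": True},
    {"subject": "physics", "language": "zh", "has_image": True, "correct": False},
]

for dimension in ("subject", "language", "has_image"):
    print(dimension, accuracy_by(TOY_RESULTS, dimension))
```

Each grouping here pools all problems sharing a value (a micro-average); whether the paper's reported averages are pooled this way or averaged across subsets is not stated in this article, so treat that choice as an assumption.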

Significance and relationship to other math benchmarks

OlympiadBench arrived during a wave of efforts to build harder reasoning evaluations as earlier datasets approached saturation. Its distinctive combination of multimodality, bilingual coverage, physics, and theorem proving set it apart from text-only mathematics benchmarks released around the same period.

A closely related benchmark is Omni-MATH, introduced in October 2024, which assembles 4,428 competition-level mathematics problems with rigorous human annotation and organizes them into more than 33 sub-domains and over 10 difficulty levels. Whereas OlympiadBench spans both mathematics and physics and emphasizes multimodal problems, Omni-MATH focuses exclusively on text-only Olympiad mathematics in order to probe the boundaries of language-model mathematical reasoning in finer detail ^[3].

OlympiadBench is also frequently discussed alongside competition-based evaluations built from the American Invitational Mathematics Examination (AIME) and the broader push to reach milestones such as solving International Mathematical Olympiad problems. The rapid rise of reasoning-focused models after 2024, including systems trained with extended chain-of-thought such as OpenAI's o1, drove large gains on Olympiad-style mathematics. On Omni-MATH, for example, o1-preview and o1-mini reached accuracies above 50 percent, whereas the models first tested on OlympiadBench scored below 20 percent on that benchmark. The two benchmarks are not directly comparable, but the contrast illustrates how quickly Olympiad-level problems went from being almost out of reach to being partially solved, and why new and harder evaluations continued to appear ^[3].

For the field of AI evaluation, OlympiadBench remains a reference point as one of the early large-scale, bilingual, multimodal scientific reasoning benchmarks. Its expert step-by-step annotations and explicit error taxonomy, covering hallucinations, knowledge omissions, and logical fallacies, helped establish a template for analyzing not just whether a model gets the right answer, but where and how its reasoning breaks down ^[1]^[2].

References

1. He, Chaoqun; Luo, Renjie; Bai, Yuzhuo; Hu, Shengding; Thai, Zhen Leng; Shen, Junhao; Hu, Jinyi; Han, Xu; Huang, Yujie; Zhang, Yuxiang; Liu, Jie; Qi, Lei; Liu, Zhiyuan; Sun, Maosong. "OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems." arXiv preprint arXiv:2402.14008, February 2024. https://arxiv.org/abs/2402.14008
2. He, Chaoqun; et al. "OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems." Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, pp. 3828-3850. https://aclanthology.org/2024.acl-long.211/
3. Gao, Bofei; et al. "Omni-MATH: A Universal Olympiad Level Mathematic Benchmark for Large Language Models." arXiv preprint arXiv:2410.07985, October 2024. https://arxiv.org/abs/2410.07985
