OlympiadBench
Last reviewed
Jun 8, 2026
Sources
3 citations
Review status
Source-backed
Revision
v1 · 1,479 words
OlympiadBench is an AI benchmark of 8,476 Olympiad-level mathematics and physics problems, designed to test the advanced scientific reasoning of large language models and large multimodal models. Introduced in February 2024 by researchers at Tsinghua University, Beihang University, and Wisdom Way AI Lab, the benchmark is bilingual (English and Chinese), spans both text-only and multimodal problems, and provides expert step-by-step solution annotations for every item. It was published at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024) under the title "OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems" [1][2].
OlympiadBench was created in response to a recurring problem in AI evaluation: as frontier models began saturating widely used math benchmarks such as GSM8K and MATH, those datasets became less able to distinguish strong models from weaker ones. By drawing on the hardest tier of human competition problems, OlympiadBench aimed to provide headroom that would remain challenging well after release. At launch, the best system tested, GPT-4V, reached only 17.97 percent overall, far below the performance of human Olympiad medalists [1][2].
The benchmark is built around problems from elite scientific competitions and high-stakes exams, including the International Mathematical Olympiad (IMO), the International Physics Olympiad (IPhO), the American Regions Mathematics League (ARML), and the mathematics and physics sections of the Chinese College Entrance Examination, commonly known as the Gaokao. These sources span a range of difficulty, from competition-level problems aimed at the strongest students to exam-level problems faced by a much broader population [1].
OlympiadBench differs from most contemporaneous math benchmarks in several ways. It is bilingual rather than English-only, it includes physics alongside mathematics, a majority of its problems are multimodal and require interpreting diagrams or figures, and it includes formal theorem-proving problems in addition to open-ended numerical or symbolic answers. Each problem carries an expert annotation describing the full step-by-step solution, which supports fine-grained error analysis of model reasoning rather than only scoring final answers [1][2].
OlympiadBench targets capabilities that simpler arithmetic or word-problem benchmarks do not exercise. Because problems are drawn from Olympiad competitions, solving them typically requires multi-step deductive reasoning, the application of advanced theorems, and creative problem-solving rather than pattern matching on familiar templates.
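Several axes make the benchmark hard: a majority of problems are multimodal and require interpreting diagrams or figures; physics appears alongside mathematics; problems are posed in Chinese as well as English; and some items require theorem proving rather than an open-ended numerical or symbolic answer.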
The authors also use the step-by-step annotations to perform a manual error analysis of GPT-4V, identifying recurrent failure modes that include hallucinations, knowledge omissions, and logical fallacies [1].
OlympiadBench contains 8,476 problems in its released form. The dataset is organized along several dimensions: subject (mathematics or physics), answer type (open-ended or theorem-proving), language (English or Chinese), and the presence or absence of an accompanying image. Mathematics problems make up the larger share of the dataset, and a majority of all problems are in Chinese, reflecting the heavy use of Chinese competition and Gaokao sources [1].
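The four organizing dimensions described above can be pictured as fields on a problem record. The following is a minimal sketch only: the class name `Problem`, its field names, the label strings, and the helper `select` are all hypothetical and are not taken from the released dataset or its evaluation code. The toy records are placeholders, not benchmark items.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Problem:
    # All field names and label values are hypothetical, for illustration only.
    subject: str       # "math" or "physics"
    answer_type: str   # "open_ended" or "theorem_proving"
    language: str      # "en" or "zh"
    has_image: bool    # True when the problem comes with a figure
    question: str
    solution: str      # expert step-by-step solution text


def select(
    problems: List[Problem],
    *,
    subject: Optional[str] = None,
    answer_type: Optional[str] = None,
    language: Optional[str] = None,
    has_image: Optional[bool] = None,
) -> List[Problem]:
    """Return the problems matching every dimension that was specified."""
    wanted = {
        "subject": subject,
        "answer_type": answer_type,
        "language": language,
        "has_image": has_image,
    }
    # Keep only the filters the caller actually set (False is a valid value).
    active = {k: v for k, v in wanted.items() if v is not None}
    return [p for p in problems if all(getattr(p, k) == v for k, v in active.items())]


# Placeholder records standing in for real benchmark items.
toy = [
    Problem("math", "open_ended", "en", False, "placeholder question A", "placeholder solution A"),
    Problem("physics", "open_ended", "zh", True, "placeholder question B", "placeholder solution B"),
    Problem("math", "theorem_proving", "zh", False, "placeholder question C", "placeholder solution C"),
]

text_only = select(toy, has_image=False)
```

Under these assumptions, `select(toy, has_image=False)` picks out the text-only slice, and combining keyword arguments narrows the selection along several dimensions at once.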
The table below summarizes the principal compositional figures reported for the benchmark. Some breakdown counts come from the dataset construction tables in the paper and describe the broader problem pool before final filtering, so they should be read as approximate proportions rather than exact partitions of the 8,476 released items [1].
| Attribute | Detail |
|---|---|
| Total problems (released) | 8,476 |
| Subjects | Mathematics and physics |
| Languages | English and Chinese (bilingual) |
| Problems with images | About 57 percent (multimodal) |
| Answer types | Open-ended (about 81 percent) and theorem-proving (about 19 percent) |
| Difficulty tiers | Competition-level and college-entrance-exam level |
| Annotations | Expert step-by-step solutions for every problem |
| Example sources | IMO, IPhO, ARML, Chinese Gaokao |
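The approximate shares in the table can be turned into rough item counts by simple multiplication. The sketch below does only that arithmetic; the resulting numbers are not reported in the paper, and because the percentages may describe the broader problem pool before final filtering, they should be treated as order-of-magnitude estimates. The variable and function names are illustrative.

```python
TOTAL_RELEASED = 8476  # released problem count stated in the article

# Approximate shares taken from the table above.
approx_shares = {
    "with_images": 0.57,
    "open_ended": 0.81,
    "theorem_proving": 0.19,
}


def approx_count(share: float, total: int = TOTAL_RELEASED) -> int:
    """Rough number of items implied by an approximate share of the total."""
    return round(share * total)


approx_counts = {name: approx_count(share) for name, share in approx_shares.items()}

for name, count in approx_counts.items():
    print(f"{name}: roughly {count} of {TOTAL_RELEASED}")
```

On this rough reading, a little under five thousand problems carry images and theorem-proving items number in the low thousands.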
The expert solution annotations are a defining feature. Rather than supplying only a final answer key, OlympiadBench records the intended reasoning chain, which the authors use both to grade open-ended answers and to diagnose where a model's reasoning diverges from a correct path. The dataset and evaluation code are released through the OpenBMB project on GitHub [1].
At release, OlympiadBench sharply separated strong and weak systems, and even the best model performed poorly in absolute terms. GPT-4V achieved an average score of 17.97 percent across the benchmark, with markedly lower performance on physics, at 10.74 percent, than on mathematics. The authors note that GPT-4V's average accuracy was more than five times that of the best open-source multimodal model they tested, the Yi-VL-34B model, underscoring how far open systems trailed proprietary ones on this task at the time [1][2].
Because not all problems require images, the paper also evaluates text-only language models on the text-only subset. These results indicate that strong math-specialized and general-purpose language models could outperform multimodal models on the portions of the benchmark that do not require visual reasoning, while still scoring far below human experts. The figures below reflect representative results reported in the paper; exact per-model numbers vary by subset and split [1].
| Model | Type | Approx. performance |
|---|---|---|
| GPT-4V | Multimodal | 17.97 percent average overall; 10.74 percent physics |
| Yi-VL-34B | Open multimodal | Less than one-fifth of GPT-4V's average |
| DeepSeekMath-RL | Text-only, math-specialized | Competitive on the text-only math subset |
| GPT-4 | Text-only | Higher than multimodal models on the text-only subset |
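Since per-model numbers vary by subset, it helps to see how a single set of graded answers yields different scores depending on the slice. The sketch below groups per-problem outcomes by one field and reports percent accuracy per group. It assumes each result is a dictionary with a boolean `correct` field plus grouping fields; the function name, field names, and toy outcomes are hypothetical, and the paper's own averaging scheme may differ.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Mapping


def accuracy_by_group(results: Iterable[Mapping], key: str) -> Dict[Hashable, float]:
    """Percent of correct answers within each value of `key`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        group = r[key]
        totals[group] += 1
        hits[group] += int(r["correct"])
    return {group: 100.0 * hits[group] / totals[group] for group in totals}


# Made-up outcomes for illustration; these are not results from the paper.
toy_results = [
    {"subject": "math", "has_image": False, "correct": True},
    {"subject": "math", "has_image": True, "correct": False},
    {"subject": "physics", "has_image": True, "correct": False},
    {"subject": "physics", "has_image": False, "correct": True},
    {"subject": "math", "has_image": False, "correct": False},
]

by_subject = accuracy_by_group(toy_results, "subject")
by_modality = accuracy_by_group(toy_results, "has_image")
```

The same five toy outcomes give different headline numbers when sliced by subject or by modality, which is why a score is only meaningful alongside the subset it was computed on.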
The headline takeaway is that no evaluated model approached competence on Olympiad-level scientific reasoning, and that multimodal, physics, and non-English problems were the most difficult categories. The authors frame this gap as evidence that the benchmark provides substantial headroom for measuring progress toward more general AI capabilities [1][2].
OlympiadBench arrived during a wave of efforts to build harder reasoning evaluations as earlier datasets approached saturation. Its distinctive combination of multimodality, bilingual coverage, physics, and theorem proving set it apart from text-only mathematics benchmarks released around the same period.
A closely related benchmark is Omni-MATH, introduced in October 2024, which assembles 4,428 competition-level mathematics problems with rigorous human annotation and organizes them into more than 33 sub-domains and over 10 difficulty levels. Whereas OlympiadBench spans both mathematics and physics and emphasizes multimodal problems, Omni-MATH focuses exclusively on text-only Olympiad mathematics in order to probe the boundaries of language-model mathematical reasoning in finer detail [3].
OlympiadBench is also frequently discussed alongside competition-based evaluations built from the American Invitational Mathematics Examination (AIME) and the broader push toward milestones such as solving International Mathematical Olympiad problems. The rapid rise of reasoning-focused models from late 2024, including systems trained with extended chain-of-thought such as OpenAI's o1, drove large gains on Olympiad-style mathematics. On Omni-MATH, for example, o1-preview and o1-mini scored above 50 percent, far higher than the sub-20-percent scores of the models first tested on OlympiadBench. Although the two sets of figures come from different benchmarks, the progression illustrates how quickly Olympiad-level problems went from being almost out of reach to being partially solved, and why new and harder evaluations continued to appear [3].
For the field of AI evaluation, OlympiadBench remains a reference point as one of the early large-scale, bilingual, multimodal scientific reasoning benchmarks. Its expert step-by-step annotations and explicit error taxonomy, covering hallucinations, knowledge omissions, and logical fallacies, helped establish a template for analyzing not just whether a model gets the right answer, but where and how its reasoning breaks down [1][2].