AI gold medals at the 2025 IMO
Last reviewed
Jun 8, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 1,805 words
In July 2025, two leading artificial intelligence laboratories, OpenAI and Google DeepMind, independently reported that their large language models had achieved gold-medal-level performance at the 2025 International Mathematical Olympiad (IMO), the world's foremost mathematics competition for pre-university students. Each system solved 5 of the 6 problems for a score of 35 out of 42 points, exactly the gold-medal threshold at the 2025 event. Both produced full proofs in natural language under competition-like time limits, without the formal proof assistants or geometry-specific engines that had powered earlier systems. [1][2][3]
The achievement was widely described as a landmark in machine reasoning, because it was reached by general-purpose reasoning models rather than the specialized theorem-proving systems that had reached silver-medal level a year earlier. It was also accompanied by a public dispute over how and when the results were announced, and over the differing degrees of formal verification behind the two claims. Google DeepMind's result was officially graded and certified by the IMO's own coordinators under pre-agreed conditions; OpenAI announced its result independently, with grading by former IMO medalists rather than by the official jury. [2][3][4]
The International Mathematical Olympiad, first held in 1959, is the premier secondary-school mathematics competition. Contestants face six problems over two exam days, with 4.5 hours per day and three problems per day. Each problem is scored from 0 to 7 points, for a maximum of 42. The 66th IMO was hosted in the Sunshine Coast region of Queensland, Australia, with the overall event running from 10 to 20 July 2025; the two competition exams took place on 15 and 16 July. The 2025 edition drew 630 contestants from 110 countries. [5][6]
Gold was unusually hard to reach in 2025: the cutoff was set at 35 points, reported to be one of the highest gold thresholds in recent years. Of the 630 contestants, 67 received gold medals, 103 silver, and 145 bronze, and only five contestants achieved a perfect 42. Problem 6 was the hardest of the set and was left unsolved by a large majority of human competitors. [5][6]
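The scoring arithmetic can be sketched in a few lines of Python. This is a minimal illustration rather than an official scoring tool: the function names are hypothetical, and the per-problem breakdown used for the AI systems (full marks on five problems, zero on Problem 6) is an assumption inferred from the reported totals of five problems solved and 35 points.

```python
from typing import Sequence

NUM_PROBLEMS = 6             # six problems across the two exam days
MAX_POINTS_PER_PROBLEM = 7   # each problem is scored from 0 to 7
GOLD_CUTOFF_2025 = 35        # gold threshold stated in this article


def total_score(problem_scores: Sequence[int]) -> int:
    """Sum per-problem scores after checking they fit the IMO format."""
    if len(problem_scores) != NUM_PROBLEMS:
        raise ValueError(f"expected {NUM_PROBLEMS} scores")
    for score in problem_scores:
        if not 0 <= score <= MAX_POINTS_PER_PROBLEM:
            raise ValueError(f"score {score} is outside 0..{MAX_POINTS_PER_PROBLEM}")
    return sum(problem_scores)


def meets_gold(problem_scores: Sequence[int],
               gold_cutoff: int = GOLD_CUTOFF_2025) -> bool:
    """Return True if the total reaches the gold cutoff."""
    return total_score(problem_scores) >= gold_cutoff


# Assumed breakdown: full marks on Problems 1-5, nothing on Problem 6.
reported = [7, 7, 7, 7, 7, 0]
# Hypothetical variant with one point deducted on a solved problem.
one_point_lost = [7, 7, 7, 7, 6, 0]

print(total_score(reported), meets_gold(reported))
print(total_score(one_point_lost), meets_gold(one_point_lost))
```

Because 35 is exactly the cutoff, the hypothetical variant with a single deducted point no longer reaches gold in this sketch.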
Reaching gold-medal performance on full, previously unseen olympiad problems under competition-like conditions had been a long-standing goal in AI. Olympiad problems require multi-step creative reasoning and rigorous proof, so they cannot be solved by recalling patterns or memorized answers. In July 2024, Google DeepMind reported the first major milestone: a combined system of AlphaProof and AlphaGeometry 2 solved 4 of the 6 problems from the 2024 IMO for 28 points, a silver-medal-level score one point short of that year's gold threshold of 29. That result relied on specialized methods: problems were first manually translated into the formal language Lean by human experts, AlphaProof searched for formal proofs using reinforcement learning, and the system took up to three days on some problems rather than the 4.5-hour human limit. [1][2]
In July 2025 the picture changed in two important ways. First, the systems that reached gold were general-purpose reasoning models rather than dedicated provers. Second, they worked end to end in natural language, reading the official problem statements and writing human-readable proofs, without expert translation into a formal language. [2][3]
| | Google DeepMind | OpenAI |
|---|---|---|
| System | Advanced version of Gemini with "Deep Think" | Experimental reasoning model (unreleased) |
| Problems solved | 5 of 6 (Problem 6 unsolved) | 5 of 6 (Problem 6 unsolved) |
| Score | 35 / 42 (gold threshold) | 35 / 42 (gold threshold) |
| Output | Natural-language proofs | Natural-language proofs |
| Time limit | Within the 4.5-hour-per-paper contest window | Same time limits as human contestants, no tools or internet |
| Grading | Officially graded and certified by IMO coordinators | Graded by former IMO medalists, not officially certified |
| Announced | 21 July 2025 | Around 19 July 2025 |
Google DeepMind used an advanced version of its Gemini model running in a "Deep Think" mode that explores multiple candidate solution paths in parallel before committing to an answer. The company attributed the improvement over 2024 to a combination of reinforcement learning on multi-step reasoning and proof data, a curated corpus of high-quality mathematical solutions, and general guidance on olympiad problem-solving approaches. Crucially, the model received the problems in ordinary language and produced complete proofs in ordinary language within the competition time limit, a sharp contrast with the Lean-based AlphaProof pipeline of 2024. [2]
DeepMind's result was scored by the IMO's official coordinators, the same graders who assess human contestants, against the competition's confidential marking scheme. IMO President Gregor Dolinar confirmed the outcome, and the graders described the model's solutions as clear and precise. The result, 35 of 42 points, met the gold-medal standard. DeepMind announced the achievement on 21 July 2025. [2]
OpenAI reported the same headline score, 35 of 42, also solving five problems and failing only Problem 6. The work was carried out by a small OpenAI research team; researcher Alexander Wei announced the result in a thread on X around 19 July 2025, crediting colleagues including Sheryl Hsu and Noam Brown. OpenAI emphasized that the system was an experimental, general-purpose reasoning model rather than a math-specific tool, and that the result was driven by scaling test-time compute and reinforcement learning on hard-to-verify tasks. The model worked under the same time limits as human contestants, without tools, internet access, or external resources, and wrote proofs in natural language. OpenAI stated that the model was a research prototype and that it did not plan to release anything at that level of mathematical capability for several months. [3][4]
The principal difference was in verification: OpenAI did not participate in the IMO's official coordination process, and its solutions were graded by three former IMO medalists rather than by the competition's official jury using the confidential marking scheme. [3][4]
The two announcements, made within days of each other, triggered a public dispute on two fronts: the timing of OpenAI's announcement, and the legitimacy of an uncertified gold claim. Accounts of what was agreed differ between the parties. [3][4]
On timing, the IMO had requested that AI laboratories refrain from publicizing their results until after the competition's closing ceremony, so that the human medalists would receive their recognition first; some accounts describe a requested waiting period extending roughly a week beyond the ceremony. Google DeepMind said it had coordinated formally with the IMO and held its announcement until after the ceremony out of respect for that request, and DeepMind chief executive Demis Hassabis publicly criticized announcing before the agreed window. OpenAI's Noam Brown responded that OpenAI had not been party to any formal coordination agreement, that it had been asked only verbally by an IMO board member to wait until after the closing ceremony, and that it had honored that specific request. The factual core that is not disputed is that OpenAI published first and Google DeepMind published a few days later. [3][4]
On legitimacy, DeepMind's head of reasoning, Thang Luong, argued that an official medal claim requires evaluation against the IMO's confidential marking guidelines, and that without such evaluation no medal claim could be made; he noted that a single additional point deducted would change the outcome from gold to silver. OpenAI maintained that its grading by former IMO medalists was rigorous and that its score met the gold-medal threshold. Separately, the IMO did not establish formal rules for AI participation in 2025, and President Gregor Dolinar noted that the IMO could not validate the methods used, including the amount of compute, whether any human was involved, or whether the results could be reproduced. [3][4]
Mathematician Terence Tao cautioned more broadly that, without testing conditions standardized and disclosed in advance, AI and human results were difficult to compare. He observed that choices such as allowing extra time, providing curated training data, running many attempts in parallel, or submitting only the best of several runs could substantially change reported success rates, so headline comparisons to human contestants should be read with care. [3][4]
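The effect of multiple attempts can be made concrete with a toy calculation. The sketch below assumes, purely for illustration, that each attempt at a problem succeeds independently with probability `p` and that a correct attempt is always recognized and submitted. Neither assumption is claimed for any real system, and the function name and numbers are hypothetical.

```python
def best_of_n_success(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds.

    Assumes independent attempts with identical success probability p and
    a perfect selector that always picks a correct attempt if one exists.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - p) ** n


# Hypothetical single-attempt success rate of 20%.
for attempts in (1, 4, 10, 64):
    print(attempts, round(best_of_n_success(0.2, attempts), 3))
```

Under these assumptions, a system with a 20% single-attempt success rate would succeed about 89% of the time when given ten attempts, which is why the number of runs and the selection rule matter when reading headline scores.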
The 2025 results were significant for several reasons. The systems that reached gold were general-purpose reasoning models, not bespoke theorem provers, and they produced natural-language proofs directly from natural-language problems, eliminating the human-assisted formalization step that AlphaProof required in 2024. This suggested that frontier reasoning models had developed enough reliable, long-horizon proof ability to handle genuine olympiad mathematics, a domain long considered a stress test for machine reasoning. The achievement fit a broader 2024 to 2025 trajectory in which reasoning models, trained with reinforcement learning and given more inference-time compute, made rapid gains on hard problems. For Google DeepMind specifically, the result also represented a jump within a single year from a silver-level specialized system to a gold-level general one. [2][3]
Several caveats temper the milestone. Both systems failed Problem 6, the hardest problem, so neither solved the full paper. The improvement in raw score over 2024 was modest, from 28 to 35 points, even though the qualitative shift from formal, specialized methods to general natural-language reasoning was large. The two claims also rested on different evidential footings: DeepMind's was officially certified, while OpenAI's was independently graded but not certified by the IMO. And because both models were proprietary, neither result could be independently reproduced by outside parties, and the IMO explicitly declined to validate the underlying methods or compute. [3][4]
The 2025 IMO was not the only such demonstration that year. Other groups, including ByteDance and the startup Harmonic, reported strong results using formal, Lean-based approaches in the days following the competition. In September 2025, the trajectory continued at the International Collegiate Programming Contest (ICPC) World Finals, where an advanced version of Gemini reached gold-medal-level performance, solving 10 of the contest's 12 problems, and OpenAI reported a perfect score of 12 of 12 using a combination of GPT-5 and an experimental reasoning model. Together with the IMO results, these events marked 2025 as the year in which AI systems first matched or exceeded elite human performance at the most prestigious mathematics and programming olympiads. [2][7]