Automated Theorem Proving

AI for Science

12 min read

Updated Jun 24, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 24, 2026

Fact-checked

In review queue

Sources

14 citations

Revision

v2 · 2,458 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Automated theorem proving (ATP) is the use of computer programs to find or check mathematical proofs, expressing both the statement and the proof in a formal_logic so that correctness can be decided mechanically rather than by expert judgment. The field began in 1956 with the Logic Theorist, the first artificial-intelligence program, and reached a public milestone in 2024 when Google DeepMind's alphaproof and AlphaGeometry 2 scored 28 of 42 points at the international_mathematical_olympiad (IMO), a silver-medal standard and the first time a machine reached medal level on the world's hardest mathematics competition ^[7]. In July 2025 an advanced version of Gemini with Deep Think went further, earning an officially certified gold medal with 35 of 42 points by solving five of the six problems in natural language within the 4.5-hour contest limit ^[9].

The field spans two broad traditions that have converged in the 2020s: classical symbolic systems that search for proofs using formal logic, and machine-learning systems that guide or generate proofs using neural networks. A closely related discipline, interactive_theorem_proving, keeps a human mathematician in the loop while a machine checks every step. ATP underpins formal_verification of hardware and software, sits at the center of ai_for_mathematics, and since 2024 a series of high-profile IMO results has made it one of the most visible benchmarks for the reasoning ability of AI systems.

What is automated theorem proving?

Automated theorem proving asks a precise question: given a set of axioms and a candidate statement, can a machine produce a proof that the statement follows from the axioms? Two things separate ATP from informal mathematics. First, proofs are expressed in a formal_logic so that correctness can be decided mechanically rather than by expert judgment. Second, the work splits between proof search (finding a proof, which is computationally hard) and proof checking (verifying a finished proof, which is comparatively cheap and trustworthy).

A persistent challenge is autoformalization: translating mathematics written in natural language into a formal language a machine can reason about. The inverse problem, producing human-readable output from a formal proof, matters for whether results are intelligible to mathematicians. These translation gaps explain why some 2025 milestones used fully formal proofs while others worked in natural language, a distinction discussed below.

How did classical theorem proving begin?

Classical ATP grew out of mathematical logic in the mid-20th century. The first program is widely regarded to be the Logic Theorist, written in 1955-1956 by Allen Newell, Herbert A. Simon, and J. Clifford Shaw at the RAND Corporation. Designed to prove theorems in the propositional calculus of Whitehead and Russell's Principia Mathematica, it proved 38 of the first 52 theorems in chapter two and found a shorter proof for one of them, demonstrating that a machine could carry out non-numerical reasoning ^[11].

A foundational technique is resolution, a single inference rule for first-order logic introduced by John Alan Robinson in his 1965 paper "A Machine-Oriented Logic Based on the Resolution Principle," published in the Journal of the ACM, which made systematic automated deduction practical ^[12]. Early systems at Argonne National Laboratory, notably the prover otter developed by william_mccune, applied resolution and equational reasoning to open problems in algebra and logic. Modern descendants are highly engineered saturation-based provers for first-order logic: Vampire, which has won at least one division of the CASC theorem-proving competition every year since 1999 for a record number of titles, and the E prover are the standard reference systems ^[13].

The most celebrated classical result is the proof of the Robbins conjecture, a question in boolean_algebra about whether all Robbins algebras are Boolean. The problem had resisted mathematicians for roughly six decades. On 10 October 1996, McCune's prover EQP, a specialized equational reasoning program, found a proof after about eight days of search on an IBM RS/6000 workstation ^[1]. The machine-generated proof was correct but cryptic, and mathematicians subsequently worked to make it human-readable.

The earliest landmark in computer-assisted mathematics was the four color theorem, the statement that any map can be colored with four colors so that no two adjacent regions share a color. On 21 June 1976, Kenneth Appel and Wolfgang Haken at the University of Illinois announced a proof that reduced the problem to checking 1,834 configurations by computer, work that required more than a month of machine time; it was the first major theorem to be proved with essential use of a computer, and its reliance on uncheckable computation provoked lasting debate about what counts as a proof ^[14].

A second major branch is the satisfiability family. SAT solvers decide whether a Boolean formula can be made true, and despite the problem being NP-complete, modern solvers handle instances with millions of variables. smt solvers (satisfiability modulo theories) extend this to richer domains such as arithmetic and arrays. SAT and SMT solvers are workhorses of formal_verification in industry, used to check microprocessors, compilers, and security-critical software, and they sit underneath many higher-level proof tools.

What are interactive theorem provers?

Interactive theorem provers, also called proof assistants, take a different approach from fully automatic search. A human writes the proof and the software checks each step against a small, trusted logical kernel, while automating routine subgoals. Because the kernel is the only component that must be trusted, these systems offer very high assurance. The major proof assistants include coq (renamed Rocq in 2025), isabelle, and lean.

Lean, created by Leonardo de Moura, has driven a surge of interest in formalized mathematics. Its community-maintained library mathlib is a large, unified formalization of undergraduate and graduate mathematics. The Fields Medalist terence_tao has been a prominent advocate. In late 2023 he led a collaborative project, with Yael Dillies and Bhavik Mehta, to formalize the proof of the polynomial Freiman-Ruzsa conjecture in Lean 4; the formalization succeeded in roughly three weeks and was greatly aided by Patrick Massot's Blueprint tool, which links a human-readable proof plan to the formal code ^[2]. Landmark formalizations in proof assistants include the four color theorem (re-verified end to end by Georges Gonthier in Coq in 2005) and the Feit-Thompson odd-order theorem (also in Coq) and the Kepler conjecture (in HOL Light and Isabelle).

How do neural theorem provers work?

Neural theorem proving applies machine learning to proof search. A recurring sub-task is premise selection: choosing, from a large library, the lemmas likely to be useful for a given goal. Language models attack the harder problem of generating proof steps directly.

An influential early system was gpt_f, introduced by OpenAI researchers in 2020. Built on a GPT-style decoder transformer and targeting the Metamath formalization language, GPT-f closed about 56 percent of proofs in a held-out test set, a substantial improvement over prior methods, and it contributed several new shortened proofs that were accepted into the main Metamath library ^[3]. Subsequent research combined large_language_model systems with reinforcement_learning and with formal checkers such as Lean, so that the verifier supplies a reliable reward signal during training. Prominent formal-proving models built on Lean 4 include:

deepseek_prover: DeepSeek-Prover-V2, released in April 2025 and built on DeepSeek-V3, decomposes problems into informal chain-of-thought reasoning and formal subgoals. Its 671-billion-parameter version reported an 88.9 percent pass rate on the miniF2F benchmark and solved 49 of 658 problems on PutnamBench ^[4].
kimina_prover: developed by Project Numina with Moonshot AI's Kimi team, Kimina-Prover used a reinforcement-learning pipeline and reported state-of-the-art results on miniF2F. Its developers highlighted that performance scaled with model size, a trend not previously observed for neural theorem provers ^[5].

These systems work in formal mathematics, meaning every output is machine-checked, which contrasts with general-purpose chatbots that produce natural-language arguments that may contain subtle errors.

When did AI reach medal level at the IMO?

The IMO became the field's flagship test in 2024 and 2025. The table below summarizes the principal results; the gold threshold at the IMO is typically around 35 of 42 points and varies by year.

Date	System	Developer	Result	Approach
Jan 2024	alphageometry	Google DeepMind	25/30 on IMO-AG-30 geometry set	Neuro-symbolic
Jul 2024	alphaproof + AlphaGeometry 2	Google DeepMind	28/42, silver-medal level (IMO 2024)	Formal (Lean), neuro-symbolic
Jul 2025	Gemini Deep Think	Google DeepMind	35/42, gold (officially graded)	Natural language
Jul 2025	Unnamed reasoning model	OpenAI	35/42, gold-level (self-graded)	Natural language
Jul 2025	Aristotle	Harmonic	5/6 problems, gold-level	Formal (Lean)

alphageometry, described in Nature on 17 January 2024, is a neuro-symbolic system that pairs a language model trained on synthetic data with a symbolic deduction engine. It solved 25 of 30 problems on a benchmark of past Olympiad geometry problems, up from 10 for the previous best system and approaching the average human gold medalist's score of 25.9 ^[6].

In July 2024, Google DeepMind reported that the combination of alphaproof and AlphaGeometry 2 reached a silver-medal score of 28 out of 42 on the 2024 IMO problems, one point below that year's gold cutoff of 29 (which 58 of 609 contestants met). AlphaProof, an AlphaZero-inspired agent that learns to prove statements in Lean by training on millions of auto-formalized problems, solved two algebra problems and one number theory problem (including the contest's hardest, solved by only five human contestants), while AlphaGeometry 2 solved the geometry problem. Solutions were graded by mathematicians and earned full marks on every problem solved, though the system solved one problem within minutes and took up to three days on others, well beyond the human time limit ^[7]. The mathematician Timothy Gowers, who helped verify the output, commented: "The fact that the program can come up with a non-obvious construction like this is very impressive, and well beyond what I thought was state of the art" ^[7]. This work was published in Nature in November 2025 ^[7].

The 2025 IMO, held in July in Australia, produced gold-level performances from three systems using two different paradigms, and the announcements became a point of public dispute ^[8]^[9]:

Google DeepMind entered an advanced version of Gemini with Deep Think, a general-purpose natural-language model. It scored 35 out of 42, solving five of six problems, working end to end in natural language from the official problem statements within the 4.5-hour limit. DeepMind's solutions were officially graded and certified by the IMO, and the company says it delayed its announcement at the organizers' request so as not to overshadow the human contestants. IMO President Gregor Dolinar confirmed the result, stating: "We can confirm that Google DeepMind has reached the much-desired milestone, earning 35 out of a possible 42 points. Their solutions were astonishing in many respects. IMO graders found them to be clear, precise and most of them easy to follow" ^[9].
OpenAI evaluated an unnamed experimental reasoning model on the same problems under self-imposed contest conditions (two sessions of 4.5 hours, no internet or external tools) and reported the same 35-point, five-of-six result. OpenAI was not an official entrant; it did not submit its answers to the IMO and instead engaged its own graders, and it published its results earlier than DeepMind. This drew criticism, with commentators and IMO-affiliated figures noting that OpenAI had communicated little with the organizers and announced before the celebration of human contestants had concluded. The score and methodology were therefore self-reported rather than IMO-certified.
Harmonic, an AI startup, reported that its model aristotle also reached gold-level performance, solving five of six problems. Unlike the DeepMind and OpenAI systems, Aristotle worked in fully formal mathematics: it translated the problems into Lean 4 and produced machine-checkable proofs, which Harmonic published, removing the need for human grading of those solutions ^[10].

The IMO president noted at the time that organizers could not independently verify how much compute the AI systems used or rule out human involvement, a caution that applies most directly to the unofficial, self-graded entries. A reported claim by Harmonic that Aristotle produced a Lean-verified solution to a long-open Erdos problem appeared in its later technical writing and is a separate result from the IMO performance.

Why does formal verification matter?

The convergence of classical and neural methods has made formal proof a practical tool and a leading benchmark for machine reasoning. Because formally verified proofs are machine-checked, results such as Harmonic's are trustworthy in a way that natural-language outputs are not, which is why several researchers, including Terence Tao, argue that pairing language models with proof assistants is a path to reliable mathematical AI. The natural-language gold-level results from Gemini Deep Think and OpenAI's model demonstrated strong reasoning but rely on human or external grading, leaving the formal-versus-informal distinction central to how such claims should be interpreted.

Beyond competition mathematics, the same techniques feed back into formal_verification of software and hardware, and into the broader project of building large, machine-checked libraries of mathematics such as mathlib. Open questions remain in autoformalization, in producing human-intelligible proofs, in scaling to research-level mathematics rather than Olympiad puzzles, and in independently verifying the compute and conditions behind headline results.

References

Bill McCune. "Solution of the Robbins Problem." Journal of Automated Reasoning, 1997. Argonne National Laboratory. https://www.cs.unm.edu/~mccune/papers/robbins/ ↩
Terence Tao. "Formalizing the proof of PFR in Lean4 using Blueprint: a short tour." What's New, 18 November 2023. https://terrytao.wordpress.com/2023/11/18/formalizing-the-proof-of-pfr-in-lean4-using-blueprint-a-short-tour/ ↩
Stanislas Polu and Ilya Sutskever. "Generative Language Modeling for Automated Theorem Proving." arXiv:2009.03393, 2020. https://arxiv.org/abs/2009.03393 ↩
DeepSeek-AI. "DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition." arXiv:2504.21801, 2025. https://arxiv.org/pdf/2504.21801 ↩
Project Numina and Kimi Team. "Kimina-Prover Preview: Towards Large Formal Reasoning Models." arXiv:2504.11354, 2025. https://huggingface.co/blog/AI-MO/kimina-prover ↩
Trinh, Wu, Le, He, Luong. "Solving olympiad geometry without human demonstrations." Nature, 17 January 2024. https://www.nature.com/articles/s41586-023-06747-5 ↩
Google DeepMind. "AI achieves silver-medal standard solving International Mathematical Olympiad problems," 25 July 2024; and "Olympiad-level formal mathematical reasoning with reinforcement learning," Nature, November 2025. https://deepmind.google/blog/ai-solves-imo-problems-at-silver-medal-level/ ↩
"Did Google DeepMind or OpenAI win gold at the world's most prestigious math competition?" Euronews, 22 July 2025. https://www.euronews.com/next/2025/07/22/did-google-deepmind-or-openai-win-gold-at-the-worlds-most-prestigious-math-competition ↩
Google DeepMind. "Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the International Mathematical Olympiad," 21 July 2025. https://deepmind.google/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/ ↩
The Harmonic Team. "Aristotle: IMO-level Automated Theorem Proving." arXiv:2510.01346, 2025. https://arxiv.org/pdf/2510.01346 ↩
Allen Newell and Herbert A. Simon. "The Logic Theory Machine: A Complex Information Processing System." RAND Corporation, 1956; see also Stanford Encyclopedia of Philosophy, "The History of AI." https://plato.stanford.edu/entries/computing-history/ ↩
J. A. Robinson. "A Machine-Oriented Logic Based on the Resolution Principle." Journal of the ACM, vol. 12, no. 1, 1965, pp. 23-41. https://dl.acm.org/doi/10.1145/321250.321253 ↩
Laura Kovacs and Andrei Voronkov. "First-Order Theorem Proving and Vampire." Computer Aided Verification (CAV), 2013; CASC competition results. https://vprover.github.io/ ↩
Kenneth Appel and Wolfgang Haken. "Every Planar Map is Four Colorable." University of Illinois, announced 21 June 1976. https://en.wikipedia.org/wiki/Four_color_theorem ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

1 revision by 1 contributors · full history

Suggest edit

What links here

AlphaGeometry Christian Szegedy FrontierMath Harmonic (AI)Lean (Theorem Prover)Math Inc and Gauss Mathlib NuminaMath PutnamBench Stanislas Polu

What is automated theorem proving?

How did classical theorem proving begin?

What are interactive theorem provers?

How do neural theorem provers work?

When did AI reach medal level at the IMO?

Why does formal verification matter?

References

Improve this article

Related Articles

Academic Research

GeoAI

Science ChatGPT Plugins

Space

Weather

Weather ChatGPT Plugins

What links here

Related Articles

Academic Research

GeoAI

Science ChatGPT Plugins

Space

Weather

Weather ChatGPT Plugins

What links here