Automated Theorem Proving
Last reviewed
Sources
14 citations
Review status
Source-backed
Revision
v2 ยท 2,458 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Sources
14 citations
Review status
Source-backed
Revision
v2 ยท 2,458 words
Add missing citations, update stale details, or suggest a clearer explanation.
Automated theorem proving (ATP) is the use of computer programs to find or check mathematical proofs, expressing both the statement and the proof in a formal_logic so that correctness can be decided mechanically rather than by expert judgment. The field began in 1956 with the Logic Theorist, the first artificial-intelligence program, and reached a public milestone in 2024 when Google DeepMind's alphaproof and AlphaGeometry 2 scored 28 of 42 points at the international_mathematical_olympiad (IMO), a silver-medal standard and the first time a machine reached medal level on the world's hardest mathematics competition [7]. In July 2025 an advanced version of Gemini with Deep Think went further, earning an officially certified gold medal with 35 of 42 points by solving five of the six problems in natural language within the 4.5-hour contest limit [9].
The field spans two broad traditions that have converged in the 2020s: classical symbolic systems that search for proofs using formal logic, and machine-learning systems that guide or generate proofs using neural networks. A closely related discipline, interactive_theorem_proving, keeps a human mathematician in the loop while a machine checks every step. ATP underpins formal_verification of hardware and software, sits at the center of ai_for_mathematics, and since 2024 a series of high-profile IMO results has made it one of the most visible benchmarks for the reasoning ability of AI systems.
Automated theorem proving asks a precise question: given a set of axioms and a candidate statement, can a machine produce a proof that the statement follows from the axioms? Two things separate ATP from informal mathematics. First, proofs are expressed in a formal_logic so that correctness can be decided mechanically rather than by expert judgment. Second, the work splits between proof search (finding a proof, which is computationally hard) and proof checking (verifying a finished proof, which is comparatively cheap and trustworthy).
A persistent challenge is autoformalization: translating mathematics written in natural language into a formal language a machine can reason about. The inverse problem, producing human-readable output from a formal proof, matters for whether results are intelligible to mathematicians. These translation gaps explain why some 2025 milestones used fully formal proofs while others worked in natural language, a distinction discussed below.
Classical ATP grew out of mathematical logic in the mid-20th century. The first program is widely regarded to be the Logic Theorist, written in 1955-1956 by Allen Newell, Herbert A. Simon, and J. Clifford Shaw at the RAND Corporation. Designed to prove theorems in the propositional calculus of Whitehead and Russell's Principia Mathematica, it proved 38 of the first 52 theorems in chapter two and found a shorter proof for one of them, demonstrating that a machine could carry out non-numerical reasoning [11].
A foundational technique is resolution, a single inference rule for first-order logic introduced by John Alan Robinson in his 1965 paper "A Machine-Oriented Logic Based on the Resolution Principle," published in the Journal of the ACM, which made systematic automated deduction practical [12]. Early systems at Argonne National Laboratory, notably the prover otter developed by william_mccune, applied resolution and equational reasoning to open problems in algebra and logic. Modern descendants are highly engineered saturation-based provers for first-order logic: Vampire, which has won at least one division of the CASC theorem-proving competition every year since 1999 for a record number of titles, and the E prover are the standard reference systems [13].
The most celebrated classical result is the proof of the Robbins conjecture, a question in boolean_algebra about whether all Robbins algebras are Boolean. The problem had resisted mathematicians for roughly six decades. On 10 October 1996, McCune's prover EQP, a specialized equational reasoning program, found a proof after about eight days of search on an IBM RS/6000 workstation [1]. The machine-generated proof was correct but cryptic, and mathematicians subsequently worked to make it human-readable.
The earliest landmark in computer-assisted mathematics was the four color theorem, the statement that any map can be colored with four colors so that no two adjacent regions share a color. On 21 June 1976, Kenneth Appel and Wolfgang Haken at the University of Illinois announced a proof that reduced the problem to checking 1,834 configurations by computer, work that required more than a month of machine time; it was the first major theorem to be proved with essential use of a computer, and its reliance on uncheckable computation provoked lasting debate about what counts as a proof [14].
A second major branch is the satisfiability family. SAT solvers decide whether a Boolean formula can be made true, and despite the problem being NP-complete, modern solvers handle instances with millions of variables. smt solvers (satisfiability modulo theories) extend this to richer domains such as arithmetic and arrays. SAT and SMT solvers are workhorses of formal_verification in industry, used to check microprocessors, compilers, and security-critical software, and they sit underneath many higher-level proof tools.
Interactive theorem provers, also called proof assistants, take a different approach from fully automatic search. A human writes the proof and the software checks each step against a small, trusted logical kernel, while automating routine subgoals. Because the kernel is the only component that must be trusted, these systems offer very high assurance. The major proof assistants include coq (renamed Rocq in 2025), isabelle, and lean.
Lean, created by Leonardo de Moura, has driven a surge of interest in formalized mathematics. Its community-maintained library mathlib is a large, unified formalization of undergraduate and graduate mathematics. The Fields Medalist terence_tao has been a prominent advocate. In late 2023 he led a collaborative project, with Yael Dillies and Bhavik Mehta, to formalize the proof of the polynomial Freiman-Ruzsa conjecture in Lean 4; the formalization succeeded in roughly three weeks and was greatly aided by Patrick Massot's Blueprint tool, which links a human-readable proof plan to the formal code [2]. Landmark formalizations in proof assistants include the four color theorem (re-verified end to end by Georges Gonthier in Coq in 2005) and the Feit-Thompson odd-order theorem (also in Coq) and the Kepler conjecture (in HOL Light and Isabelle).
Neural theorem proving applies machine learning to proof search. A recurring sub-task is premise selection: choosing, from a large library, the lemmas likely to be useful for a given goal. Language models attack the harder problem of generating proof steps directly.
An influential early system was gpt_f, introduced by OpenAI researchers in 2020. Built on a GPT-style decoder transformer and targeting the Metamath formalization language, GPT-f closed about 56 percent of proofs in a held-out test set, a substantial improvement over prior methods, and it contributed several new shortened proofs that were accepted into the main Metamath library [3]. Subsequent research combined large_language_model systems with reinforcement_learning and with formal checkers such as Lean, so that the verifier supplies a reliable reward signal during training. Prominent formal-proving models built on Lean 4 include:
These systems work in formal mathematics, meaning every output is machine-checked, which contrasts with general-purpose chatbots that produce natural-language arguments that may contain subtle errors.
The IMO became the field's flagship test in 2024 and 2025. The table below summarizes the principal results; the gold threshold at the IMO is typically around 35 of 42 points and varies by year.
| Date | System | Developer | Result | Approach |
|---|---|---|---|---|
| Jan 2024 | alphageometry | Google DeepMind | 25/30 on IMO-AG-30 geometry set | Neuro-symbolic |
| Jul 2024 | alphaproof + AlphaGeometry 2 | Google DeepMind | 28/42, silver-medal level (IMO 2024) | Formal (Lean), neuro-symbolic |
| Jul 2025 | Gemini Deep Think | Google DeepMind | 35/42, gold (officially graded) | Natural language |
| Jul 2025 | Unnamed reasoning model | OpenAI | 35/42, gold-level (self-graded) | Natural language |
| Jul 2025 | Aristotle | Harmonic | 5/6 problems, gold-level | Formal (Lean) |
alphageometry, described in Nature on 17 January 2024, is a neuro-symbolic system that pairs a language model trained on synthetic data with a symbolic deduction engine. It solved 25 of 30 problems on a benchmark of past Olympiad geometry problems, up from 10 for the previous best system and approaching the average human gold medalist's score of 25.9 [6].
In July 2024, Google DeepMind reported that the combination of alphaproof and AlphaGeometry 2 reached a silver-medal score of 28 out of 42 on the 2024 IMO problems, one point below that year's gold cutoff of 29 (which 58 of 609 contestants met). AlphaProof, an AlphaZero-inspired agent that learns to prove statements in Lean by training on millions of auto-formalized problems, solved two algebra problems and one number theory problem (including the contest's hardest, solved by only five human contestants), while AlphaGeometry 2 solved the geometry problem. Solutions were graded by mathematicians and earned full marks on every problem solved, though the system solved one problem within minutes and took up to three days on others, well beyond the human time limit [7]. The mathematician Timothy Gowers, who helped verify the output, commented: "The fact that the program can come up with a non-obvious construction like this is very impressive, and well beyond what I thought was state of the art" [7]. This work was published in Nature in November 2025 [7].
The 2025 IMO, held in July in Australia, produced gold-level performances from three systems using two different paradigms, and the announcements became a point of public dispute [8][9]:
The IMO president noted at the time that organizers could not independently verify how much compute the AI systems used or rule out human involvement, a caution that applies most directly to the unofficial, self-graded entries. A reported claim by Harmonic that Aristotle produced a Lean-verified solution to a long-open Erdos problem appeared in its later technical writing and is a separate result from the IMO performance.
The convergence of classical and neural methods has made formal proof a practical tool and a leading benchmark for machine reasoning. Because formally verified proofs are machine-checked, results such as Harmonic's are trustworthy in a way that natural-language outputs are not, which is why several researchers, including Terence Tao, argue that pairing language models with proof assistants is a path to reliable mathematical AI. The natural-language gold-level results from Gemini Deep Think and OpenAI's model demonstrated strong reasoning but rely on human or external grading, leaving the formal-versus-informal distinction central to how such claims should be interpreted.
Beyond competition mathematics, the same techniques feed back into formal_verification of software and hardware, and into the broader project of building large, machine-checked libraries of mathematics such as mathlib. Open questions remain in autoformalization, in producing human-intelligible proofs, in scaling to research-level mathematics rather than Olympiad puzzles, and in independently verifying the compute and conditions behind headline results.