AlphaProof is a formal mathematical reasoning system developed by Google DeepMind, announced on July 25, 2024. The system uses reinforcement learning to find and verify rigorous mathematical proofs inside the Lean 4 interactive theorem prover. When combined with AlphaGeometry 2 on the 2024 International Mathematical Olympiad (IMO), AlphaProof contributed to a combined score of 28 out of 42 points, the first time an AI system reached silver-medal level at the competition.
Unlike earlier AI math tools that generated plausible-sounding but unverified reasoning, AlphaProof operates entirely within a formal language where every step is machine-checked. This means its proofs cannot contain hidden errors; if the Lean verifier accepts a proof, it is correct by construction.
Mathematics has two distinct cultures. The informal culture, practiced by most working mathematicians, produces proofs written in natural language, with intuitive reasoning and occasional hand-waving over details that are considered obvious. The formal culture, supported by tools like Lean, Coq, and Isabelle, requires every inference to be spelled out in a logical language that a computer can verify from first principles.
Formal proofs carry a guarantee that informal proofs do not: they are correct or they do not compile. This property makes formal methods attractive for safety-critical software, hardware verification, and cryptographic protocols. For AI, the formal setting offers something equally valuable: an exact, binary reward signal. A proof step is either valid or invalid, with no need for a human judge.
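To make the binary signal concrete, here is a minimal Lean 4 example (illustrative only, not AlphaProof output). If the file compiles, the statement is proven; a single invalid step would be rejected outright rather than silently accepted.

```lean
-- `Nat.add_comm` is a standard library lemma; supplying it as the proof term
-- either type-checks, proving the theorem, or fails with a compile error.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```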
The obstacle has always been data. Informal mathematical writing is abundant, but human-authored formal proofs are scarce. Lean's main community library, Mathlib, contains over two million lines of formalized mathematics, which is impressive for a community project but small compared to the training corpora available in natural language domains. Systems trained only on existing Lean proofs quickly hit a ceiling.
Before AlphaProof, neural theorem provers could handle university-level lemmas but struggled with competition mathematics, which requires stringing together novel sequences of nontrivial steps without a clear roadmap. The 2024 IMO problems, in particular, sit near the frontier of what high school students can do with months of preparation, making them a meaningful benchmark for AI reasoning.
AlphaProof combines three components: a language model trained on formal mathematics, a reinforcement learning loop modeled on AlphaZero, and an autoformalization pipeline that converts natural-language problem statements into Lean.
At its core, AlphaProof uses an encoder-decoder language model trained on code and mathematical data. The pretraining uses a mixture of next-token prediction and span reconstruction tasks, along with dropout for regularization. The architecture is optimized for the Lean proof assistant environment, supporting large batch sizes needed for the scale of training. The model is not trained on general web text; its domain is specifically mathematical and formal content.
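The details of the pretraining objective are not fully public, but span reconstruction typically follows the T5-style recipe: mask contiguous spans with sentinel tokens and train the decoder to restore them. A minimal sketch under that assumption:

```python
# Toy T5-style span corruption; the exact objective AlphaProof used is an
# assumption here, and this snippet is illustrative rather than faithful.
import random

def corrupt_spans(tokens: list[str], span_len: int = 3, n_spans: int = 2):
    random.seed(0)
    inputs, targets = list(tokens), []
    for i in range(n_spans):
        start = random.randrange(0, len(inputs) - span_len)
        sentinel = f"<extra_{i}>"
        targets.append((sentinel, inputs[start:start + span_len]))
        inputs[start:start + span_len] = [sentinel]   # encoder sees the sentinel
    return inputs, targets

src = "theorem add_comm ( a b : Nat ) : a + b = b + a".split()
inputs, targets = corrupt_spans(src)
print(inputs)    # masked encoder input
print(targets)   # spans the decoder must reconstruct
```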
After pretraining, the model undergoes supervised fine-tuning on approximately 300,000 state-tactic pairs extracted from human-authored proofs in the Mathlib library, amounting to roughly five million tokens of tactics. This phase teaches the model the local tactics available in Lean and what kinds of moves tend to make progress on which kinds of goals.
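For illustration, a hypothetical pair of the kind this data contains (the actual extraction format is not public): the proof state just before a tactic, paired with the tactic itself.

```lean
example (n : Nat) : n + 0 = n := by
  -- Proof state at this point:  n : Nat ⊢ n + 0 = n
  -- A training pair would record that state as the input and `simp` as the
  -- target tactic; Mathlib proofs yield hundreds of thousands of such pairs.
  simp
```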
AlphaProof adapts the AlphaZero reinforcement learning algorithm, which DeepMind originally developed for board games. In the game setting, AlphaZero uses a neural network to evaluate positions and a Monte Carlo tree search (MCTS) to explore future moves; self-play generates training data without any human supervision. AlphaProof reframes proof search as a similar game: the state is the current proof goal, the moves are tactic applications, and a win occurs when the proof compiles.
The system interacts with a custom Lean 4 environment that executes tactics, handles proof states, and supports the operations needed for large-scale tree search. At each node in the search tree, AlphaProof's neural network evaluates candidate tactics and estimates the likelihood of eventual success. The MCTS guides exploration toward promising branches, with the network's predictions improving as more proofs are discovered.
A notable feature of the search is how it handles product nodes, which arise when a proof goal splits into multiple independent subgoals. When the tree passes through such a node, AlphaProof back-propagates the value of the hardest branch, since the overall proof succeeds only when every subgoal is resolved. Proven subgoals are marked and not revisited, allowing the search to concentrate effort on what remains unsolved.
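A minimal sketch of that backup rule, with all names hypothetical: tactic choices act as OR nodes (take the best child), while product nodes act as AND nodes (take the minimum, since the weakest subgoal bounds the whole proof).

```python
# Toy AND/OR value backup for proof search; illustrative, not AlphaProof's code.
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                    # "or": choose one tactic; "and": close all subgoals
    value: float = 0.0           # network's estimate of eventual success
    proven: bool = False         # proven subgoals are marked and not revisited
    children: list = field(default_factory=list)

def backup(node: Node) -> float:
    if node.proven:
        return 1.0
    if not node.children:
        return node.value
    child_values = [backup(c) for c in node.children if not c.proven]
    if not child_values:         # every subgoal already closed
        node.proven = True
        return 1.0
    node.value = max(child_values) if node.kind == "or" else min(child_values)
    return node.value

# A goal that splits into two independent subgoals: the value of the product
# node is that of the hardest branch, steering search toward the 0.2 subgoal.
goal = Node("and", children=[Node("or", value=0.9), Node("or", value=0.2)])
print(backup(goal))              # 0.2
```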
Every verified proof is fed back into training, updating the model's weights so that similar reasoning steps become easier to find on future problems. This expert iteration loop generates a progressively harder curriculum: each round of training extends the set of problems the system can prove, and those new proofs become training data for the next round.
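The outer loop can be caricatured in a few lines; everything below is a toy stand-in (a single scalar plays the role of the model) meant only to show the compounding dynamic.

```python
# Toy expert-iteration loop: attempt problems, keep verified solves, retrain.
import random

def expert_iteration(difficulties: list[float], strength: float, rounds: int):
    random.seed(0)
    for r in range(rounds):
        # A solve is more likely when prover strength dominates difficulty.
        solved = [d for d in difficulties
                  if random.random() < strength / (strength + d)]
        # Verified proofs are folded back into training, pulling harder
        # problems into reach for the next round.
        strength += 0.1 * len(solved)
        print(f"round {r}: solved {len(solved)}/{len(difficulties)}")

expert_iteration([0.5, 1.0, 2.0, 4.0, 8.0], strength=1.0, rounds=3)
```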
The training pipeline requires a large supply of formal problems, but human-written Lean problems are limited. AlphaProof solves this through autoformalization: a fine-tuned Gemini model translates natural-language mathematics into Lean 4 statements.
The autoformalization model is trained on triplets of (informal statement, formalization reasoning chain, formal Lean statement). For each source problem, the model stochastically samples multiple distinct Lean formalizations, expanding the formal corpus to roughly 80 times the size of the source problem set. Even imperfect formalizations that fail to capture the source problem faithfully still provide useful training signal, since they are well-formed mathematical statements the prover can practice on.
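An autoformalization target might look like the following (a made-up example, not drawn from AlphaProof's data): the informal claim "the sum of two even naturals is even" rendered as a Lean 4 statement with a short proof.

```lean
-- Hypothetical formalization of an informal problem statement; the proof
-- uses `omega`, Lean's built-in linear-arithmetic tactic.
theorem even_add_even (a b : Nat) (ha : ∃ k, a = 2 * k) (hb : ∃ k, b = 2 * k) :
    ∃ k, a + b = 2 * k :=
  match ha, hb with
  | ⟨m, hm⟩, ⟨n, hn⟩ => ⟨m + n, by omega⟩
```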
This pipeline was applied to a large corpus of mathematical problems spanning many topics and difficulty levels, building a library of formal problems used for the RL training phase. The system trained by proving or disproving millions of such problems over a period of weeks before the 2024 IMO competition.
During the actual competition, Gemini was also used to generate candidate numerical answers for problems with specific numerical solutions. AlphaProof would then attempt to prove that these candidate answers were correct, or disprove them if they were not, effectively using Gemini as an answer-generation oracle while AlphaProof handled the verification.
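In Lean terms, verifying a candidate numerical answer means proving an if-and-only-if characterization. A toy instance of the pattern (illustrative, far simpler than any IMO problem):

```lean
-- Candidate answer: the unique natural solution of x + 3 = 5 is x = 2.
-- Proving the iff both confirms the answer and rules out alternatives.
example : ∀ x : Nat, x + 3 = 5 ↔ x = 2 := by
  intro x
  omega
```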
One of the more unusual aspects of AlphaProof is that training continues during inference on difficult problems. When given a hard problem, the system does not simply run a fixed forward pass. Instead, it generates variations of the target theorem, creates a curriculum of easier sub-problems and related conjectures, and runs additional RL iterations specifically focused on that problem. This allows the model to effectively specialize for the problem at hand, discovering intermediate lemmas that are useful even if they were never seen in pretraining.
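The loop can be sketched as follows; this is a toy model of the dynamic, with a single scalar standing in for the prover's weights and made-up success probabilities.

```python
# Toy test-time RL: try the target, and when it fails, train on easier
# self-generated variants until the target comes into reach. Illustrative only.
import random

def attempt(difficulty: float, skill: float) -> bool:
    return random.random() < max(0.0, min(1.0, skill - difficulty + 1.0))

def test_time_rl(target: float, skill: float, iterations: int) -> bool:
    random.seed(0)
    for _ in range(iterations):
        if attempt(target, skill):
            return True                       # target theorem proven
        # Propose easier variants of the target; learn from the ones solved.
        variants = [target * random.uniform(0.3, 0.9) for _ in range(8)]
        skill += 0.05 * sum(attempt(v, skill) for v in variants)
    return False

print(test_time_rl(target=1.6, skill=0.7, iterations=50))
```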
For IMO 2024, this test-time RL was applied over extended compute sessions. Some problems were resolved in minutes; others required up to three days of continuous search.
Lean 4 is the version of the Lean theorem prover that AlphaProof operates in. Lean was originally developed at Microsoft Research and is now maintained by an open-source community. It is both an interactive theorem prover and a functional programming language based on dependent type theory, specifically the calculus of inductive constructions.
Lean's type system enforces logical consistency at the language level: a term inhabiting a type corresponds to a proof of the proposition that type encodes, a correspondence known as the Curry-Howard isomorphism. This means that if Lean accepts a program as well-typed, the corresponding mathematical statement has been proven.
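A standard two-line demonstration of the correspondence: the function below is simultaneously a program that swaps a pair and a proof that conjunction commutes.

```lean
-- The term inhabits the type `p ∧ q → q ∧ p`, so type-checking it is
-- exactly checking the proof (a textbook Curry-Howard example).
theorem and_swap (p q : Prop) : p ∧ q → q ∧ p :=
  fun ⟨hp, hq⟩ => ⟨hq, hp⟩
```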
Mathlib is Lean's comprehensive community-maintained mathematical library. As of 2024, it contains over two million lines of formalized mathematics covering algebra, analysis, topology, number theory, combinatorics, probability, and many other areas. Mathlib represents years of collective effort by hundreds of contributors and has become one of the most extensive formal mathematics libraries in any proof assistant.
For AlphaProof, Mathlib serves as both a curriculum and a toolbox. The model was fine-tuned on proofs extracted from Mathlib and can leverage all of Mathlib's existing definitions, lemmas, and tactics during search. When a proof step calls a Mathlib lemma, that call does not need to be re-derived from scratch; the prover simply applies the existing theorem.
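For instance (an illustrative example, not from AlphaProof's search traces), a single application of an existing Mathlib theorem closes a goal that would otherwise need its own derivation:

```lean
import Mathlib

-- `Nat.Prime.two_le` already lives in Mathlib; the proof just invokes it.
example (p : ℕ) (hp : p.Prime) : 2 ≤ p :=
  hp.two_le
```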
Terence Tao, the Fields Medal-winning mathematician, noted the significance of this approach when commenting on AlphaProof: the formal setting means that any proof the system produces is unambiguously correct, which is not always true of AI-generated informal proofs that can sound plausible while containing subtle errors.
The 65th International Mathematical Olympiad took place in Bath, England, in July 2024. A total of 609 contestants from 108 countries competed. Each contestant faced six problems worth seven points each, for a maximum of 42 points. Medal cutoffs that year were: gold at 29 points or above (58 contestants), silver from 22 to 28 points (123 contestants), and bronze from 16 to 21 points (145 contestants).
The combined AlphaProof and AlphaGeometry 2 system solved four of the six problems, earning 28 points total, placing it within the silver-medal range and one point below the gold-medal threshold.
| Problem | Topic | Points available | Solved by | Human solve rate |
|---|---|---|---|---|
| P1 | Algebra | 7 | AlphaProof | 413 / 609 (68%) |
| P2 | Number theory | 7 | AlphaProof | 156 / 609 (26%) |
| P3 | Combinatorics | 7 | Not solved | Very low |
| P4 | Geometry | 7 | AlphaGeometry 2 | Moderate |
| P5 | Combinatorics | 7 | Not solved | Very low |
| P6 | Algebra | 7 | AlphaProof | 5 / 609 (< 1%) |
AlphaProof solved P1, P2, and P6. AlphaGeometry 2, a separate system specialized for Euclidean geometry, solved P4. Problems P3 and P5, both in combinatorics, remained unsolved.
P6 was the most remarkable result. Described as one of the hardest problems in the competition's recent history, it was solved by only five of the 609 human contestants. AlphaProof found a valid Lean proof, but one built on extensive, systematic case analysis: a legitimate mathematical strategy, though far from the compact, illuminating argument a human mathematician would typically aim for.
P2, a number theory problem about sequences of greatest common divisors, was handled by AlphaProof using analysis of behavior modulo ab+1. While the underlying mathematical insight was sound, the resulting Lean code was described by human reviewers as "quite ugly" compared to what a human might write.
P1, an algebra problem that 68% of contestants solved, was completed relatively quickly, consistent with its classification as the competition's most approachable problem.
AlphaProof did not work under the same time constraints as human contestants, who sit two 4.5-hour sessions on consecutive days, three problems per session. The system ran for up to three days on the hardest problems, using substantial compute. This is a significant caveat when comparing its performance to human results.
The solutions were evaluated by Professor Sir Timothy Gowers, a Fields Medal winner and IMO gold medalist, and Dr. Joseph Myers, a two-time IMO gold medalist who chaired the IMO 2024 Problem Selection Committee. Both confirmed that AlphaProof's solutions were valid under the standard IMO scoring rubric. Professor Gowers said the results were "very impressive, and well beyond what I thought was state of the art."
The table below places the AlphaProof-AlphaGeometry 2 combined score in context relative to the 2024 IMO medal cutoffs.
| Performance level | Score (out of 42) | Notes |
|---|---|---|
| Gold medal (human) | 29+ | 58 contestants in 2024 |
| AlphaProof + AlphaGeometry 2 | 28 | One point below gold cutoff |
| Silver medal (human) | 22-28 | 123 contestants in 2024 |
| Bronze medal (human) | 16-21 | 145 contestants in 2024 |
| Honorable mention (human) | 7-15 | Perfect score on at least one problem, no medal |
The silver-medal result placed the combined AI system within the top 30% of contestants, the share who earned gold or silver in 2024. However, the comparison requires care. Human contestants tackle all six problems within two timed 4.5-hour sessions. AlphaProof ran for up to three days on individual problems, used substantial computational resources, and required human experts to translate the problems from natural language into Lean before the system could process them. Under identical time and resource constraints, the human-comparable score would likely be lower.
The two unsolved problems, P3 and P5, were combinatorics problems that required constructing novel organizational structures and reasoning about strategies. These types of problems seem to be a persistent weakness of the formal-reasoning approach, as they require inventing intermediate abstractions that are not naturally suggested by the problem statement.
In February 2025, Google DeepMind published the full AlphaProof research in Nature under the title "Olympiad-level formal mathematical reasoning with reinforcement learning." The paper provided a detailed account of the architecture, training pipeline, and results.
At IMO 2025, Google DeepMind fielded a different approach: an advanced version of Gemini with Deep Think capability. This system operated entirely in natural language, without requiring expert translation into Lean, and solved five of the six IMO 2025 problems within the standard 4.5-hour competition time limit, achieving a score of 35 out of 42 points. The IMO 2025 result was certified by official IMO coordinators using the same criteria applied to student solutions, and the IMO President described the solutions as "astonishing... clear, precise and easy to follow."
The contrast between 2024 and 2025 illustrates two different philosophies. AlphaProof's approach relies on formal verification: proofs are machine-checked and guaranteed correct, but the system requires formalization infrastructure and extended compute. The 2025 Gemini approach trades machine-checked guarantees for speed and generality, producing natural-language proofs, judged rigorous by the human graders, within a competition-compatible time frame.
AlphaProof's IMO performance demonstrates a capability with broader applications in mathematics and software.
Formal theorem proving systems can serve as verification backends for mathematical research. A mathematician can state a conjecture in Lean and use AlphaProof-like systems to search for proofs of supporting lemmas, check that a proposed proof strategy does not have gaps, or verify that a known result from one area of mathematics applies in another context. The formal output provides a machine-checkable certificate of correctness.
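The workflow starts with a statement whose proof is left as a hole; a hedged sketch, with a made-up placeholder statement rather than a research conjecture:

```lean
-- `sorry` marks the open goal; an AlphaProof-like system would attempt to
-- replace it with a verified proof term.
theorem gap_exists (n : Nat) : ∃ m, n < m ∧ m < 2 * n + 2 :=
  sorry
```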
Google DeepMind subsequently published a set of formalized conjectures in a GitHub repository, explicitly inviting the formal mathematics community to use these as benchmarks and training targets. This signals an intent to engage with research mathematics beyond competition problems.
Formal methods have long been used in safety-critical software and hardware design. The same mechanisms that AlphaProof uses to verify mathematical proofs apply directly to verifying that a program or circuit satisfies a formal specification. AI-assisted formal verification could reduce the cost of verifying complex systems that are currently too large to check manually.
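A toy instance of the overlap (deliberately simple): proving a property of a program, here that reversing a list twice is the identity, uses exactly the same machinery as proving a mathematical lemma.

```lean
-- `List.reverse_reverse` ships with Lean's standard library; the same
-- framework verifies program behavior and mathematical statements alike.
example (l : List Nat) : l.reverse.reverse = l :=
  List.reverse_reverse l
```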
In multi-stage discovery pipelines, AlphaProof functions as a verification backend. A discovery system identifies a conjecture or candidate result; an informal reasoning system sketches a proof strategy; AlphaProof formalizes and verifies the argument within Lean. Each successful proof extends the formal library available to future reasoning, creating a compounding effect as the library grows.
Formal proof assistants have been proposed as educational tools that give students immediate feedback on proof attempts. An AlphaProof-like system could help students understand where a proof attempt breaks down and what additional lemmas are needed, without requiring a human instructor to check each step.
AlphaProof has several significant limitations that affect its practical usefulness compared to informal mathematical reasoning.
Formalization requirement. As of its 2024 announcement, AlphaProof required human experts to translate competition problems from natural language into Lean before the system could process them. This is not a quick task; the IMO 2024 combinatorics problem P5 (Turbo the Snail) took DeepMind's team over a day to formalize because it involves spatial reasoning about a grid that is difficult to express naturally in Lean. The autoformalization pipeline addresses this partially but does not yet handle all problem types without human oversight.
Compute requirements. AlphaProof used far more computational resources and time than human contestants. Problems that took human contestants hours required the system days of continuous search on substantial hardware. The silver-medal score was achieved under conditions that are not comparable to competition constraints.
Weakness on combinatorics. The two unsolved IMO 2024 problems were both combinatorics problems. Combinatorics often requires inventing the right abstraction or proof structure, which is not suggested by the problem statement. AlphaProof's search-based approach works well when useful intermediate lemmas can be discovered incrementally but struggles when the key insight requires a conceptual leap to a new organizational framework.
Proof style. AlphaProof's proofs, while correct, are not always written in a style that mathematicians find illuminating. The P6 solution used extensive case enumeration rather than a general argument. Such proofs verify the truth of a statement but do not necessarily convey why it is true, which limits their value for mathematical understanding.
Library coverage. Novel mathematical territory that falls outside Mathlib's existing coverage requires extending the library before AlphaProof can reason about it effectively. Adding new mathematical areas to Mathlib is a significant undertaking that still requires substantial human effort.
No informal reasoning. AlphaProof does not produce informal mathematical intuition. It finds formal proofs but cannot explain its reasoning in the way a mathematician would explain why a proof idea works. This makes it difficult to use as a collaborative tool in the way that informal AI assistants can be.