AlphaGeometry is a neuro-symbolic artificial intelligence system developed by Google DeepMind that solves competition-level geometry problems at a standard comparable to human Olympiad gold medalists. The original system was published on January 17, 2024 in the journal Nature under the title "Solving olympiad geometry without human demonstrations," authored by Trieu Trinh, Yuhuai Wu, Quoc Le, He He, and Thang Luong. AlphaGeometry was the first AI system to approach human expert-level performance on International Mathematical Olympiad (IMO) geometry problems, solving 25 of 30 benchmark problems within the standard time limit. A substantially upgraded version, AlphaGeometry 2, was deployed in July 2024 for the actual IMO competition, where it solved Problem 4 within 19 seconds and contributed to a silver-medal-level combined score alongside the separate AlphaProof system.
Formal reasoning over geometric figures has challenged computer scientists for decades. While AI systems have performed well at games, protein structure prediction (AlphaFold), and natural language tasks, mathematical reasoning requires a different kind of capability: the system must not only produce an answer but construct a logically valid proof that can be verified step by step.
Geometry competitions such as the International Mathematical Olympiad present particularly hard instances of this challenge. IMO problems are selected to resist standard computational approaches and require non-obvious auxiliary constructions, where a solver must introduce new points, lines, or circles that do not appear in the original problem statement before a proof becomes tractable. A human mathematician drawing on years of training can recognize which constructions are likely to help; prior automated systems could not.
Earlier automated geometry solvers relied on algebraic approaches such as Wu's method and the Gröbner basis method, or on purely deductive systems called synthetic provers. These tools solved some problems efficiently but plateaued well below the difficulty of IMO geometry. The best prior system solved 10 of the 30 problems in the IMO-AG-30 benchmark, while a human gold medalist averages 25.9 out of 30.
AlphaGeometry was announced on January 17, 2024, simultaneously with the publication of the Nature paper. The core insight behind the system is that the two main obstacles to automated geometry proving, namely the inability to make good auxiliary constructions and the need for human-annotated training data, can both be overcome by generating vast amounts of synthetic geometric data without human input.
The system solved 25 of 30 problems in the IMO-AG-30 benchmark, placing it between the performance of a human silver medalist (typically around 22 problems) and the average gold medalist (25.9 problems). The previous automated record was 10 problems.
AlphaGeometry combines two components that work in an interleaved loop:
Symbolic deduction engine (DDAR). The symbolic component is an algorithm called DDAR, for Deductive Database with Algebraic Rules (sometimes written DD+AR). DDAR computes the deduction closure of a geometric configuration: given a set of initial facts about points, lines, circles, and angles, it exhaustively applies a fixed set of inference rules to derive all additional facts that follow logically. DDAR is complete within its rule set but cannot introduce new geometric objects; when it can derive no further facts, it halts.
Neural language model. When DDAR halts without having reached the proof goal, the neural language model takes over. It reads the current state of the proof attempt (expressed in a formal geometric language) and proposes an auxiliary construction: a new point, line, or circle to add to the figure. Once that construction is added, DDAR resumes. The two components alternate until either a proof is found or a search budget is exhausted.
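This alternation can be illustrated with a toy sketch. All names below (`deduce_closure`, `propose_constructions`, the rule format) are hypothetical stand-ins rather than AlphaGeometry's actual API, and facts are opaque strings rather than geometric statements.

```python
# Toy sketch of the DDAR / language-model alternation loop (hypothetical
# names; the real system operates on geometric predicates, not strings).

def deduce_closure(facts, rules):
    """Apply every rule repeatedly until no new fact appears (DDAR-style)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for new_fact in rule(facts):
                if new_fact not in facts:
                    facts.add(new_fact)
                    changed = True
    return facts

def solve(premises, goal, rules, propose_constructions, budget=10):
    """Alternate symbolic deduction with neural construction proposals."""
    facts = deduce_closure(premises, rules)
    for _ in range(budget):
        if goal in facts:
            return True
        # The symbolic engine is stuck: ask the model for auxiliary objects,
        # add their defining facts, and let deduction resume.
        aux = propose_constructions(facts)
        if not aux:
            return False
        facts = deduce_closure(facts | set(aux), rules)
    return goal in facts
```

In the real system the proposal step returns many ranked candidate constructions that are explored by search rather than a single suggestion.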
The language model is a transformer-based architecture trained entirely on synthetic geometric data. It operates on sequences of symbols in a purpose-built formal language that describes geometric objects and relationships, rather than on natural language or images.
One of the most important contributions of the AlphaGeometry paper is its method for generating training data without any human-annotated proofs. The process has two stages.
First, the team used highly parallelized computing to generate one billion random geometric configurations. For each configuration, DDAR computed all derivable facts, producing a large set of (premises, conclusions) pairs. This is "forward generation": derive everything that is true about a randomly sampled figure.
Second, the team ran a traceback procedure. For each conclusion that required no auxiliary construction, the traceback identifies the minimal subset of premises needed to prove it, yielding a clean training example. For conclusions that did require an auxiliary construction, the procedure identifies which points in the figure served as auxiliary constructions and generates the corresponding proof steps.
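The two stages can be illustrated on a toy rule system. Everything here (the rule encoding, the function names) is a simplified assumption for exposition; the real pipeline samples random geometric figures and runs DDAR, not string rules.

```python
# Toy sketch of forward generation + traceback (hypothetical encoding:
# facts are strings, a rule maps a set of antecedents to one conclusion).
import random

RULES = [({"a", "b"}, "c"), ({"c"}, "d"), ({"b", "e"}, "f")]

def forward_closure(premises):
    """Derive all facts, recording which antecedents produced each one."""
    facts = set(premises)
    parents = {}  # derived fact -> frozenset of antecedents used
    changed = True
    while changed:
        changed = False
        for ante, concl in RULES:
            if ante <= facts and concl not in facts:
                facts.add(concl)
                parents[concl] = frozenset(ante)
                changed = True
    return facts, parents

def traceback(conclusion, parents, premises):
    """Walk dependencies back to the minimal premises proving `conclusion`."""
    needed, stack = set(), [conclusion]
    while stack:
        f = stack.pop()
        if f in premises:
            needed.add(f)
        else:
            stack.extend(parents[f])
    return needed

def generate_examples(n, seed=0):
    """Forward generation: sample premises, keep (premises, conclusion) pairs."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        premises = set(rng.sample(["a", "b", "e"], rng.randint(1, 3)))
        facts, parents = forward_closure(premises)
        for concl in facts - premises:
            examples.append((traceback(concl, parents, premises), concl))
    return examples
```

The minimal-premise property is what makes each example "clean": the model never sees irrelevant hypotheses attached to a conclusion.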
This process yielded roughly 100 million unique geometry theorems and their proofs across a wide range of difficulty levels. Nine million of these examples involved auxiliary constructions, which are the hardest cases. The final training dataset was filtered to remove near-duplicates.
The language model was trained from scratch on this synthetic corpus. It never saw a human-written proof; all of its knowledge of geometric reasoning came from the synthetically generated data.
AlphaGeometry operates in a formal language developed specifically for Euclidean plane geometry. The language can express geometric objects (points, lines, circles) and relations between them, such as collinearity, concyclicity, perpendicularity, parallelism, equality of angles, and equality of ratios.
All geometric facts are encoded as symbolic expressions in this language. The language model takes a sequence of such expressions as input and outputs the description of a new geometric object as a token sequence.
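As a rough illustration, the sketch below represents facts as predicate-plus-arguments tuples and flattens them into a token stream. The predicate names echo those used in AlphaGeometry's released code (`coll`, `cyclic`, `perp`), but the exact data layout here is an assumption for illustration.

```python
# Illustrative encoding of geometric facts as symbolic token sequences.
# Predicate names echo AlphaGeometry's style; the layout is a sketch.
from typing import NamedTuple, Tuple, List

class Fact(NamedTuple):
    predicate: str           # e.g. "coll" = collinear, "cyclic" = concyclic
    args: Tuple[str, ...]    # point names

    def tokens(self) -> List[str]:
        """Serialize to the flat token sequence a language model consumes."""
        return [self.predicate, *self.args]

problem = [
    Fact("coll", ("A", "B", "C")),         # A, B, C are collinear
    Fact("cyclic", ("A", "B", "C", "D")),  # A, B, C, D lie on one circle
    Fact("perp", ("A", "B", "C", "D")),    # line AB is perpendicular to line CD
]

# The model reads the problem as one flat stream of symbols:
stream = [tok for fact in problem for tok in fact.tokens()]
```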
The language was deliberately restricted in scope to make the formal system tractable. One consequence is that the original AlphaGeometry language could only express approximately 66% of IMO geometry problems from 2000 to 2024. Problems involving geometric inequalities, area calculations, or combinatorial geometry fall outside this scope entirely.
The 30-problem benchmark (IMO-AG-30) was constructed by selecting all geometry problems from past IMO competitions that could be expressed in the AlphaGeometry formal language. The benchmark was used to compare AlphaGeometry against prior systems and against human performance levels.
| System | IMO-AG-30 solved | Notes |
|---|---|---|
| DDAR alone | 14 | Symbolic component only, no auxiliary constructions |
| Wu's method (previous best published solver) | 10 | Best prior result at time of publication |
| AlphaGeometry | 25 | Within standard time limit |
| Average IMO gold medalist | 25.9 | Human reference |
| Average IMO silver medalist | ~22 | Human reference |
AlphaGeometry produced human-readable proofs for all solved problems. The proofs were verified automatically by the symbolic engine and independently checked by mathematicians, including at least one IMO gold medalist.
DDAR stands for Deductive Database with Algebraic Rules. It maintains a database of known facts about the current geometric configuration and iteratively applies a fixed set of deduction rules until no new facts can be derived (the fixed point of the rule set). The rules cover standard theorems of Euclidean geometry, angle chasing, ratio chasing, and distance relationships.
DDAR is sound: it only derives facts that are logically entailed by the premises. It is also complete within its rule set, meaning it will eventually derive every fact that follows from the premises using the available rules. This soundness property is what guarantees that AlphaGeometry's proofs are correct.
The engine handles both numeric and symbolic relationships. Angle chasing, for example, assigns angle labels and tracks equalities symbolically rather than numerically, which allows DDAR to prove angle equalities that hold for general positions of the figure rather than just for a specific diagram.
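A minimal sketch of symbolic angle chasing, assuming a union-find structure over named angles: equalities asserted as premises merge equivalence classes, and any equality within a class then holds for all positions of the figure. This illustrates the idea only; DDAR's actual algebraic component also handles linear relations among angles, not just pure equality chains.

```python
# Sketch: symbolic angle chasing as equivalence classes over angle names.
# (A simplification: only pure equality chains, no linear combinations.)

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

    def equal(self, a, b):
        return self.find(a) == self.find(b)

angles = UnionFind()
angles.union("ABD", "ACD")  # premise: e.g. inscribed angles on the same arc
angles.union("ACD", "AED")  # premise: another angle equality
# Derived symbolically, valid for any diagram satisfying the premises:
derived = angles.equal("ABD", "AED")
```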
The neural language model can be understood as a learned heuristic for choosing which auxiliary construction to make at each step. When DDAR stalls, the language model receives a representation of the current proof state and outputs a probability distribution over possible construction tokens. The most likely constructions are tried first.
During training, the model learned to associate proof states with the kinds of auxiliary constructions that historically led to proofs in the synthetic training data. This is analogous to how a human student learns to recognize patterns: after seeing many cyclic quadrilateral proofs, for example, a student learns that drawing a circle through three specific points is often a productive move.
The language model does not verify whether its proposed constructions are geometrically valid (e.g., whether a claimed intersection actually exists). DDAR handles validation: if the proposed construction is geometrically impossible or does not add any new useful information, DDAR's deduction closure does not grow and the system backtracks.
AlphaGeometry uses a beam search procedure over proof attempts. At each step, the language model proposes multiple candidate constructions ranked by probability. DDAR is run for each candidate. If any candidate leads to a proof, the search terminates. If not, the search expands the most promising partial proof states in the beam and repeats.
This approach is analogous to how Monte Carlo Tree Search was used in AlphaGo: the neural model provides fast, heuristic guidance over a large search space that a purely symbolic system could not efficiently navigate.
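The search procedure can be sketched as a standard beam search. The scoring function, `propose`, and `deduce` below are toy stand-ins for the neural model and DDAR; the real system's state representation and scoring are more involved.

```python
# Toy beam search over auxiliary-construction proposals (hypothetical
# stand-ins for the neural proposer and the DDAR closure).
import heapq

def beam_search(premises, goal, propose, deduce, beam_width=2, depth=3):
    """Keep the best-scoring proof states, expanding each with top proposals."""
    beam = [(0.0, 0, frozenset(deduce(set(premises))))]  # (-score, tiebreak, facts)
    tie = 0
    for _ in range(depth):
        candidates = []
        for neg_score, _, facts in beam:
            if goal in facts:
                return True
            # Try the highest-probability constructions first.
            for score, construction in propose(facts)[:beam_width]:
                tie += 1
                new_facts = frozenset(deduce(set(facts) | {construction}))
                candidates.append((neg_score - score, tie, new_facts))
        if not candidates:
            return False
        beam = heapq.nsmallest(beam_width, candidates)
    return any(goal in facts for _, _, facts in beam)
```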
AlphaGeometry 2 (AG2) was announced publicly on July 25, 2024 alongside AlphaProof, following its deployment at IMO 2024. A detailed technical paper, "Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2," was published on arXiv in February 2025 (arXiv:2502.03544).
AG2 differs from the original system in several important ways:
Language model. The language model in AG2 is based on the Gemini architecture rather than the smaller custom transformer used in AG1. The model was trained at scales up to 3.3 billion parameters on TPUv4 hardware.
Training data scale. AG2 was trained on approximately 300 million synthetic theorems, roughly an order of magnitude more than the 100 million used for AG1. The data generation procedure was also improved to produce more diverse and harder examples.
Expanded formal language. The AG2 language was extended to handle several classes of problems that AG1 could not express, including locus problems (statements about the movement of objects), linear equations over angles, ratios, and distances, and problems with non-constructive statements.
These extensions increased the coverage of IMO geometry problems (2000-2024) from 66% to 88%.
Improved symbolic engine. The DDAR engine was re-engineered for AG2 with a reduced rule set, faster inference, and improved handling of double points (configurations where the same point arises from two different constructions).
Shared Knowledge Ensemble of Search Trees (SKEST). AG2 introduced a novel search algorithm that runs multiple parallel search trees and shares intermediate results between them. When one search tree discovers a useful auxiliary construction, other trees can benefit from that discovery, which accelerates the overall search.
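The shared-knowledge idea can be sketched as search trees publishing derived facts into a common pool. This round-robin simulation and its interfaces are assumptions for illustration; the real SKEST runs trees in parallel with differing search configurations.

```python
# Toy sketch of SKEST-style knowledge sharing: each "tree" is a callable
# that derives new facts from its own state plus the shared pool.

def skest(trees, goal, rounds=5):
    shared = set()                    # facts visible to every tree
    states = [set() for _ in trees]   # each tree's private state
    for _ in range(rounds):
        for state, step in zip(states, trees):
            new = step(state, shared)
            state |= new
            shared |= new             # publish discoveries to all trees
            if goal in shared:
                return True
    return False
```

In the second assertion below, a tree that needs an intermediate lemma fails alone but succeeds once another tree publishes that lemma, which is the point of the shared pool.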
Natural language input pipeline. AG2 includes a pipeline using Gemini to translate geometry problems from natural language into the formal AG language, and an automated diagram generation algorithm. This makes the system more practical for end-to-end use without manual formalization.
AG2 was evaluated on all 50 IMO geometry problems from 2000 to 2024 that could be expressed in the expanded formal language. It solved 42 of these 50 problems, a success rate of 84%. On the same 2000-2024 problem set, AG1 solved 54%. According to the AG2 paper, the 84% rate surpasses the performance of an average IMO gold medalist on these problems.
| System | Problems solved | Benchmark | Solve rate |
|---|---|---|---|
| AlphaGeometry 1 | 25/30 | IMO-AG-30 (AG1 language) | 83% |
| AlphaGeometry 2 | 42/50 | IMO geometry 2000-2024 (AG2 language) | 84% |
| Average IMO gold medalist | 25.9/30 | IMO-AG-30 | ~86% |
The 2024 International Mathematical Olympiad was held in Bath, United Kingdom in July 2024. After the competition, Google DeepMind tested AlphaProof and AlphaGeometry 2 on the six official 2024 IMO problems.
Problem 4 at IMO 2024 was a geometry problem requiring proof that two angles sum to 180 degrees in a specific configuration involving a triangle, its incenter, and several auxiliary points. AG2 solved the problem within 19 seconds after receiving its formalization.
AG2's solution introduced a point E on the line BI (where I is the incenter) such that the angle AEB equals 90 degrees. This auxiliary construction, which does not appear in the problem statement, creates several pairs of similar triangles (including ABE ~ YBI and ALE ~ IPC) that allow the required angle relationship to be derived. The solution was awarded full marks (7 out of 7 points) by Joseph Myers, a two-time IMO gold medalist and Chair of the IMO 2024 Problem Selection Committee.
AlphaProof is a separate DeepMind system that handles non-geometry problems using reinforcement learning over formal proofs in the Lean 4 theorem prover. At IMO 2024, AlphaProof solved Problems 1, 2, and 6 (two algebra problems and one number theory problem). Problem 6 was solved by only five of the 609 human contestants.
Together, AlphaProof and AlphaGeometry 2 solved 4 of the 6 IMO 2024 problems, scoring 28 of the possible 42 points. The gold medal cutoff at IMO 2024 was 29 points, so the combined score of 28 fell one point short of gold, placing the system at the top of the silver-medal range.
Problems 3 and 5, both combinatorics problems, were not solved; combinatorics was outside the scope of both systems at the time.
Wu's method, developed by Chinese mathematician Wen-Tsun Wu in the 1970s, reduces geometric theorem proving to polynomial arithmetic using characteristic set decomposition. It is an algebraic approach rather than a synthetic one and can handle a different class of problems than DDAR.
A 2024 paper by Sinha et al. (arXiv:2404.06405) revisited the AlphaGeometry benchmark and found that the original comparison significantly underestimated Wu's method. When run properly with adequate time limits, Wu's method alone solved 15 of the 30 IMO-AG-30 problems, not the 10 reported in the original paper.
| System | IMO-AG-30 solved | Notes |
|---|---|---|
| Wu's method (original comparison) | 10 | As cited in AlphaGeometry Nature paper |
| Wu's method (revised evaluation) | 15 | Sinha et al. 2024, CPU-only laptop, 5 min limit |
| DDAR alone | 14 | No auxiliary constructions |
| Wu's method + DDAR | 21 | Combined symbolic approach |
| AlphaGeometry 1 | 25 | Neuro-symbolic |
| AlphaGeometry 1 + Wu's method | 27 | Combined, first system above gold medalist level |
| Average IMO gold medalist | 25.9 | Human reference |
Crucially, Wu's method solved 2 problems that AlphaGeometry failed on, and AlphaGeometry solved problems that Wu's method failed on. The two approaches have complementary strengths. Combining them produced a new state of the art (27/30) that exceeded the average human gold medalist for the first time.
This finding does not diminish AlphaGeometry's significance but shows that the classical symbolic approaches were stronger than they had been given credit for in the initial publication.
At the time of the Nature paper's publication, Google DeepMind released the AlphaGeometry code and model weights as open source. The primary GitHub repository, google-deepmind/alphageometry, contains the system's source code, including the DDAR symbolic engine and the language-model inference code.
Model weights are hosted separately on Hugging Face (Wauplin/alphageometry). The code is released under an open-source license, allowing researchers to reproduce, study, and build on the system.
A separate repository, google-deepmind/alphageometry2, was released in early 2025 containing the AlphaGeometry 2 symbolic engine (DDAR) with examples. The full AG2 language model weights were not publicly released at the time of the arXiv paper publication.
The primary significance of AlphaGeometry is as a demonstration that neuro-symbolic systems can achieve expert human performance on formal reasoning tasks by training entirely on synthetic data, without access to human demonstrations.
Automated theorem proving. AlphaGeometry contributed to renewed interest in automated theorem proving (ATP) as a serious AI research direction. Prior ATP systems for geometry were either fast but limited (DDAR, angle chasers) or general but slow (algebraic methods). AlphaGeometry showed that a learned auxiliary construction oracle could break through the ceiling of pure symbolic methods.
Synthetic data for reasoning. The synthetic data generation technique has broader implications. The approach of generating training data by forward sampling (random configurations) and backward filtering (identifying what constructions were needed) is applicable to other formal reasoning domains beyond geometry. It sidesteps the bottleneck of needing human experts to write proofs.
Mathematical research assistance. While AlphaGeometry was designed for competition problems, the same neuro-symbolic architecture could in principle assist research mathematicians working on problems that involve geometric structures. Researchers investigating geometric configurations, new theorems, or extensions of classical results could use such systems to automate search over large spaces of possible constructions.
Education. The system can generate human-readable proofs with explicit construction steps, making it potentially useful as an educational tool for students learning geometry proof techniques. A student could use the system to check their work or to see alternative proof approaches for a given problem.
End-to-end natural language geometry. AlphaGeometry 2's Gemini-based translation pipeline represents a step toward systems that can accept geometry problems stated in plain text (as they would appear in a textbook or competition paper) rather than requiring manual formalization. This is important for practical deployment in educational or research settings.
AlphaGeometry has several important limitations that qualify the significance of its performance:
Scope restriction. The original system can only handle problems expressible in its formal language, which covered 66% of IMO geometry problems. AlphaGeometry 2 extended this to 88%, but problems involving geometric inequalities, area calculations, affine geometry in higher dimensions, or combinatorial arguments remain outside the current scope.
Formalization bottleneck. For the IMO 2024 evaluation, DeepMind staff manually formalized the problem statements into the AlphaGeometry language. Full automation of the translation from natural language to formal language remained imperfect, with the Gemini-based translation pipeline introduced in AG2 being a partial solution rather than a complete one.
Proof verbosity. AlphaGeometry's proofs are sometimes more verbose than elegant human proofs. A system that introduces three auxiliary constructions to reach a conclusion that a skilled human might reach in one step is correct but not necessarily illuminating.
No novel mathematics. AlphaGeometry is a theorem prover, not a theorem discoverer. It takes a stated conjecture and attempts to prove it. It does not generate new conjectures or identify interesting structure in a domain unprompted.
Geometry only. Neither AlphaGeometry 1 nor AlphaGeometry 2 handles algebra, number theory, combinatorics, or analysis. These areas require separate systems (such as AlphaProof for formal algebra and number theory).
Computational cost. The full AG2 system requires substantial computing resources. Some IMO 2024 problems took AlphaProof up to three days of computation; AG2's 19-second solution to Problem 4 was unusually fast, and harder cases may require significantly more time.
The Nature paper received substantial attention from the mathematics and AI research communities. Nature published an accompanying commentary titled "This AI just figured out geometry," and the work was covered by Bloomberg, The Decoder, and other technology press.
Several mathematicians noted that while AlphaGeometry's performance was impressive, the reliance on a fixed formal language and the requirement for manual problem formalization meant the system was not yet a general-purpose mathematics assistant. The subsequent development of AG2 and the IMO 2024 results addressed some of these concerns by expanding the language and improving natural language translation.
The discovery that Wu's method was stronger than reported in the original paper generated some discussion about proper benchmarking methodology. The authors of the follow-up study emphasized that the two approaches are complementary and that combining them yields better results than either alone.