AlphaGeometry 2
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 4,445 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 4,445 words
Add missing citations, update stale details, or suggest a clearer explanation.
AlphaGeometry 2 (often abbreviated AG2) is a neuro-symbolic artificial intelligence system developed by Google DeepMind for solving Olympiad-level Euclidean geometry problems. It is the direct successor to alphageometry (AG1), the system that DeepMind unveiled in January 2024. AlphaGeometry 2 was first announced publicly on 25 July 2024 as part of a joint result with the sibling system alphaproof: together, the two systems achieved a silver-medal-equivalent score of 28 out of 42 points on the problem set of the 2024 International Mathematical Olympiad (IMO), solving four of the six problems and falling just one point short of the gold-medal cutoff. AlphaGeometry 2 itself was responsible for problem 4, the geometry question, which it solved within 19 seconds of receiving a formalised statement. A detailed technical paper describing the system, titled "Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2," appeared on arXiv on 5 February 2025 as preprint 2502.03544.[^1][^2]
Compared with AlphaGeometry 1, AlphaGeometry 2 retains the same broad neuro-symbolic architecture, in which a language model proposes auxiliary geometric constructions and a symbolic deduction engine closes the proof, but it strengthens essentially every component. The auxiliary-construction proposer is now based on the Gemini family of language models rather than a transformer trained from scratch. The symbolic engine, renamed DDAR2, was rewritten in C++ and made roughly 300 times faster than the original DDAR while gaining new capabilities such as handling coincident "double" points. The synthetic training corpus was expanded both in complexity and balance, with proofs up to ten times longer than in AG1 and a roughly 50:50 mix of problems requiring and not requiring auxiliary constructions. Finally, a new search procedure called the Shared Knowledge Ensemble of Search Trees (SKEST) runs multiple beam searches in parallel and lets them share verified intermediate facts. On the IMO-AG-50 benchmark, a curated set of 50 geometry problems drawn from IMO papers between 2000 and 2024, AlphaGeometry 2 solves 84% of the problems (42 of 50), compared with 54% for AlphaGeometry 1, exceeding the average IMO gold-medalist solve count of 40.9.[^1][^3]
AlphaGeometry was introduced in a January 2024 Nature paper by Trinh, Wu, Le, He and Luong, titled "Solving olympiad geometry without human demonstrations." AG1 solved 25 of 30 problems in IMO-AG-30, a benchmark of 30 IMO geometry problems from 2000-2022, matching the average gold medalist on that subset. AG1 paired a small transformer language model, trained from scratch on roughly 100 million synthetic Euclidean geometry theorems and proofs, with a deductive symbolic engine called DDAR (Deductive Database Arithmetic Reasoning). DDAR computed the "deductive closure" of a problem statement using a fixed library of geometric rules, while the language model proposed auxiliary points or lines whenever DDAR alone could not finish the proof.[^4]
AG1's neuro-symbolic split exploited a fundamental property of olympiad geometry: most of the reasoning is mechanical chasing of equal angles, similar triangles, cyclic quadrilaterals and ratio identities, but the human-recognisable creative step is the introduction of an auxiliary construction, such as drawing a particular midpoint, foot of perpendicular, or circle. AG1 produced its synthetic training data without any human-written proofs by sampling random geometric configurations, computing their deductive closure, and then designating any nontrivially-implied fact as a "theorem" with the random configuration as the (possibly auxiliary-point-laden) proof. Despite reaching near-gold performance on IMO-AG-30, AG1 had three well-documented shortcomings. First, its domain-specific language could not express common families of olympiad problems involving moving points (locus problems), arbitrary linear combinations of angles or distances, or even certain compound statements; AG1 covered only about 66% of IMO geometry problems from 2000-2024. Second, the language model was small and trained from scratch, so it generalised poorly outside the synthetic distribution. Third, DDAR ran in pure Python with worst-case complexity as high as O(N^8) for similar-triangle search, making the symbolic step the dominant bottleneck of the pipeline.[^1][^4]
These limitations defined the goals for AlphaGeometry 2.
On 25 July 2024, Google DeepMind announced via its blog and on social media that two new systems, AlphaProof and AlphaGeometry 2, had jointly produced solutions to four of the six problems on the official 2024 IMO paper, held in Bath, United Kingdom that month. The post, titled "AI achieves silver-medal standard solving International Mathematical Olympiad problems," reported a total score of 28 out of 42 points. The 2024 silver-medal cutoff was 22 points and the gold-medal cutoff was 29 points, leaving the combined DeepMind system one point short of gold and well above the silver threshold. The systems were evaluated under conditions that broadly mirrored the human competition: they were given the official problem statements only after the contest, but the time limits were relaxed and the natural-language problem statements were first hand-translated by mathematicians into formal input for the systems.[^2]
Of the six 2024 problems, AlphaProof (a separate system that combined a Gemini language model with a Lean formal theorem prover and AlphaZero-style reinforcement learning) addressed the algebra and number-theory problems. AlphaProof solved problems 1, 2 and 6, scoring full marks on each. AlphaGeometry 2 was deployed against problem 4, the lone geometry problem, and produced a correct proof in 19 seconds of compute after the problem was translated into the AlphaGeometry domain-specific language. The remaining two problems, both combinatorial, were not solved. Each of the four solved problems received the full seven points from the IMO's marking judges, two of whom were past IMO problem-setters, yielding the 4 x 7 = 28 point total.[^2]
The 2024 IMO geometry problem (problem 4) asked solvers to consider a triangle ABC with incenter I, to place auxiliary points using prescribed tangency and parallelism conditions, and to prove that two angle sums are supplementary. AlphaGeometry 2 found a synthetic proof by constructing a new point E on line BI such that the angle AEB is a right angle. This auxiliary point generated two pairs of similar triangles whose corresponding angles, when chained through the deductive engine, closed the supplementary-angle identity without resorting to coordinates or trigonometry.[^1][^2]
The 2024 IMO joint result was the first public demonstration of an AI system reaching silver-medal performance on an unseen IMO contest. The 84% historical solve rate on the IMO-AG-50 benchmark, reported a few months later, also marked the first published claim of an AI system surpassing the average IMO gold medalist (40.9 solved problems out of 50) on a substantial geometry benchmark.[^1][^2]
The AlphaGeometry 2 paper identifies five categories of improvement over AG1: the domain-specific language, the synthetic data, the symbolic engine, the language model, and the search algorithm. Each is summarised in this section and detailed in dedicated sections below.
The domain language was extended to express two new families of statements that were previously out of reach. The first is "locus" problems, in which one or more points, lines or circles move continuously and the assertion is that a derived object satisfies some property. AG2 introduced explicit constructs for declaring a free movement and for asserting a property at the limit. The second is arbitrary linear equations of angles, ratios or directed lengths, which previously had to be hand-decomposed into pairwise equalities. Combined, these extensions raised the language's coverage of historical IMO geometry problems from approximately 66% to 88%.[^1]
The synthetic data pipeline was rebuilt to produce a larger, more diverse and better-balanced corpus. AG2 explored random geometric diagrams of roughly twice the size of AG1's diagrams, generated theorems with up to twice as many points and premises, and produced proofs with up to ten times as many deductive steps. The corpus was rebalanced to contain roughly equal numbers of problems requiring and not requiring auxiliary points, compared with a roughly 9% / 91% split in AG1. The total corpus reached approximately 300 million theorem-proof pairs (Trinh et al. reported around 100 million for AG1; AG2's pipeline produced roughly three times as many).[^1][^3]
The symbolic engine was renamed DDAR2 and refactored into C++. New algorithms reduced the worst-case complexity of detecting similar triangles and cyclic quadrilaterals from O(N^8) to roughly cubic time, while the core Gaussian-elimination loop used by the arithmetic-reasoning sub-engine was rewritten directly in C++. End-to-end, DDAR2 runs about 300 times faster than DDAR on the benchmark suite, reducing the average per-problem symbolic-engine call from roughly 1,180 seconds to about 3.4 seconds. DDAR2 also handles a new family of "double-point" configurations in which two independently-named points happen to coincide.[^1]
The language model was replaced by a Gemini-based sparse mixture-of-experts transformer trained at a much larger scale than AG1's bespoke architecture. The authors experimented with several variants, including training from scratch with a geometry-specific tokenizer, fine-tuning a math-specialised pretrained Gemini checkpoint, and a multimodal variant that received the geometry diagram as an image in addition to the formal premises.[^1]
The search algorithm is no longer a single beam search. AG2 introduced the Shared Knowledge Ensemble of Search Trees (SKEST), a procedure in which several beam searches with different hyperparameters and auxiliary-construction strategies run in parallel and write any deductively-verified intermediate facts to a shared knowledge base that all the other beams can read. The improved search and language model together raise the pure language-model contribution to the overall solve rate from about 42 percentage points to about 70 percentage points (the remaining gap is closed by DDAR2's improved deductive coverage).[^1]
DDAR2 is the algorithmic descendant of DDAR, the deduction engine of AlphaGeometry 1. Both engines compute the deductive closure of a set of geometric premises with respect to a fixed library of geometric rules: equal angles produce parallel lines, parallel lines combined with transversals produce equal angles, three concyclic points combined with a fourth equidistant point produce a circle, and so on. The closure is built bottom-up: the engine repeatedly scans all current facts, looks for rule premises, derives new facts and continues until either the goal is proved or no new facts are produced.[^1]
DDAR2 makes four important changes to this base algorithm. First, it adds a separate "double-point" sub-procedure that allows the engine to reason about distinct point names that turn out to refer to the same geometric point. The procedure works by attempting to prove that two named points both lie on the same two lines, in which case the engine concludes that the points coincide; once equated, all of their named-property facts are merged. This case occurs frequently in olympiad problems that ask to show three lines are concurrent or three points are collinear, because the standard proof technique is to construct one of the points twice in different ways and prove the two constructions agree.[^1]
Second, DDAR2 replaces DDAR's naive enumeration of similar-triangle pairs with a hash-based detection step. The engine iterates once over all triples of points, computes a canonical "shape signature" for each triple (essentially a sorted tuple of three normalised angle measures), and detects a similar pair the moment any signature recurs. The resulting complexity drops from O(N^8) in DDAR to roughly cubic in the number of points. A similar hashing trick is applied to cyclic-quadrilateral detection.[^1]
Third, DDAR2's arithmetic-reasoning sub-engine, which closes the deductive loop for facts that involve numerical equalities of angles, ratios or distances, was rewritten in C++. The sub-engine uses a Gaussian-elimination procedure to determine whether a target linear equation follows from a set of premise equations. In DDAR1 this loop was implemented in Python and consumed the bulk of running time; the C++ rewrite delivers an end-to-end speedup of more than two orders of magnitude on the benchmark.[^1]
Fourth, DDAR2 was made compatible with the expanded AlphaGeometry domain language, including locus statements and arbitrary linear equations of geometric quantities. The engine's rule library was correspondingly extended, although the bulk of the new rules can be derived mechanically from the existing equality-and-parallelism rules combined with the linear-algebraic representation of geometric quantities.[^1]
On the IMO-AG-50 benchmark, DDAR2 alone (without the language model) solves 16 of the 50 problems, compared with 14 for DDAR1. The new symbolic capabilities are most visible in problems that involve degenerate or coincident configurations, where AG1's engine could not even reason about the question.[^1]
The AlphaGeometry 2 language model is built on the Gemini family of foundation models, specifically a sparse mixture-of-experts transformer architecture. The authors evaluated three training regimes. The first regime trained a Gemini-architecture model from scratch with a custom tokenizer matched to the geometry domain language, using only the synthetic corpus. The second regime fine-tuned a pretrained Gemini math model on the synthetic geometry corpus, transferring general mathematical pretraining into the formal-geometry domain. The third regime added a multimodal variant in which the model received an image rendering of the geometric diagram alongside the formal premise list.[^1]
The model's job at inference is to read a formalised geometry problem and emit a sequence of auxiliary-point constructions, one per "step," with the symbolic engine called after each step to test whether the resulting deductive closure contains the goal. Each auxiliary construction is a single line in the AlphaGeometry domain language, such as "let X be the foot of the perpendicular from A to line BC." If DDAR2 closes the proof, the search terminates; otherwise, the language model continues to the next step.[^1]
Inference uses top-k sampling with k = 32 and temperature 1.0, rather than greedy decoding. Sampling diversity is essential because most auxiliary points proposed by the model do not lead to a successful proof; the search needs many independent attempts. The authors report that a beam size of 128 with beam depth 4 and 32 samples per beam-step is roughly optimal: increasing any of these parameters further does not yield additional solved problems.[^1]
Interestingly, the multimodal model that received the diagram image did not, on its own, solve more problems than the text-only model. The authors interpret this as evidence that the language model already encodes enough geometric structure from the formal premise list. The multimodal model nonetheless contributed in ensemble configurations, where its solutions only partly overlapped those of the text-only model.[^1]
AlphaGeometry 2 retains the same general philosophy as AlphaGeometry 1 for synthetic training data generation: sample random geometric configurations, compute their deductive closures using the symbolic engine, and label any nontrivially implied conclusion as a "theorem" whose proof is the random configuration plus the closure derivation. The pipeline does not use any human-written proofs.[^1][^4]
What changed from AG1 to AG2 is the scale, diversity and balance of the resulting corpus. AG2 explores random diagrams roughly twice the size of AG1's diagrams (in terms of number of points), generates theorems with up to twice as many points and premises, and admits proofs up to ten times as long. The synthetic corpus reaches approximately 300 million theorem-proof pairs, compared with roughly 100 million reported for AG1. A second axis of change is balance: AG1's pipeline produced theorems with auxiliary-construction-requiring proofs only about 9% of the time, since most random sampled configurations have short, mechanical closures. AG2 explicitly oversamples auxiliary-rich theorems by detecting in post-processing whether the proof closure used a point that was not in the original premise set, producing a corpus that is roughly 50% auxiliary-construction problems.[^1]
A third innovation is the inclusion of new problem types absent from AG1's corpus, including locus problems (where one input point moves continuously) and problems whose statements use general linear equations of angles or distances. These additions were necessary to match the new domain-language extensions and they account for much of the improved benchmark coverage.[^1]
The data-generation algorithm itself was also accelerated. AG2 uses a greedy minimal-point algorithm that processes points in reverse-topological order of the proof DAG, which the authors argue is asymptotically more efficient than AG1's exhaustive subset enumeration. The combination of expanded diagrams, balanced auxiliary distribution and efficient algorithms allows AG2's synthetic corpus to be produced with computational cost comparable to AG1's.[^1]
The AG1 search procedure was a single beam search over auxiliary-construction sequences, with DDAR called after every construction to test the goal. AG2 replaces this with the Shared Knowledge Ensemble of Search Trees (SKEST), a parallel algorithm that runs several beam searches simultaneously, each with different hyperparameters and construction strategies.[^1]
A SKEST configuration consists of several search-tree variants. One variant proposes a single auxiliary construction per node; another proposes a batch of multiple constructions per node. Some variants restrict construction types to a uniform distribution over the language's construction vocabulary; others use the language model's posterior directly. One configuration uses a deep, narrow tree (beam size 64, depth 10); another uses a shallow, wide tree (beam size 512, depth 4). Each variant runs in parallel on its own pool of TPU accelerators.[^1]
The "shared knowledge" component is the key contribution. When DDAR2 verifies a new intermediate fact inside any one search tree, that fact is filtered for "interestingness" (essentially, whether it is a nontrivial geometric relation rather than a routine premise) and written to a knowledge base that is broadcast to all the other trees. Other trees can then read the shared facts and use them as additional premises for their own DDAR2 calls. This communication mechanism allows facts established by an aggressive deep search to immediately benefit a complementary shallow search, and vice versa, without the cost of recomputing them.[^1]
The system is deployed on TPUv4 accelerators, with multiple replicas of each language-model variant serving auxiliary-construction proposals to a larger pool of DDAR2 workers that consume them. The asynchronous architecture allows the language-model and symbolic-engine components to scale independently. Inference for the IMO 2024 problem 4 took 19 seconds wall-clock, but the system can also run for hours on harder problems.[^1][^2]
The 2024 International Mathematical Olympiad took place in Bath, United Kingdom on 11-22 July 2024. Approximately 600 students from 108 countries competed, attempting six problems over two days (problems 1-3 on day one, problems 4-6 on day two). Each problem is worth 7 points for a maximum total of 42; in 2024, the gold-medal cutoff was 29 points, the silver cutoff 22, and the bronze cutoff 16.[^2]
DeepMind's combined system tackled the six problems as follows: AlphaProof addressed problems 1, 2, 3, 5 and 6 (the algebra, number-theory and combinatorics problems), while AlphaGeometry 2 addressed problem 4 (the geometry problem). AlphaProof was a separate neuro-symbolic system that paired a Gemini-based language model with a Lean formal theorem prover and used an AlphaZero-style reinforcement-learning loop to bridge the gap between informal mathematical reasoning and Lean's verifier. AlphaProof solved problems 1, 2 and 6, each for the full 7 points, but did not solve problems 3 and 5 within the allowed time. AlphaGeometry 2 solved problem 4 in 19 seconds, again for the full 7 points. The combined total was therefore 4 x 7 = 28 of a maximum 42 points, one point below the gold-medal cutoff.[^2]
A subtle point about the joint result is that the systems' outputs were not generated under contest time pressure. Human IMO contestants have 4.5 hours per day to attempt three problems; the DeepMind systems were allowed substantially longer: AlphaProof's longest single solution took roughly three days of compute, although AlphaGeometry 2's geometry proof took 19 seconds. A second caveat is that the problem statements were first translated into formal input languages (Lean for AlphaProof, the AlphaGeometry domain language for AG2) by human mathematicians. DeepMind explicitly noted both caveats in the announcement; the medal-equivalent claim refers to the final point total relative to the IMO marking scheme, not to a fully end-to-end natural-language pipeline.[^2]
The grading was performed by two mathematicians who were past IMO problem-setters, Professor Sir Timothy Gowers and Dr Joseph Myers, ensuring that the marks were comparable to those that the official IMO jury would have awarded.[^2]
The arXiv paper includes detailed walk-throughs of several solutions that highlight the system's ability to find non-obvious auxiliary constructions. In addition to IMO 2024 problem 4, where AG2 introduced the auxiliary point E on line BI with a right angle at AEB, the paper describes three further cases that have been frequently cited.[^1]
On IMO 2013 problem 3, AG2 constructed the midpoint D of the arc that contains B in the circumscribed circle of ABC, a non-symmetric and historically unconventional auxiliary point, and used it to prove the concyclicity of B, A_1, D and I_a. The authors note that even strong human solvers tend to attempt symmetric constructions first, while AG2's search procedure has no such bias.[^1]
On IMO 2014 problem 3, AG2 constructed a circumcenter that allowed it to prove a strictly stronger statement than the one asked, namely OH perpendicular to BD; the original goal followed as an immediate corollary. The example illustrates a frequently-observed pattern in AG2's output: the system sometimes finds a generalised intermediate identity that is easier to prove than the original goal.[^1]
On IMO Shortlist 2009 problem G7, AG2 introduced 12 auxiliary points, including several circumcenters and reflections, demonstrating that the system can manage construction sequences far deeper than a typical human approach. The paper notes that such proofs would generally be considered too unwieldy by olympiad judges, even if formally correct, but their existence shows that AG2's search reaches genuinely deep auxiliary configurations.[^1]
On the 50-problem IMO-AG-50 benchmark, the AG2 paper compares the system's solve count against the average problem-by-problem solve count of three classes of IMO medallists. The authors compute the average solve count by mapping each historical IMO geometry problem to the average score that students at each medal level achieved on that problem, then summing to a benchmark-level solve count. The reported reference points are 27.1 problems for the average bronze medallist, 33.9 for the average silver medallist and 40.9 for the average gold medallist. AG2 solves 42 of 50, slightly above the average gold medallist.[^1]
The same paper also reports a stripped-down baseline of DDAR2 alone (no language model), which solves 16 of 50, comparable to the score of a strong contestant who attempts only the easiest geometry questions. The TongGeometry system, a contemporary university-led project, solves 30 of 30 on the IMO-AG-30 subset; AG2 solves the corresponding 42 of 50 on the broader IMO-AG-50.[^1]
The AlphaGeometry 2 paper is explicit about several remaining limitations. Of the 50 IMO 2000-2024 geometry problems, 8 remain unsolved by AG2: 2 of these were attempted but not solved within the allotted compute budget, and 6 cannot even be expressed in the AlphaGeometry domain language. The unformalisable problems fall into categories the domain language still does not cover: three-dimensional geometry, inequalities, statements with non-linear equations of geometric quantities, and configurations involving countably many points.[^1]
A second limitation is that AG2 requires its input in the AlphaGeometry domain language; the natural-language problem statement must be translated by a human mathematician (or another model) before AG2 can attempt it. The paper notes that automating this auto-formalisation step is an explicit subject of ongoing work, and reports preliminary experiments with Gemini-based auto-formalisers.[^1]
A third limitation is that, although AG2 surpasses the average gold medallist on the IMO-AG-50 benchmark, the comparison is to the population of problem solves on past problems; it does not claim that AG2 would have won a gold medal at a specific IMO in real-time conditions, since AG2 was not under the IMO's 4.5-hour daily time limit. In the 2024 IMO, AG2's geometry proof did take only 19 seconds, but its overall combined-system performance with AlphaProof was achieved with much longer compute budgets and with hand-formalised problem statements.[^1][^2]
Finally, AG2 is a domain-specific system: its language model emits constructions only in the AlphaGeometry domain language, its symbolic engine reasons only about Euclidean plane geometry, and the system has no straightforward way to generalise to other mathematical fields. The companion AlphaProof system, which uses Lean as its proof assistant, is a step toward a more general system, but the two systems remain separate in the 2024 announcement.[^2]
The full technical description of AlphaGeometry 2 was posted to arXiv on 5 February 2025 as preprint 2502.03544, with the title "Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2." The authorship list comprises Yuri Chervonyi, Trieu H. Trinh, Miroslav Olsak, Xiaomeng Yang, Hoang Nguyen, Marcelo Menegali, Junehyuk Jung, Junsu Kim, Vikas Verma, Quoc V. Le and Thang Luong. The first author is Yuri Chervonyi, and Trieu Trinh, who was first author of the AG1 Nature paper, is second; the senior authors are Quoc Le and Thang Luong of Google DeepMind. The paper has been revised twice on arXiv: v2 appeared on 28 February 2025 and v3 on 8 December 2025; the final v3 spans 34 pages with 16 figures.[^1]
In addition to documenting AG2's architecture, the paper introduces the IMO-AG-50 benchmark (extending AG1's IMO-AG-30 benchmark by adding all IMO geometry problems from 2023 and 2024 plus a wider trawl of 2000-2024 problems) and reports detailed ablations isolating the contribution of each AG2 component. The ablations show that the synthetic-data scale and balance changes alone improve the solve rate from 54% to around 70%, the Gemini-based language model adds a further 7-9 percentage points, DDAR2's speed enables substantially larger search budgets that yield several more solved problems, and SKEST's knowledge sharing closes the gap to 84%.[^1]
A companion code release, the alphageometry2 GitHub repository at github.com/google-deepmind/alphageometry2, contains the DDAR2 symbolic engine implementation (in Python with NumPy as the only dependency) and a suite of 25 IMO test problems demonstrating its solve capability. The neural network component is not released publicly; the repository explicitly states that it is "not an official Google product."[^5]
As of mid-2026, the AlphaGeometry 2 work has not been published as a peer-reviewed Nature article (unlike AlphaGeometry 1, which was published in Nature in January 2024), although the related AlphaProof system was the subject of a Nature paper in early 2026. The AlphaGeometry 2 results remain documented through the DeepMind blog announcement and the arXiv preprint.[^1][^2]