AI Co-Mathematician
Last reviewed
Jun 3, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,622 words
AI Co-Mathematician is an interactive, agentic research system built by Google DeepMind to help professional mathematicians work on open-ended research problems. Rather than answering a question in a single response, it behaves like a shared workbench: a team of specialized AI agents searches the literature, writes and runs code, attempts proofs, and tracks dead ends while the mathematician steers the investigation. DeepMind described it in a paper titled "AI co-mathematician: Accelerating mathematicians with agentic AI," posted to arXiv on 7 May 2026 (arXiv:2605.06651) and led by Daniel Zheng, with Alex Davies and Pushmeet Kohli among the senior authors.[1][2] The system reportedly scored 48% on the hardest tier of the FrontierMath benchmark, the highest result among AI systems evaluated at the time.[1][3]
The core idea is a shift away from the chatbot pattern. Most large language model tools take a prompt and return a finished answer. Research mathematics rarely works that way. A problem might sit unsolved for months, with the mathematician chasing one idea, abandoning it, trying a computation, reading an old paper, and circling back. The AI co-mathematician was built around that messy, iterative reality instead of around clean question-and-answer exchanges.[2]
DeepMind compares the design to agentic coding environments such as Claude Code, which give an AI the scaffolding to work autonomously over long stretches while staying steerable.[4] The mathematician does not hand over the problem and wait. They set goals, inspect intermediate work, correct the agents when they go astray, and pick up promising threads the system surfaces. The paper frames this as an interactive paradigm for AI-assisted discovery, where the human stays in the loop throughout rather than judging a single end-to-end output.[2]
The system is built on Gemini models. Most of the internal agents run on Gemini 3.1 Pro, and the dedicated prover agent can call on Gemini Deep Think for harder proof attempts.[2]
The workspace is stateful and asynchronous, which is the part that makes it feel less like a single model and more like a small research group. Work persists between sessions. Multiple agents run in parallel, and the mathematician can check in on any of them without blocking the others.[2]
At the top sits a project coordinator agent. It refines what the user actually wants through dialogue, breaks the work into goals, and delegates those goals to workstream coordinator agents that carry out linear sequences of actions.[2][4] Underneath the coordinators are specialized sub-agents. The agents share a common filesystem and talk to each other through an internal messaging system, so a result from the coding agent can feed the prover, and a reference dug up by the literature agent can reshape a proof strategy.[2]
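The delegation pattern described above can be illustrated with a toy model. This is a minimal sketch, not DeepMind's implementation: `Workspace`, `coding_agent`, `prover_agent`, `workstream`, and `project_coordinator` are hypothetical names, the "agents" are plain functions that write strings, and everything runs sequentially, whereas the paper describes agents running in parallel.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """Hypothetical shared state: a file store plus a message list."""
    files: dict[str, str] = field(default_factory=dict)
    messages: list[tuple[str, str, str]] = field(default_factory=list)

    def post(self, sender: str, recipient: str, body: str) -> None:
        self.messages.append((sender, recipient, body))

    def take(self, recipient: str) -> list[tuple[str, str]]:
        """Remove and return (sender, body) pairs addressed to `recipient`."""
        mine = [m for m in self.messages if m[1] == recipient]
        self.messages = [m for m in self.messages if m[1] != recipient]
        return [(sender, body) for sender, _, body in mine]


def coding_agent(ws: Workspace, goal: str) -> None:
    # Stand-in for an experiment: write a result file, then notify the prover.
    path = f"{goal}/evidence.txt"
    ws.files[path] = f"numerical evidence for: {goal}"
    ws.post("coder", "prover", path)


def prover_agent(ws: Workspace, goal: str) -> None:
    # Reads the files that other agents announced through the message list.
    notes = [ws.files[path] for _, path in ws.take("prover")]
    ws.files[f"{goal}/proof_draft.txt"] = f"draft for {goal}, using {len(notes)} note(s)"


def workstream(ws: Workspace, goal: str, steps) -> None:
    """A workstream coordinator runs its steps as a linear sequence."""
    for step in steps:
        step(ws, goal)


def project_coordinator(ws: Workspace, goals: list[str]) -> None:
    """Delegates each goal to its own workstream."""
    for goal in goals:
        workstream(ws, goal, [coding_agent, prover_agent])


ws = Workspace()
project_coordinator(ws, ["bound the constant", "check small cases"])
for path in sorted(ws.files):
    print(path, "->", ws.files[path])
```

The point of the sketch is the coupling: the prover never calls the coder directly. It only sees what was left in the shared file store and announced in a message, which is the mechanism the paper credits for letting one agent's result reshape another's work.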
The named components map onto how a mathematician actually works:
| Capability | What the agents do |
|---|---|
| Ideation | The coordinator turns a vague research goal into concrete subproblems and candidate approaches |
| Literature search | A review sub-agent hunts for relevant papers, including overlooked or obscure references |
| Computational exploration | A coding agent writes and runs code to test conjectures and gather numerical evidence |
| Theorem proving | A prover agent, able to use Gemini Deep Think, attempts formal and informal proofs |
| Theory building | The system assembles partial results into larger arguments and structured write-ups |
| Review | Reviewer agents persist across review rounds and flag flawed work before it is presented |
A detail worth dwelling on is the review loop. Reviewer agents do not vanish after one pass. They stay across rounds, which the authors say keeps the system from quietly declaring a broken proof finished.[2] In a field where a single wrong lemma can sink an entire argument, that persistence matters more than it might in other domains.
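Why persistence matters can be shown with a toy model. The sketch below is an illustration under stated assumptions, not the paper's mechanism: a draft is reduced to a dictionary mapping lemma names to whether their proofs check out, and `Reviewer`, `review_loop`, and the `revise` callback are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Reviewer:
    """Hypothetical reviewer that remembers its objections across rounds."""
    open_issues: set[str] = field(default_factory=set)

    def review(self, draft: dict[str, bool]) -> set[str]:
        """`draft` maps each lemma name to whether its proof checks out."""
        # An earlier objection stays open until that lemma actually checks out.
        # Deleting the lemma from the draft does not close the issue.
        still_open = {name for name in self.open_issues if not draft.get(name, False)}
        newly_found = {name for name, ok in draft.items() if not ok}
        self.open_issues = still_open | newly_found
        return set(self.open_issues)


def review_loop(draft, revise, max_rounds=3):
    reviewer = Reviewer()  # the same instance is reused in every round
    for round_no in range(1, max_rounds + 1):
        issues = reviewer.review(draft)
        if not issues:
            return "accepted", round_no
        draft = revise(draft, issues)
    return "unresolved", max_rounds


def fix(draft, issues):
    """A reviser that genuinely repairs every flagged lemma."""
    return {**draft, **{name: True for name in issues}}


def hide(draft, issues):
    """A reviser that just deletes the flagged lemmas."""
    return {name: ok for name, ok in draft.items() if name not in issues}


start = {"lemma_A": True, "lemma_B": False}
print(review_loop(start, fix))
print(review_loop(start, hide))
```

With a reviewer that forgot its earlier objections, the `hide` strategy would pass on the second round because the revised draft contains no failing lemma. The persistent reviewer keeps `lemma_B` open, so the loop ends as unresolved instead of accepting a draft with a hole in it.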
The output is not a chat transcript. The system produces compiled, reviewed LaTeX write-ups, which DeepMind calls native mathematical artifacts.[2] These documents are meant to be audited, not just read. Margin annotations link individual claims back to the workspace, so a reader can trace where a particular bound came from. The paper gives an example annotation noting that a pruning heuristic was derived from a user suggestion and that a baseline value of 2.2195 was sourced from a specific arXiv paper.[2] The write-ups also explain the research process that led to a result, including which paths failed, rather than presenting only the polished final answer.[2]
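One way to picture such an annotation is as a small record attached to a claim. The following is a sketch only: the `Annotation` fields, the `origin` categories, and the choice of LaTeX's `\marginpar` command for rendering are assumptions made for illustration, since the paper's actual annotation format is not described here. The arXiv identifier is left unspecified because the source does not give it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    claim: str      # the statement in the write-up being annotated
    origin: str     # hypothetical category, e.g. "user" or "literature"
    reference: str  # where in the workspace or literature the claim came from


def margin_note(a: Annotation) -> str:
    """Render one annotation as a LaTeX margin note (assumed format)."""
    return f"\\marginpar{{{a.origin}: {a.reference}}}"


notes = [
    Annotation("pruning heuristic", "user", "derived from a user suggestion"),
    Annotation("baseline value 2.2195", "literature", "arXiv paper (identifier not given here)"),
]
for note in notes:
    print(note.claim, "=>", margin_note(note))
```

Keeping the origin separate from the claim is what makes the document auditable: a reader can filter for everything that rests on a user suggestion or on an external paper and check those links first.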
The benchmark claim is the headline number, so it is worth stating precisely. FrontierMath is a math benchmark maintained by Epoch AI. Its Tier 4 set is the hardest layer: 50 research-level problems (2 public, 48 private) crafted and vetted by professional mathematicians, with some problems taking experts days to solve. Epoch AI has described Tier 4 as designed to surpass Tier 3 in difficulty, with some problems potentially remaining unsolved by AI for years.[3][5]
On that set, the AI co-mathematician scored 48%, which the paper reports as 23 of the 48 scored problems and as a new high among all AI systems evaluated.[1][2] The agentic setup more than doubled the standalone model's score: Gemini 3.1 Pro on its own reached about 19% on the same problems.[1][6] Reported comparison figures placed it ahead of GPT-5.5 Pro at 39.6%, GPT-5.4 Pro at 37.5%, and Claude Opus 4.7 and Claude Opus 4.6, both at 22.9%.[1][6] DeepMind also noted that the system solved three problems no previously evaluated system had cracked.[2]
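The arithmetic behind these figures is easy to check. The short script below recomputes the headline percentage from the reported 23 of 48 and the ratio to the roughly 19% baseline. The `implied_count` helper is an inference, not a reported figure: it assumes the comparison percentages were measured on the same 48 scored problems and asks which whole-number solve count would round to each one.

```python
SCORED_PROBLEMS = 48  # Tier 4 problems that were scored
SOLVED = 23           # reported for the AI co-mathematician

accuracy = 100 * SOLVED / SCORED_PROBLEMS
print(f"co-mathematician: {accuracy:.1f}%")

# The standalone baseline is given only as "about 19%", so this ratio is approximate.
BASELINE_PERCENT = 19.0
print(f"ratio to standalone model: {accuracy / BASELINE_PERCENT:.2f}x")


def implied_count(percent, total=SCORED_PROBLEMS):
    """Whole-number solve count whose percentage rounds to `percent`, if any."""
    for solved in range(total + 1):
        if round(100 * solved / total, 1) == percent:
            return solved
    return None


# Assumption: these percentages are out of the same 48 problems.
for name, percent in [("GPT-5.5 Pro", 39.6), ("GPT-5.4 Pro", 37.5), ("Claude Opus", 22.9)]:
    print(f"{name}: {percent}% is consistent with {implied_count(percent)} of {SCORED_PROBLEMS}")
```

Each comparison percentage matches a whole number of problems out of 48, which is consistent with all systems having been scored on the same set. With so few problems, a single extra solve moves a score by about two percentage points, so small gaps between systems should be read loosely.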
One caveat belongs here for honesty's sake. On 11 May 2026 Epoch AI announced an AI-assisted review of FrontierMath Tiers 1 through 4 that flagged possible fatal errors in roughly a third of problems.[5] That does not erase the result, but it is a reminder that even carefully built research benchmarks are works in progress, and headline percentages on extremely hard problem sets deserve a little caution.
Benchmarks aside, the more striking evidence came from working mathematicians using the system on genuinely unsolved problems. Marc Lackenby, a topologist at the University of Oxford, used it to resolve Problem 21.10 from the Kourovka Notebook, an open compendium of group theory problems maintained in Novosibirsk since 1965.[2][4] The problem asked whether every finite group admits a "just finite presentation," a finite presentation in which removing any single relation makes the group infinite. The answer turned out to be yes.[4]
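The definition can be written out in symbols. The block below is a formalization of the verbal description above, with notation chosen for this article rather than taken from the Kourovka Notebook or the paper, followed by the simplest example.

```latex
\textbf{Definition (sketch).} Let $G$ be a finite group with a finite presentation
\[
  G \cong \langle x_1, \dots, x_m \mid r_1, \dots, r_n \rangle .
\]
The presentation is \emph{just finite} if, for every $i \in \{1, \dots, n\}$, the group
\[
  G_i = \langle x_1, \dots, x_m \mid r_1, \dots, r_{i-1}, r_{i+1}, \dots, r_n \rangle
\]
obtained by deleting the relation $r_i$ is infinite.

\textbf{Example.} For $k \ge 1$, the cyclic group $\mathbb{Z}/k$ has the presentation
$\langle a \mid a^k \rangle$. Deleting its only relation leaves the free group
$\langle a \mid \ \rangle \cong \mathbb{Z}$, which is infinite, so this presentation
is just finite.
```

The cyclic case is easy because there is only one relation to remove. The difficulty of the general problem is that every relation must be indispensable in this sense at once, for an arbitrary finite group.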
The path to the answer is the interesting part. The system's first proof attempt was flawed, and its reviewer agent caught the flaw rather than passing it off as correct. Lackenby then noticed a genuinely clever strategy buried inside the failed attempt and used it to repair the argument.[2][4] That is collaboration in a fairly literal sense. The machine did not hand over a finished theorem; it produced a flawed draft whose internal ideas were good enough for an expert to build on. Lackenby also remarked that the system works best when the user already knows the area well.[2]
Other testers used it on different fronts. Gergely Bérczi worked on log-concavity and positivity conjectures for Stirling coefficients of symmetric power representations, a problem he had previously attempted with AlphaEvolve. Semon Rezchikov applied it to a technical subproblem about perturbations of a class of Hamiltonian diffeomorphisms.[2]
DeepMind has been chipping away at AI for mathematics for years, and the co-mathematician sits on top of that lineage rather than replacing it. Earlier systems were narrow specialists. AlphaProof produces formally verified proofs, and AlphaGeometry solves olympiad geometry. Both were impressive but tightly scoped, aimed at competition-style problems with clean statements.[7]
The co-mathematician is a different kind of thing. It is a general workbench for the full, untidy arc of research, and it treats the older specialists as tools it can reach for. The paper notes that a formal prover such as AlphaProof could be deployed dynamically inside the interactive loop to raise confidence in a result, and that AlphaEvolve inspired its evolutionary iterators for algorithmic search.[2] So the trajectory looks less like one system superseding another and more like DeepMind stacking narrow capabilities under a broader, human-facing agentic layer.
I keep coming back to the Lackenby story, because it reframes what "AI solving a problem" even means. The system did not solve Problem 21.10 by itself. It produced a wrong proof with a right idea inside it, and a human expert did the rest. That is a more modest and probably more honest picture of where this technology stands than a headline about a problem from a 60-year-old compendium falling to AI.
What seems genuinely new is the format. Treating mathematics as a long-horizon, auditable, multi-agent process, with persistent reviewers and traceable artifacts, is a real departure from one-shot chatbot answers, and it is closer to how mathematicians actually work.[2][8] Whether the 48% figure holds up after Epoch AI's benchmark review is a separate question. Either way, the more durable contribution may be the workbench idea: AI as a research collaborator that argues, gets things wrong, gets caught, and occasionally leaves behind an idea worth keeping.