AI co-scientist (Google)


Updated Jun 3, 2026


The AI co-scientist is a multi-agent artificial intelligence system from Google, built on the Gemini 2.0 family of models, that is designed to help expert researchers generate novel scientific hypotheses, research proposals, and experimental plans. Google Research unveiled it on February 19, 2025, presenting it as a collaborative tool intended to work alongside scientists rather than replace them, with an initial focus on biomedicine. The accompanying technical report, "Towards an AI co-scientist," was posted to arXiv on February 26, 2025 by Juraj Gottweis, Wei-Hung Weng, and colleagues.^[1]^[2]

The system was positioned by Google as a research prototype rather than a finished product. It was made available to a small set of organizations through a Trusted Tester program, and Google described it as a step toward "AI-assisted" rather than autonomous science.^[1]^[3]

Background

The AI co-scientist emerged from work spanning Google Research, Google Cloud, and Google DeepMind, with external collaborators including the Fleming Initiative and Imperial College London, Stanford University School of Medicine, and Houston Methodist.^[1]^[4] Its stated motivation is the gap between the exponential growth of published research and a scientist's limited capacity to read, synthesize, and connect findings across disciplines. Google framed the tool as a way to surface non-obvious connections and to compress the early, exploratory phase of hypothesis generation.^[1]

The design draws explicitly on the Gemini line of large language models and on reasoning techniques that allocate additional computation at inference time. Google describes the system as "designed to mirror the reasoning process underpinning the scientific method," using a structured loop of proposal, critique, and refinement.^[1]^[2]

Multi-agent architecture

Rather than relying on a single model call, the AI co-scientist coordinates several specialized agents, each built on Gemini 2.0 and each handling a distinct part of the reasoning process. A scientist supplies a research goal in natural language, optionally with constraints, preferences, and prior ideas, and the agents work together to produce ranked, reviewed hypotheses.^[1]^[2]

Agent	Role
Supervisor	Coordinates the other agents, parses the research goal into tasks, allocates compute, and decides when the process is complete
Generation	Produces initial hypotheses, searches the literature for relevant articles, and identifies testable assumptions
Reflection	Reviews proposals for correctness, quality, novelty, and safety, including simulated critique of proposed experiments
Ranking	Runs an Elo-style tournament of pairwise comparisons, using simulated scientific debate to decide which hypothesis is stronger
Proximity	Computes similarity between hypotheses to cluster related ideas and reduce redundancy
Evolution	Refines and combines top-ranked hypotheses, simplifies them, or generates divergent alternatives
Meta-review	Synthesizes patterns from reviews and debates, then feeds that summary back to improve later iterations

The agents run within what the paper calls an asynchronous task execution framework, which lets the system scale the amount of computation flexibly depending on the difficulty of the goal.^[2]^[4]
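The idea of an asynchronous task framework with a coordinating supervisor can be illustrated with a small sketch. This is not Google's implementation: the function names (`agent_worker`, `supervisor`, `run_supervisor`), the task strings, and the use of Python's `asyncio` queue are all assumptions chosen for illustration, and the model call is replaced by a no-op.

```python
import asyncio

async def agent_worker(name, queue, results):
    """Hypothetical worker: pulls tasks from a shared queue and records a result."""
    while True:
        task = await queue.get()
        try:
            await asyncio.sleep(0)  # stand-in for a model call
            results.append((name, task))
        finally:
            queue.task_done()

async def supervisor(goal, n_workers=3, n_tasks=6):
    """Hypothetical supervisor: splits a goal into tasks and runs them on n_workers."""
    queue = asyncio.Queue()
    results = []
    for i in range(n_tasks):
        queue.put_nowait(f"{goal} / subtask {i}")
    workers = [
        asyncio.create_task(agent_worker(f"worker-{w}", queue, results))
        for w in range(n_workers)
    ]
    await queue.join()  # wait until every queued task has been processed
    for w in workers:
        w.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results

def run_supervisor(goal, n_workers=3, n_tasks=6):
    """Synchronous entry point for the sketch."""
    return asyncio.run(supervisor(goal, n_workers, n_tasks))
```

In this sketch, raising `n_workers` is the only knob for spending more computation on a goal; a real system would presumably also vary how many tasks each agent role receives.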

How it works (the tournament loop)

The core method is summarized by Google as a "generate, debate, and evolve" approach. The Generation agent first drafts candidate hypotheses. The Reflection agent filters and critiques them. The Ranking agent then stages a tournament in which pairs of hypotheses are compared head to head: it simulates a debate between the two, and the winner gains Elo rating points while the loser's rating falls. The Evolution agent improves the strongest survivors, and the cycle repeats, producing what Google describes as a self-improving loop of increasingly high-quality outputs.^[1]^[2]
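The loop can be sketched in a few lines of Python. Everything here is a toy under stated assumptions: the standard Elo update with a K-factor of 32 and a starting rating of 1200 is assumed (the source says only "Elo-style"), the simulated debate is replaced by a comparison of hidden quality scores, Evolution is a random perturbation, and the Reflection filter is omitted. The names `judge`, `evolve`, and `run_tournament` are hypothetical.

```python
import itertools
import random

K = 32  # assumed Elo K-factor; the source does not specify one

def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(ratings, winner, loser, k=K):
    """Zero-sum update: the winner gains exactly what the loser gives up."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta

def judge(a, b, quality):
    """Placeholder for a simulated debate: the higher hidden quality wins."""
    return (a, b) if quality[a] >= quality[b] else (b, a)

def evolve(parent, quality, rng):
    """Placeholder for Evolution: derive a slightly perturbed child hypothesis."""
    child = f"{parent}.{len(quality)}"  # len(quality) grows, so names stay unique
    quality[child] = quality[parent] + rng.uniform(-0.05, 0.15)
    return child

def run_tournament(n_initial=4, rounds=3, seed=0):
    rng = random.Random(seed)
    # Generation stand-in: random hidden quality per hypothesis
    quality = {f"h{i}": rng.random() for i in range(n_initial)}
    ratings = {h: 1200.0 for h in quality}
    for _ in range(rounds):
        # Ranking: round-robin of pairwise comparisons
        for a, b in itertools.combinations(list(ratings), 2):
            winner, loser = judge(a, b, quality)
            update_elo(ratings, winner, loser)
        # Evolution: refine the current top-rated hypothesis
        best = max(ratings, key=ratings.get)
        ratings[evolve(best, quality, rng)] = 1200.0
    return ratings, quality
```

A round-robin is used here only for simplicity; with many hypotheses a real system would likely sample which pairs to compare rather than comparing all of them.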

This design leans on test-time compute scaling: instead of producing one answer, the system spends more inference compute iterating, debating, and self-critiquing. Google reported that hypothesis quality, as measured by the internal Elo metric, continued to improve as more compute was applied. The team also reported that higher Elo ratings correlated with a higher probability of correct answers on the GPQA "diamond" set, a benchmark of difficult graduate-level science questions, which they used as evidence that the auto-evaluation signal tracks real quality.^[2]^[5]
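For readers unfamiliar with Elo ratings, the standard chess formulation is shown here as a reference point. The source describes the metric only as "Elo-style," so the exact formula, the scale constant 400, and the K-factor are assumptions rather than details of Google's system.

```latex
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad
R_A' = R_A + K\,(S_A - E_A), \qquad
R_B' = R_B - K\,(S_A - E_A)
```

Here $R_A$ and $R_B$ are the current ratings of two hypotheses, $E_A$ is the expected score of A, and $S_A$ is 1 if A wins the comparison and 0 if it loses. As a worked example with an assumed $K = 32$: if both hypotheses start at 1200, then $E_A = 0.5$, and a win moves A to 1216 and B to 1184. A win over a higher-rated opponent yields a larger gain, which is why ratings that emerge from many comparisons can carry more information than any single judgment.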

The Elo tournament idea echoes self-play methods used in game-playing systems, where agents improve by competing against each other. Here the "game" is scientific argument, and the ranking emerges from many simulated debates rather than a single judgment.^[2]^[3]

Validation examples

Google tested the system in three biomedical areas and reported wet-lab follow-up for each. These results were presented as validations of an early prototype, and several were partial or based on cell-line experiments rather than clinical outcomes.^[1]^[2]

Drug repurposing for acute myeloid leukemia (AML). The system proposed existing drugs that might be repurposed against AML. In laboratory testing, three of five suggested compounds inhibited the viability of AML cell lines at clinically relevant concentrations.^[1]^[6]
Novel target discovery for liver fibrosis. The system proposed epigenetic targets, and follow-up experiments in human hepatic organoids reported anti-fibrotic activity; in one account, two of three tested treatments showed a significant effect without observed toxicity.^[2]^[6]
Mechanisms of antimicrobial resistance. Working with researchers at Imperial College London and the Fleming Initiative, the system was asked why certain mobile genetic elements spread between bacterial species. It independently proposed that capsid-forming phage-inducible chromosomal islands (cf-PICIs) can interact with the tails of diverse phages to broaden their host range. This matched a hypothesis the lab, led by José Penadés, had already reached through roughly a decade of experimental work but had not yet published. The system reportedly reproduced the conclusion within about two days and offered additional plausible hypotheses.^[1]^[7]

The antimicrobial-resistance case drew the most attention in the press, often summarized as the AI "solving in two days" a problem that had taken scientists years.^[7]

Limitations and reception

Coverage was a mix of enthusiasm and caution. Because the cf-PICI hypothesis had already been established experimentally by the Imperial College team before Google ran the test, commentators noted that the result is better described as a re-discovery or independent corroboration than as a wholly new finding the AI uncovered on its own. The lab's earlier, unpublished work meant the conclusion existed before the system was prompted, even though that work was not in the public literature the model could have read.^[7]^[8]

A detailed critique by the DrugDiscovery.NET blog argued that none of the three cases amounted to a clinical "breakthrough." It pointed out that the AML results were measured in cell lines rather than tumors or living organisms, that several relevant compounds had known prior activity in similar settings, and that key liver-fibrosis details were described as forthcoming rather than fully disclosed at announcement. The author called the engineering "technically fascinating" while concluding that the drug-discovery validation fell short of demonstrated novelty.^[8]

Google itself listed limitations, including the need for stronger literature review, better factuality checking, integration with external verification tools, and larger evaluations involving more subject-matter experts. Observers also stressed the framing in the name: it is a co-scientist meant to augment human researchers who design and run the confirming experiments, not an autonomous "AI scientist."^[1]^[3]^[9]

Availability

At launch the AI co-scientist was not a public product. Google opened a Trusted Tester program for research organizations, describing a community that ranged from PhD students to industry researchers and Nobel laureates, intended to stress-test the system on real problems. The tool was presented as part of a broader Google effort to apply Gemini to scientific research.^[1]^[3] Google later expanded access and folded co-scientist into wider "Gemini for Science" and enterprise R&D offerings, but the original February 2025 announcement concerned the research prototype and its Trusted Tester pipeline.^[9]

References


Related Articles

DolphinGemma

AlphaEvolve

FutureHouse

Agent2Agent Protocol

Jules (Google)

Gemini CLI
