AlphaCode 2

Updated Jun 3, 2026

AlphaCode 2 is a competitive-programming system built by Google DeepMind that uses a fine-tuned version of the Gemini family of language models to generate, filter, and rank candidate solutions to algorithmic contest problems. It was unveiled on December 6, 2023, the same day Google announced Gemini, and is the successor to AlphaCode, the 2022 system that first reached the level of a median human competitor ^[1]^[2]. AlphaCode 2 was released as a research result rather than a product. DeepMind reported that on Codeforces, the platform used to evaluate the original system, AlphaCode 2 solved close to twice as many problems as its predecessor and performed better than an estimated 85 percent of contestants ^[3].

Background (AlphaCode)

The original AlphaCode, described by Yujia Li and colleagues in a 2022 paper in Science, was the first AI system to perform at a competitive level on programming contests. It worked by generating very large numbers of candidate programs and then narrowing them down through filtering and clustering before submitting a small selection. In simulated evaluations on recent Codeforces contests it placed in roughly the top 54 percent of participants, which DeepMind characterized as the level of a median competitor ^[4]. AlphaCode was a research demonstration and was never released as a product ^[3].

Competitive programming is a demanding test for code-generating systems because the problems are open-ended. Before any code is written, a solver has to read a natural-language description, reason about it, and design an algorithm that fits tight time and memory limits. DeepMind frames this as a benchmark for advanced reasoning rather than ordinary software engineering, and it is one reason general-purpose models had performed poorly on contest problems ^[1].

The system (Gemini-based pipeline)

AlphaCode 2 keeps the overall shape of its predecessor but rebuilds every component on top of Gemini Pro, the mid-tier model in Google's first Gemini generation. DeepMind describes adopting Gemini as the foundation model for all components as the key change behind the improved performance ^[1].

The starting point is the Gemini Pro model, to which the team applies two consecutive rounds of fine-tuning using the GOLD training objective. The first round uses an updated version of the CodeContests dataset, containing roughly 15,000 problems and 30 million human code samples; a second round adds a smaller, higher-quality dataset. Rather than relying on a single network, the process produces a family of fine-tuned policy models, because using several models with varied hyperparameters increases the diversity of generated code, which matters for solving hard problems ^[1].

At inference time the system runs a multi-stage search and reranking pipeline ^[1]^[2]:

Sampling. The policy models generate up to a million code samples per problem. A randomized temperature is assigned to each sample to encourage diversity, and targeted metadata in the prompt, such as the problem's difficulty rating and tags, is randomized as well. Unlike the original system, which sampled in both Python and C++, AlphaCode 2 samples only in C++ because the team found those samples to be higher quality.
Filtering. Each problem ships with at least one public input/output test. Every sample is executed on that test; programs that do not produce the expected output are discarded, as are the fewer than 5 percent of samples that fail to compile. On average this step removes about 95 percent of candidates.
Clustering. Around 50,000 candidates typically survive filtering, far more than can be submitted. A separate model generates new test inputs; running the remaining samples on those inputs produces output signatures that are used to group semantically similar programs into clusters. The clusters are then ordered by size and the ten largest are kept, so that near-duplicate solutions are not submitted repeatedly.
Scoring and selection. A second Gemini Pro model, fine-tuned to predict a correctness score between 0 and 1, rates the surviving samples. The highest-scoring candidate from each of the ten retained clusters forms the final list of at most ten submissions ^[1].
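
The filtering, clustering, and selection stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DeepMind's implementation: a "program" is modeled as a plain Python callable from input text to output text, the generated test inputs are passed in as a list rather than produced by a model, and `score` is a hypothetical stand-in for the fine-tuned scoring model. All function names are invented for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# Assumption: a candidate "program" is a callable from input text to output text.
# The real system compiles and runs C++ sources instead.
Program = Callable[[str], str]


def passes_public_tests(program: Program, public_tests: Sequence[Tuple[str, str]]) -> bool:
    """Return True only if the program reproduces every public expected output."""
    for test_input, expected in public_tests:
        try:
            if program(test_input).strip() != expected.strip():
                return False
        except Exception:
            # A crash stands in for a compile or runtime failure.
            return False
    return True


def output_signature(program: Program, generated_inputs: Sequence[str]) -> Tuple[str, ...]:
    """Run the program on extra inputs; the tuple of outputs identifies its behavior."""
    outputs: List[str] = []
    for test_input in generated_inputs:
        try:
            outputs.append(program(test_input).strip())
        except Exception:
            outputs.append("<error>")
    return tuple(outputs)


def select_submissions(
    samples: Sequence[Program],
    public_tests: Sequence[Tuple[str, str]],
    generated_inputs: Sequence[str],
    score: Callable[[Program], float],  # hypothetical stand-in for the scoring model
    max_clusters: int = 10,
) -> List[Program]:
    """Filter by public tests, cluster by output signature, keep one program per large cluster."""
    survivors = [p for p in samples if passes_public_tests(p, public_tests)]

    clusters: Dict[Tuple[str, ...], List[Program]] = defaultdict(list)
    for program in survivors:
        clusters[output_signature(program, generated_inputs)].append(program)

    # Keep the largest clusters, then take the best-scored member of each.
    largest = sorted(clusters.values(), key=len, reverse=True)[:max_clusters]
    return [max(cluster, key=score) for cluster in largest]
```

Grouping by output signature means two programs with different source text but identical behavior on the generated inputs land in the same cluster, so only one of them takes up a submission slot.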

Results on Codeforces

DeepMind evaluated AlphaCode 2 on Codeforces, the same platform used for the original system. The team selected 12 recent contests with more than 8,000 participants each, drawn from Division 2 or the harder combined "1+2" division, giving a total of 77 problems. For each problem the system sampled one million candidates and submitted up to ten solutions until one was correct or the candidates ran out ^[1].
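
The submission rule described above (try up to ten ranked candidates and stop at the first accepted one) can be written down as a small sketch. The names `first_accepted_attempt` and `solve_rate` are hypothetical, and the `is_accepted` callback is an assumed stand-in for the Codeforces judge; nothing here reproduces DeepMind's evaluation harness.

```python
from typing import Callable, Optional, Sequence, Tuple

# Assumption: each problem is a pair of (ranked candidate solutions, judge callback).
Judge = Callable[[str], bool]
Problem = Tuple[Sequence[str], Judge]


def first_accepted_attempt(
    ranked_candidates: Sequence[str], is_accepted: Judge, budget: int = 10
) -> Optional[int]:
    """Return the 1-based attempt that was accepted, or None if the budget ran out."""
    for attempt, candidate in enumerate(ranked_candidates[:budget], start=1):
        if is_accepted(candidate):
            return attempt
    return None


def solve_rate(problems: Sequence[Problem], budget: int = 10) -> float:
    """Fraction of problems solved within the submission budget."""
    if not problems:
        return 0.0
    solved = sum(
        1
        for candidates, judge in problems
        if first_accepted_attempt(candidates, judge, budget) is not None
    )
    return solved / len(problems)
```

Under this rule a problem counts as solved if any of the first ten submissions passes, which is why the ordering produced by the ranking stage matters only within that budget.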

AlphaCode 2 solved 43 percent of these problems, close to a 2x improvement over the original AlphaCode's 25 percent on the same benchmark. Mapped to contest rankings, this puts AlphaCode 2 at an estimated 85th percentile on average, between the Codeforces "Expert" and "Candidate Master" tiers. The original AlphaCode was estimated at roughly the 46th percentile on this comparison. In the two contests where it did best, AlphaCode 2 outperformed more than 99.5 percent of participants ^[1].

| Metric | Original AlphaCode | AlphaCode 2 |
| --- | --- | --- |
| Problems solved (within 10 submissions, 77-problem set) | 25% | 43% |
| Estimated Codeforces percentile | ~46th | ~85th |
| Codeforces tier (approx.) | below median | Expert to Candidate Master |
| Best-case contests | not reported | >99.5th percentile |
| Sampling languages | Python and C++ | C++ only |

The report also measured how performance scaled with the number of samples. As with the original system, the solve rate rose roughly log-linearly with more samples, and AlphaCode 2 needed only about 100 samples per problem to match the level the original AlphaCode reached with a million. DeepMind described this as making the new system over 10,000 times more sample efficient. In an additional "AlphaCode 2 plus human" setting, where a person specifies extra filtering properties, the combined system scored above the 90th percentile ^[1].
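
The sample-efficiency figure follows from simple arithmetic on the two sample budgets quoted above. The sketch below only checks that arithmetic; the function name is hypothetical, and because the report describes the new budget as "about 100" samples, the result is an approximation rather than an exact measurement.

```python
import math


def sample_efficiency_gain(baseline_samples: int, new_samples: int) -> float:
    """How many times fewer samples the new system needs for the same solve rate."""
    if baseline_samples <= 0 or new_samples <= 0:
        raise ValueError("sample counts must be positive")
    return baseline_samples / new_samples


# One million samples for the original AlphaCode versus about 100 for AlphaCode 2.
gain = sample_efficiency_gain(1_000_000, 100)
orders = math.log10(gain)
print(f"{gain:,.0f}x fewer samples, about {orders:.0f} orders of magnitude")
```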

Comparison to AlphaCode

The two systems share a search-and-rerank philosophy, but the upgrades are substantial. The most visible change is the foundation model: where AlphaCode used a purpose-built encoder-decoder transformer, AlphaCode 2 fine-tunes Gemini Pro for both code generation and the scoring step, and DeepMind credits Gemini's flexibility for the gains on those two very different tasks ^[1]. The headline accuracy nearly doubled, from solving 25 percent of the benchmark problems to 43 percent, and the estimated ranking jumped from around the median to the 85th percentile ^[1]^[3].

Sample efficiency is the other large difference. Both systems can draw on up to a million samples per problem, but AlphaCode 2 reaches its predecessor's performance with roughly 100, an improvement of about four orders of magnitude ^[1]. The pipeline was also streamlined to sample only in C++ and to use a learned scoring model on top of the clustering stage. The numbers reported in DeepMind's abstract and body text differ slightly in framing: the abstract states AlphaCode 2 "solved 1.7x more problems," while the evaluation section reports the 43 percent versus 25 percent solve rates as a "close to 2x" improvement. Both figures come from the same technical report ^[1].
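
A quick calculation shows how the "1.7x" and "close to 2x" framings can both describe the same pair of solve rates, and how the "top 54 percent" placement quoted in the Background section corresponds to the 46th percentile. This is only an arithmetic check on the rounded figures in this article; the helper names are hypothetical, and how DeepMind arrived at each wording is not stated in the sources.

```python
def improvement_ratio(new_rate: float, old_rate: float) -> float:
    """How many times larger new_rate is than old_rate."""
    if old_rate <= 0:
        raise ValueError("old_rate must be positive")
    return new_rate / old_rate


def percentile_from_top_share(top_share_percent: float) -> float:
    """Convert 'placed in the top X percent' into 'ahead of (100 - X) percent'."""
    if not 0 <= top_share_percent <= 100:
        raise ValueError("top_share_percent must be between 0 and 100")
    return 100.0 - top_share_percent


# Rounded solve rates from the article: 43 percent versus 25 percent.
ratio = improvement_ratio(0.43, 0.25)
print(f"solve-rate ratio: {ratio:.2f}x")  # prints "solve-rate ratio: 1.72x"

# "Top 54 percent" for the original AlphaCode.
print(f"top 54% is about the {percentile_from_top_share(54):.0f}th percentile")
```

The ratio of the rounded rates is 1.72, which reads as 1.7x at one decimal place and as "close to 2x" under a looser rounding.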

Status and significance

AlphaCode 2 was presented as a research milestone, not a launched product, and it was not made publicly available. DeepMind was explicit about its limits, writing that the system "requires a lot of trial and error, and remains too costly to operate at scale," and that it depends heavily on being able to filter out obviously bad samples ^[1]. The company said it was working toward bringing AlphaCode 2's capabilities into its foundation Gemini models as a step toward a more interactive style of programming, a point echoed by DeepMind vice president Eli Collins around the launch ^[3].

The result drew attention as one of the more concrete demonstrations released alongside Gemini, showing that a general-purpose model could be specialized to a hard reasoning task and beat a bespoke predecessor ^[3]^[5]. DeepMind suggested that an even stronger foundation model, such as Gemini Ultra, would likely push the approach further ^[1]. The contest problem featured in the public Gemini demonstration came from the CodeTON Round 4 contest and was used with permission from Codeforces ^[1].

References

1. AlphaCode Team, Google DeepMind. "AlphaCode 2 Technical Report." December 6, 2023. storage.googleapis.com
2. Kyle Wiggers. "Google unveils AlphaCode 2, powered by Gemini." TechCrunch, December 6, 2023. techcrunch.com
3. Thomas Claburn. "AlphaCode 2, a code-generating AI revamped with Gemini." The Register, December 7, 2023. theregister.com
4. Yujia Li et al. "Competition-level code generation with AlphaCode." *Science*, 2022. science.org
5. "Introducing Gemini: Google's most capable AI model yet." Google, December 6, 2023. blog.google
