AlphaCode 2
Last reviewed
Jun 3, 2026
Sources
5 citations
Review status
Source-backed
Revision
v1 · 1,426 words
AlphaCode 2 is a competitive-programming system built by Google DeepMind that uses a fine-tuned version of the Gemini family of language models to generate, filter, and rank candidate solutions to algorithmic contest problems. It was unveiled on December 6, 2023, the same day Google announced Gemini, and is the successor to AlphaCode, the 2022 system that first reached the level of a median human competitor [1][2]. AlphaCode 2 was released as a research result rather than a product. DeepMind reported that on Codeforces, the platform used to evaluate the original system, AlphaCode 2 solved close to twice as many problems and performed better than an estimated 85 percent of contestants [3].
The original AlphaCode, described by Yujia Li and colleagues in a 2022 paper in Science, was the first AI system to perform at a competitive level on programming contests. It worked by generating very large numbers of candidate programs and then narrowing them down through filtering and clustering before submitting a small selection. In simulated evaluations on recent Codeforces contests it placed in roughly the top 54 percent of participants, which DeepMind characterized as the level of a median competitor [4]. AlphaCode was a research demonstration and was never released as a product [3].
Competitive programming is a demanding test for code-generating systems because the problems are open-ended. Before any code is written, a solver has to read a natural-language description, reason about it, and design an algorithm that fits tight time and memory limits. DeepMind frames this as a benchmark for advanced reasoning rather than ordinary software engineering, and it is one reason general-purpose models had performed poorly on contest problems [1].
AlphaCode 2 keeps the overall shape of its predecessor but rebuilds every component on top of Gemini Pro, the mid-tier model in Google's first Gemini generation. DeepMind describes adopting Gemini as the foundation model for all components as the key change behind the improved performance [1].
The starting point is the Gemini Pro model, to which the team applies two consecutive rounds of fine-tuning using the GOLD training objective. The first round uses an updated version of the CodeContests dataset, containing roughly 15,000 problems and 30 million human code samples; a second round adds a smaller, higher-quality dataset. Rather than relying on a single network, the process produces a family of fine-tuned policy models, because using several models with varied hyperparameters increases the diversity of generated code, which matters for solving hard problems [1].
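At inference time the system runs a multi-stage search and reranking pipeline: it samples a large pool of candidate programs, filters out those that are clearly wrong, clusters the survivors, and applies a learned scoring model to choose the small set of solutions it submits [1][2].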
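As a rough illustration of the filter, cluster, and score flow, here is a minimal Python sketch. It is not DeepMind's implementation. Candidate programs are modeled as plain Python callables, and all function names are hypothetical. Filtering on the examples from the problem statement, grouping candidates by identical outputs on a few probe inputs, and the constant `score` function are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# Assumption: a candidate "program" is a callable from input text to output text.
Program = Callable[[str], str]


def filter_candidates(
    candidates: Sequence[Program], examples: Sequence[Tuple[str, str]]
) -> List[Program]:
    """Keep only candidates that reproduce every (input, output) example."""
    return [p for p in candidates if all(p(x) == y for x, y in examples)]


def cluster_by_behavior(
    candidates: Sequence[Program], probe_inputs: Sequence[str]
) -> List[List[Program]]:
    """Group candidates with identical outputs on the probes, largest group first."""
    groups: Dict[Tuple[str, ...], List[Program]] = defaultdict(list)
    for p in candidates:
        groups[tuple(p(x) for x in probe_inputs)].append(p)
    return sorted(groups.values(), key=len, reverse=True)


def select_submissions(
    clusters: Sequence[Sequence[Program]],
    score: Callable[[Program], float],
    max_submissions: int = 10,
) -> List[Program]:
    """Take the best-scoring candidate from each of the largest clusters."""
    return [max(c, key=score) for c in clusters[:max_submissions]]


# Toy problem (made up): print twice the input integer.
candidates: List[Program] = [
    lambda s: str(int(s) * 2),       # correct
    lambda s: str(int(s) + int(s)),  # correct, written differently
    lambda s: str(int(s) + 2),       # wrong, but passes the single example
    lambda s: str(int(s) ** 2),      # wrong, but passes the single example
    lambda s: "0",                   # wrong, removed by filtering
]
kept = filter_candidates(candidates, examples=[("2", "4")])
clusters = cluster_by_behavior(kept, probe_inputs=["3", "5"])
picks = select_submissions(clusters, score=lambda p: 0.0, max_submissions=2)
print([len(c) for c in clusters])  # [2, 1, 1]
```

In this toy run the two correct programs behave identically and so form the largest cluster. This shows why grouping by behavior can bring likely-correct solutions to the front before any scoring is applied.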
DeepMind evaluated AlphaCode 2 on Codeforces, the same platform used for the original system. The team selected 12 recent contests with more than 8,000 participants each, drawn from Division 2 or the harder combined "1+2" division, giving a total of 77 problems. For each problem the system sampled up to one million candidates and submitted at most ten solutions, stopping as soon as one was accepted or the candidates ran out [1].
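The submission protocol described above can be sketched as a small helper. This is an illustration under stated assumptions, not the evaluation harness used in the report. The names `solved_within_budget`, `solve_rate`, and `is_accepted` are hypothetical, and `is_accepted` stands in for the contest judge.

```python
from typing import Callable, Iterable, Sequence


def solved_within_budget(
    ranked_candidates: Sequence[str],
    is_accepted: Callable[[str], bool],
    budget: int = 10,
) -> bool:
    """Submit candidates in ranked order, stopping at the first accepted one.

    Returns False if the budget or the candidate list is exhausted first.
    """
    for candidate in ranked_candidates[:budget]:
        if is_accepted(candidate):
            return True
    return False


def solve_rate(outcomes: Iterable[bool]) -> float:
    """Fraction of problems solved, given one boolean outcome per problem."""
    results = list(outcomes)
    return sum(results) / len(results) if results else 0.0
```

For example, `solve_rate` applied to the per-problem outcomes of a benchmark gives the kind of "solved within 10 submissions" figure reported in the table below.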
AlphaCode 2 solved 43 percent of these problems, close to a 2x improvement over the original AlphaCode's 25 percent on the same benchmark. Mapping that to contest rankings, DeepMind estimated that AlphaCode 2 sits at the 85th percentile on average, placing it between the Codeforces "Expert" and "Candidate Master" tiers and ahead of about 85 percent of entrants. The original AlphaCode was estimated at roughly the 46th percentile on this comparison. In the two contests where it did best, AlphaCode 2 outperformed more than 99.5 percent of participants [1].
| Metric | Original AlphaCode | AlphaCode 2 |
|---|---|---|
| Problems solved (within 10 submissions, 77-problem set) | 25% | 43% |
| Estimated Codeforces percentile | ~46th | ~85th |
| Codeforces tier (approx.) | below median | Expert to Candidate Master |
| Best-case contests | not reported | >99.5th percentile |
| Sampling languages | Python and C++ | C++ only |
The report also measured how performance scaled with the number of samples. As with the original system, the solve rate rose roughly log-linearly with more samples, and AlphaCode 2 needed only about 100 samples per problem to match the level the original AlphaCode reached with a million. DeepMind described this as making the new system over 10,000 times more sample efficient. In an additional "AlphaCode 2 plus human" setting, where a person specifies extra filtering properties, the combined system scored above the 90th percentile [1].
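The sample-efficiency figure follows directly from the two sample counts quoted above, and "log-linear" means each tenfold increase in samples adds roughly the same amount to the solve rate. The short Python sketch below works through both points. The function `log_linear_solve_rate` and its coefficients are made up for illustration and are not fitted to DeepMind's data.

```python
import math


def log_linear_solve_rate(samples: int, intercept: float, slope: float) -> float:
    """Hypothetical log-linear model: solve rate rises by `slope` per 10x samples."""
    return intercept + slope * math.log10(samples)


# Figures quoted in the text.
original_samples = 1_000_000  # samples per problem for the original AlphaCode
alphacode2_samples = 100      # approximate samples AlphaCode 2 needs to match it

efficiency_gain = original_samples / alphacode2_samples
print(efficiency_gain)              # 10000.0
print(math.log10(efficiency_gain))  # 4.0

# Made-up coefficients: going from 100 to 1,000 samples adds exactly one
# `slope` step, as does any other tenfold increase.
gain_per_decade = (
    log_linear_solve_rate(1_000, intercept=0.10, slope=0.05)
    - log_linear_solve_rate(100, intercept=0.10, slope=0.05)
)
```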
The two systems share a search-and-rerank philosophy, but the upgrades are substantial. The most visible change is the foundation model: where AlphaCode used a purpose-built encoder-decoder transformer, AlphaCode 2 fine-tunes Gemini Pro for both code generation and the scoring step, and DeepMind credits Gemini's flexibility for the gains on those two very different tasks [1]. The headline accuracy nearly doubled, from solving 25 percent of the benchmark problems to 43 percent, and the estimated ranking jumped from around the median to the 85th percentile [1][3].
Sample efficiency is the other large difference. Both systems can draw on up to a million samples per problem, but AlphaCode 2 reaches its predecessor's performance with roughly 100, an improvement of about four orders of magnitude (a factor of 10,000) [1]. The pipeline was also streamlined to sample only in C++ and to use a learned scoring model on top of the clustering stage. The abstract and the body of DeepMind's technical report frame the accuracy gain slightly differently: the abstract states that AlphaCode 2 "solved 1.7x more problems," while the evaluation section describes the 43 percent versus 25 percent solve rates as a "close to 2x" improvement. Both figures come from the same report [1].
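The two framings can be checked against the reported solve rates with a line of arithmetic. The variable names in this Python snippet are illustrative only.

```python
# Solve rates on the 77-problem benchmark, in percent, as quoted in the text.
alphacode_solve_pct = 25
alphacode2_solve_pct = 43

ratio = alphacode2_solve_pct / alphacode_solve_pct
print(ratio)  # 1.72
```

A ratio of 1.72 rounds to the abstract's "1.7x". Describing the same number as "close to 2x" is the looser of the two readings.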
AlphaCode 2 was presented as a research milestone, not a launched product, and it was not made publicly available. DeepMind was explicit about its limits, writing that the system "requires a lot of trial and error, and remains too costly to operate at scale," and that it depends heavily on being able to filter out obviously bad samples [1]. The company said it was working toward bringing AlphaCode 2's capabilities into its foundation Gemini models as a step toward a more interactive style of programming, a point echoed by DeepMind vice president Eli Collins around the launch [3].
The result drew attention as one of the more concrete demonstrations released alongside Gemini, showing that a general-purpose model could be specialized to a hard reasoning task and beat a bespoke predecessor [3][5]. DeepMind suggested that an even stronger foundation model, such as Gemini Ultra, would likely push the approach further [1]. The contest problem featured in the public Gemini demonstration came from the CodeTON Round 4 contest and was used with permission from Codeforces [1].