AlphaCode is an artificial intelligence system developed by DeepMind (now Google DeepMind) that generates computer programs capable of solving competitive programming problems at a human-competitive level. First announced in February 2022 and formally published in Science in December 2022, AlphaCode became the first AI system to achieve a performance level comparable to that of a median human competitor on the Codeforces platform. An upgraded successor, AlphaCode 2, was released in December 2023, leveraging Google's Gemini model to reach the 85th percentile among competitive programmers.
AlphaCode's approach differs fundamentally from typical code-completion tools. Rather than generating a single best-guess solution, AlphaCode produces up to one million candidate programs per problem, then applies filtering, clustering, and selection to narrow the field down to just 10 submissions. This generate-and-test strategy demonstrated that large language models could tackle problems requiring both algorithmic reasoning and creative problem-solving.
Competitive programming involves solving well-defined algorithmic problems under time constraints, where solutions must pass hidden test cases to be accepted. Platforms like Codeforces host regular contests in which thousands of participants compete to solve problems that test skills in areas such as dynamic programming, graph theory, number theory, and combinatorics. These problems are significantly harder than the code-completion tasks that earlier AI coding tools were designed for, because they require understanding a natural language problem description, devising an algorithmic strategy, and implementing a correct solution from scratch.
Before AlphaCode, AI systems such as OpenAI's Codex (the model behind GitHub Copilot) had demonstrated strong performance on code completion and simple programming tasks. However, these systems performed poorly on competitive programming challenges. Codex and GPT-3, when evaluated on Codeforces-style problems, solved very few problems even with multiple attempts. The gap between auto-completing code from context and independently solving novel algorithmic challenges remained wide.
DeepMind set out to close this gap by building a system specifically designed for competitive programming, combining large-scale transformer models with a novel pipeline for generating, filtering, and selecting solutions.
The Codeforces rating system provides a useful framework for understanding competitive programming skill levels. Ratings below 1,200 correspond to "Newbie" level; 1,200 to 1,399 is "Pupil"; 1,400 to 1,599 is "Specialist"; 1,600 to 1,899 is "Expert"; 1,900 to 2,099 is "Candidate Master"; 2,100 to 2,299 is "Master"; 2,300 to 2,399 is "International Master"; 2,400 to 2,599 is "Grandmaster"; 2,600 to 2,999 is "International Grandmaster"; and 3,000 and above is "Legendary Grandmaster." Only a handful of human competitors worldwide hold ratings above 3,000.
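The rating bands above can be expressed as a small lookup table. The following sketch maps a rating to its Codeforces title, with thresholds taken directly from the list above:

```python
# Codeforces rating bands as (lower bound, title), highest first.
RATING_BANDS = [
    (3000, "Legendary Grandmaster"),
    (2600, "International Grandmaster"),
    (2400, "Grandmaster"),
    (2300, "International Master"),
    (2100, "Master"),
    (1900, "Candidate Master"),
    (1600, "Expert"),
    (1400, "Specialist"),
    (1200, "Pupil"),
]

def rating_title(rating: int) -> str:
    """Return the Codeforces title for a given rating."""
    for lower_bound, title in RATING_BANDS:
        if rating >= lower_bound:
            return title
    return "Newbie"  # everything below 1,200
```

For example, `rating_title(1238)` returns `"Pupil"` and `rating_title(1650)` returns `"Expert"`, the bands later sections place AlphaCode 1 and AlphaCode 2 in.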
DeepMind published a blog post and preprint about AlphaCode on February 2, 2022. The preprint appeared on arXiv in March 2022 (arXiv:2203.07814). The peer-reviewed paper, titled "Competition-level code generation with AlphaCode," was published in Science on December 8, 2022 (Volume 378, Issue 6624, pages 1092-1097). The lead authors were Yujia Li, David Choi, and Junyoung Chung, with over 40 co-authors including Oriol Vinyals, Nando de Freitas, and Koray Kavukcuoglu.
AlphaCode uses an encoder-decoder transformer architecture. Unlike decoder-only models such as GPT, which process input and output in a single left-to-right stream, AlphaCode's encoder-decoder design allows the encoder to build a bidirectional representation of the problem description while the decoder generates code autoregressively, one token at a time.
The architecture has several distinctive features:
- An asymmetric encoder-decoder structure: the encoder consumes a longer context than the decoder (1,536 versus 768 tokens), reflecting that problem descriptions are typically much longer than solutions.
- A shallow encoder paired with a deep decoder, which improves sampling throughput without hurting solve rates.
- Multi-query attention, in which all attention heads in a block share a single set of key and value heads, substantially reducing memory use and cache-update cost during large-scale sampling.
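Multi-query attention, one of the optimizations AlphaCode used to make million-sample generation affordable, shares one key/value head across all query heads, shrinking the per-sequence KV cache by the head count. A back-of-the-envelope sketch (the dimensions below are illustrative, not AlphaCode's actual configuration):

```python
def kv_cache_bytes(layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Per-sequence key/value cache size: keys + values, for every layer."""
    return 2 * layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Illustrative transformer configuration (hypothetical numbers).
layers, heads, head_dim, seq_len = 32, 48, 128, 768

multi_head = kv_cache_bytes(layers, heads, head_dim, seq_len)  # one KV head per query head
multi_query = kv_cache_bytes(layers, 1, head_dim, seq_len)     # a single shared KV head

savings = multi_head // multi_query  # cache shrinks by the number of heads (48x here)
```

With a million concurrent samples, a cache that is tens of times smaller is the difference between the pipeline fitting on the available accelerators or not.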
The team trained models at five different scales:
| Model Size | Parameters |
|---|---|
| Small | 300M |
| Medium | 1.1B |
| Large | 2.8B |
| XL | 8.7B |
| XXL | 41.1B |
The final evaluation used an ensemble that pooled samples from the 41.1B and 8.7B parameter models.
AlphaCode's training proceeds in two stages: pre-training on a large corpus of general code, followed by fine-tuning on competitive programming data.
Pre-training: The models were pre-trained on a 715.1 GB snapshot of publicly available code from GitHub, spanning 12 programming languages. The training objective combined standard cross-entropy next-token prediction loss for the decoder with masked language modeling loss for the encoder. The initial learning rate was set to 10^-4 and decayed following a cosine schedule to 10^-5, with global gradient norms clipped to 1.0.
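The learning-rate schedule described above (cosine decay from 10^-4 to 10^-5) can be sketched as follows; this is a minimal version that omits any warmup phase the actual training run may have used:

```python
import math

def cosine_lr(step, total_steps, lr_max=1e-4, lr_min=1e-5):
    """Cosine decay from lr_max at step 0 down to lr_min at total_steps."""
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Starts at 1e-4, passes through 5.5e-5 at the midpoint, ends at 1e-5.
```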
Fine-tuning: After pre-training, the models were fine-tuned on the CodeContests dataset (described below). The fine-tuning process incorporated three important techniques:
- Tempering, which divides the output logits by a temperature below 1 during training, sharpening the token distribution and acting as a regularizer.
- Value conditioning and prediction, which includes both correct and incorrect human submissions in the training data, conditioning generation on (and predicting) whether a sample is correct, so that incorrect solutions still provide training signal.
- GOLD (Generation by Off-policy Learning from Demonstrations), an offline reinforcement-learning-style objective that up-weights tokens the model already assigns high likelihood, letting it concentrate on solutions it can learn well rather than spreading probability mass over every demonstration.
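Tempering, one of these fine-tuning techniques, amounts to dividing logits by a temperature below 1 before the softmax, which concentrates probability on the highest-scoring tokens. A minimal sketch (the temperature value here is illustrative):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def tempered_softmax(logits, temperature=0.2):
    """Tempering: dividing logits by T < 1 sharpens the distribution."""
    return softmax([x / temperature for x in logits])

logits = [2.0, 1.0, 0.0]
plain = softmax(logits)
sharp = tempered_softmax(logits)
# The highest-logit token receives far more probability mass under tempering.
```

At inference time the effect is compensated by choosing a correspondingly low sampling temperature, as noted in the sampling step below.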
Ablation studies showed that each of these techniques contributed meaningfully to performance. Starting from a 1.1B base model achieving a 15.2% solve rate at 10 submissions from 100,000 samples, cumulative improvements brought the rate to 24.1%: masked language modeling (+1.8%), tempering (+1.7%), metadata tags and ratings (+0.6%), value conditioning (+0.9%), GOLD (+1.3%), and clustering (+2.6%).
A key contribution of the AlphaCode project was the creation and public release of the CodeContests dataset, a benchmark for competitive programming code generation. The dataset aggregates problems from multiple competitive programming platforms, including Codeforces, AtCoder, CodeChef, Aizu, and HackerEarth, along with data from the Description2Code and CodeNet collections.
| Split | Number of Problems | Avg. Submissions per Problem |
|---|---|---|
| Training | 13,328 | 922.4 |
| Validation | 117 | - |
| Test | 165 | - |
The dataset uses a temporal split to prevent data leakage: all training problems were published before July 14, 2021; validation problems date from July 15 to September 20, 2021; and test problems come from September 21, 2021 onward.
A critical feature of CodeContests is its extensive set of generated test cases. The authors found that existing competitive programming datasets suffered from a high false positive rate, where incorrect solutions could pass the provided tests by chance. By generating additional tests using a separate transformer model trained for this purpose, the false positive rate dropped from 62% to 4%.
The CodeContests dataset is publicly available on GitHub (google-deepmind/code_contests) and Hugging Face.
AlphaCode's core innovation lies in its generate-and-test pipeline, which compensates for the imperfect accuracy of any single generated solution by producing a very large number of candidates and then selecting the most promising ones.
Step 1: Large-scale sampling. For each problem, AlphaCode generates up to one million candidate solutions, split evenly between Python and C++. During sampling, the model is conditioned on randomized metadata (such as problem tags and difficulty ratings ranging from 800 to 3,500) to encourage diversity. The sampling temperature is set to 0.25 when using GOLD-trained models and 0.12 for tempering-only models. Standard sampling (rather than top-k or nucleus sampling) is used to maintain maximum diversity.
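The metadata randomization in Step 1 can be sketched as building a conditioning prefix that is prepended to the problem statement before sampling. The tag vocabulary and field names below are illustrative stand-ins, not the exact format AlphaCode uses:

```python
import random

# Illustrative tag vocabulary; the real system uses Codeforces-style problem tags.
TAGS = ["dp", "greedy", "graphs", "math", "brute force", "sortings"]

def random_metadata(rng: random.Random) -> str:
    """Build a randomized conditioning prefix: language, tags, difficulty rating."""
    language = rng.choice(["PYTHON3", "CPP"])       # split evenly across languages
    tags = rng.sample(TAGS, k=rng.randint(1, 3))    # random subset of tags
    rating = rng.randrange(800, 3501, 100)          # difficulty from 800 to 3,500
    return f"LANGUAGE: {language}\nTAGS: {', '.join(tags)}\nRATING: {rating}\n"

rng = random.Random(0)
prefix = random_metadata(rng)  # prepended to the problem statement before sampling
```

Varying this prefix across samples pushes the model toward different algorithmic strategies, which is what makes the later clustering step meaningful.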
Step 2: Filtering. Each generated program is executed against the example test cases provided in the problem statement. Programs that fail any example test are discarded. This step alone removes approximately 99% of all generated samples, typically leaving tens of thousands of candidates.
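The filtering step can be sketched as follows. The real system executes full C++ or Python programs in a sandbox; this sketch models a candidate program as a function from stdin text to stdout text:

```python
def passes_examples(program, example_tests):
    """Keep a candidate only if it reproduces every example output exactly."""
    try:
        return all(program(inp) == out for inp, out in example_tests)
    except Exception:
        return False  # candidates that crash are discarded too

# Toy problem: read an integer, print its double.
examples = [("3", "6"), ("10", "20")]
candidates = [
    lambda s: str(int(s) * 2),  # correct
    lambda s: str(int(s) + 2),  # wrong algorithm
    lambda s: s * 2,            # wrong: string repetition, not arithmetic
]
survivors = [c for c in candidates if passes_examples(c, examples)]
# Only the first candidate survives filtering.
```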
Step 3: Test input generation. A separate transformer model, trained specifically for this purpose, generates additional synthetic test inputs for the problem. These inputs are designed to exercise edge cases and unusual scenarios that the example tests might not cover.
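AlphaCode trains a dedicated transformer for this step; as a crude stand-in, the idea can be illustrated by mutating the numeric tokens of existing example inputs to produce new, structurally similar test inputs:

```python
import random

def mutate_input(example: str, rng: random.Random) -> str:
    """Crude stand-in for learned test-input generation: perturb numeric tokens."""
    mutated = []
    for token in example.split():
        if token.lstrip("-").isdigit():
            token = str(int(token) + rng.randint(-5, 5))
        mutated.append(token)
    return " ".join(mutated)

rng = random.Random(42)
synthetic_inputs = [mutate_input("3 7", rng) for _ in range(5)]
```

The learned model in the actual pipeline produces far more varied inputs than simple perturbation, but even inputs like these are enough to separate behaviorally distinct programs in the clustering step.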
Step 4: Clustering. The surviving candidate programs are executed on the synthetic test inputs. Programs that produce identical outputs for all synthetic inputs are grouped into the same cluster, under the assumption that programs behaving identically are likely implementing the same underlying algorithm. This behavioral clustering reduces tens of thousands of programs into a smaller number of distinct solution strategies.
Step 5: Selection. One representative program is selected from each of the 10 largest clusters, yielding the final set of 10 submissions. The rationale is that the largest clusters represent the most common solution strategies, which are statistically more likely to be correct.
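Steps 4 and 5 together can be sketched in a few lines: group candidates by their output signature on the synthetic inputs, then submit one representative from each of the 10 largest clusters. As in the filtering sketch, a "program" here is a function from input text to output text:

```python
from collections import defaultdict

def run(program, inp):
    """Execute a candidate on one input; crashes count as a distinct behavior."""
    try:
        return program(inp)
    except Exception:
        return None

def select_submissions(programs, synthetic_inputs, k=10):
    """Cluster candidates by behavior; submit one per largest cluster."""
    clusters = defaultdict(list)
    for program in programs:
        # Identical outputs on every synthetic input -> same cluster.
        signature = tuple(run(program, inp) for inp in synthetic_inputs)
        clusters[signature].append(program)
    # Largest clusters first: the most common behavior is most likely correct.
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [cluster[0] for cluster in ranked[:k]]

double = lambda s: str(int(s) * 2)
off_by_two = lambda s: str(int(s) + 2)
submissions = select_submissions([double, double, double, off_by_two], ["1", "2"])
# The doubling cluster is largest, so its representative is submitted first.
```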
AlphaCode was evaluated through simulated participation in 10 recent Codeforces contests, each with more than 5,000 human participants. The system was allowed 10 submissions per problem, a budget chosen to mirror the penalties human competitors incur for incorrect submissions on Codeforces.
| Metric | Result |
|---|---|
| Average ranking among participants | Top 54.3% |
| Estimated Codeforces Elo rating | ~1,238 |
| Solve rate on CodeContests test set (10 submissions from 1M samples) | 34.2% |
| Solve rate (10 submissions from 100K samples) | 29.6% |
| Solve rate (10 submissions from 10K samples) | 23.2% |
The estimated Elo rating of approximately 1,238 placed AlphaCode near the lower end of the "Pupil" category on Codeforces (ratings 1,200-1,399). While this is far from the level of top competitive programmers (whose ratings exceed 3,000), it represented the first time any AI system had reached a competitive level on this platform.
The system showed varying performance across problem categories. It performed best on problems involving bitmasks (33.8% solve rate) and sorting (25.5%), while struggling more with dynamic programming problems (8.8%).
The AlphaCode team conducted analyses to verify that the system was not simply memorizing and regurgitating solutions from its training data. They found that AlphaCode's generated solutions shared substrings with training data at roughly the same rate as human-written solutions did, suggesting genuine generation rather than memorization. The proportion of dead code (unreachable or unused code) in AlphaCode's solutions was also comparable to that in human solutions.
Sensitivity tests showed that AlphaCode was responsive to meaningful changes in problem descriptions (producing different solutions for modified problems) but robust to trivial variations in phrasing, indicating that it was processing the semantic content of the problem rather than matching surface patterns.
The authors also noted an interesting finding about validation loss during fine-tuning. As fine-tuning progressed, validation loss actually increased, even as the solve rate improved. This means that validation loss is a poor proxy for competitive programming performance, and the team had to rely on solve rate evaluations rather than standard loss metrics to track progress during training.
AlphaCode 2 was announced on December 6, 2023, alongside the launch of Google's Gemini model family. The AlphaCode 2 Technical Report was published by the AlphaCode Team at Google DeepMind. The system represents a significant leap over the original AlphaCode, both in raw performance and in computational efficiency.
Where AlphaCode 1 used a custom encoder-decoder transformer trained from scratch, AlphaCode 2 is built on top of Gemini Pro, a large multimodal language model developed by Google. This foundation provides AlphaCode 2 with stronger reasoning capabilities and a richer understanding of both natural language and code.
AlphaCode 2 applies two consecutive rounds of fine-tuning to the Gemini Pro base model using the GOLD algorithm. The first round fine-tunes on an updated version of the CodeContests dataset, which by this point contained approximately 15,000 problems and 30 million human code samples. The second round further refines the model's ability to generate correct competitive programming solutions.
The system's pipeline retains the same broad structure as AlphaCode 1 (generate, filter, cluster, select) but adds a learned scoring model.
The addition of the learned scoring model is a notable improvement over AlphaCode 1, which simply picked an arbitrary representative from each cluster. By training a dedicated model to predict solution correctness, AlphaCode 2 can make more informed selections within each cluster.
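The idea can be sketched as follows. Here `score` stands in for the learned estimate of solution correctness; the exact interaction between scoring and cluster ranking in the real system is more involved than this:

```python
def select_best(clusters, score, k=10):
    """AlphaCode 2-style selection sketch: instead of an arbitrary representative,
    rank each cluster's members with a scorer and submit the top-scoring one."""
    submissions = []
    for cluster in sorted(clusters, key=len, reverse=True)[:k]:
        submissions.append(max(cluster, key=score))
    return submissions

# Toy usage: two clusters of candidate strings, scored by a stand-in heuristic.
clusters = [["a", "bbb", "cc"], ["dddd"]]
submissions = select_best(clusters, score=len)
```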
Google DeepMind noted that an even more powerful version could potentially be built using Gemini Ultra, a larger and more capable model than Gemini Pro. However, the AlphaCode 2 results were published using only the Gemini Pro foundation, leaving room for future improvements with stronger base models.
AlphaCode 2 was evaluated on 12 recent Codeforces contests. Its performance represents a substantial improvement over the original system.
| Metric | AlphaCode 1 | AlphaCode 2 |
|---|---|---|
| Problems solved (relative) | Baseline | ~1.7x more |
| Codeforces percentile | ~46th (top 54%) | ~85th (top 15%) |
| Estimated Codeforces rating | ~1,238 | ~1,650 |
| Codeforces rank category | Pupil | Expert to Candidate Master |
AlphaCode 2 solved 43% of competition problems within 10 attempts, compared to 25% for the original AlphaCode on the same contests. The estimated Codeforces rating of approximately 1,650 places the system in the "Expert" band (1,600-1,899), approaching the "Candidate Master" level (1,900-2,099).
One of AlphaCode 2's most striking improvements is its sample efficiency. The system is reported to be more than 10,000 times more sample-efficient than the original AlphaCode, meaning it can achieve comparable performance with far fewer generated candidates. In practice, AlphaCode 2 can match AlphaCode 1's performance using only about 100 generated samples, compared to the one million samples that AlphaCode 1 required. This improvement stems from the stronger base model (Gemini Pro), better fine-tuning, and the learned scoring mechanism.
AlphaCode occupies a distinct position in the landscape of AI code generation. The following table compares AlphaCode with other notable systems.
| System | Developer | Year | Approach | Codeforces Elo (est.) | Codeforces Percentile |
|---|---|---|---|---|---|
| AlphaCode | DeepMind | 2022 | Encoder-decoder, generate 1M + filter/cluster | ~1,238 | ~46th |
| AlphaCode 2 | Google DeepMind | 2023 | Gemini Pro fine-tuned, generate + score/cluster | ~1,650 | ~85th |
| Codex | OpenAI | 2021 | Decoder-only (GPT-3 variant), single-pass generation | Not competitive | Very low |
| GPT-4 | OpenAI | 2023 | Decoder-only, direct generation | ~392 | Below 5th |
| o1 | OpenAI | 2024 | Reasoning model with chain-of-thought | ~1,673 | ~89th |
| o3 | OpenAI | 2024 | Reasoning model, self-verification | ~2,727 | ~99.8th |
Several key differences stand out:
AlphaCode vs. Codex and GPT-4: Codex and GPT-4 are general-purpose code generation models optimized for broad programming assistance, including code completion, debugging, and explanation. They generate solutions in a single pass without the massive sampling and filtering pipeline that AlphaCode employs. On competitive programming tasks, GPT-4 scored an estimated Codeforces rating of only 392 (Newbie level, below the 5th percentile), while AlphaCode reached roughly 1,238. The difference highlights that competitive programming requires a fundamentally different approach than general code assistance.
AlphaCode vs. OpenAI o1 and o3: OpenAI's o-series reasoning models, released in 2024 and 2025, represent a different paradigm. Rather than generating millions of candidates and filtering externally, o1 and o3 use extended chain-of-thought reasoning to work through problems step by step. The o1 model achieved an estimated Codeforces rating of 1,673 (89th percentile), comparable to AlphaCode 2. The o3 model pushed further, reaching approximately 2,727 (99.8th percentile), roughly equivalent to the 175th best human competitor globally. Notably, o3 achieved this with far fewer samples (around 1,000 per problem) and without hand-crafted domain-specific strategies, relying instead on self-verification by writing and running brute-force solutions to cross-check its optimized implementations.
AlphaCodium: Inspired by AlphaCode's generate-and-test philosophy, CodiumAI released AlphaCodium in January 2024, an open-source tool that uses a "flow engineering" approach to iterative code generation and testing. AlphaCodium showed that the core insight behind AlphaCode (generating multiple candidates and testing them) could be applied more broadly, even without the massive computational resources that DeepMind's system requires.
AlphaCode's generate-and-test approach, while effective, has drawn several criticisms:
Computational cost: Generating up to one million candidate solutions per problem is extremely resource-intensive. The sampling, execution, and clustering pipeline requires substantial compute infrastructure, making it impractical for real-time or interactive use. AlphaCode 2 improved efficiency significantly, but the approach remains far more expensive per problem than single-pass generation.
Scaling concerns: Computer scientist Ernest Davis observed that the system has "a substantial component of monkeys typing Hamlet." AlphaCode needed one million samples to achieve a 34% solve rate on problems with solutions averaging around 20 lines. Critics have noted that the number of required samples could grow exponentially for longer and more complex programs, potentially limiting the approach's applicability beyond competitive programming.
Lack of genuine understanding: Unlike human programmers who use intuition, debugging, and iterative refinement to converge on a solution, AlphaCode relies on statistical coverage of the solution space. It does not debug its own code or reason about why a particular approach fails. The system's success comes from generating enough diverse candidates that at least one is likely correct, rather than from a deep understanding of the problem.
Dataset dependency: The filtering pipeline relies on having high-quality test cases to evaluate candidate solutions. In competitive programming, problems come with example tests and hidden test suites, making this approach natural. For general software engineering tasks, where requirements are often ambiguous and test coverage is incomplete, the generate-and-test strategy may be less effective.
Limited problem complexity: AlphaCode's performance dropped sharply on harder problem categories. Its 8.8% solve rate on dynamic programming problems (compared to 33.8% on bitmask problems) suggests that the system struggles with problems requiring multi-step reasoning and complex state management.
AlphaCode's contributions to the field extend beyond its competitive programming results:
Demonstrating the generate-and-test paradigm: AlphaCode showed that for hard problems where single-pass generation fails, generating many candidates and filtering them based on execution behavior can be highly effective. This insight has influenced subsequent work on test-time compute scaling and code verification.
The CodeContests benchmark: The publicly released CodeContests dataset has become a standard benchmark for evaluating code generation systems on competitive programming tasks. Its rigorous test suite and temporal splits address many weaknesses of earlier benchmarks.
Advancing code generation research: By demonstrating that AI could reach human-competitive levels on a challenging and well-defined task, AlphaCode helped catalyze a wave of research and commercial interest in AI-assisted programming. Systems like GitHub Copilot, Cursor, and Devin owe part of their momentum to the broader attention that AlphaCode brought to the field.
Bridging to reasoning models: AlphaCode 2's integration with Gemini Pro foreshadowed the trend of building specialized capabilities on top of large foundation models. The subsequent success of reasoning-oriented models (such as OpenAI's o-series) on competitive programming tasks suggests that the field is moving toward systems that combine strong base models with structured problem-solving strategies, a direction that AlphaCode helped pioneer.
Influencing test-time compute research: AlphaCode demonstrated that allocating more computation at inference time (by generating and evaluating many candidates) could substitute for improvements to the base model itself. This trade-off between training-time and test-time compute has become a central theme in AI research, with subsequent systems exploring various ways to spend additional computation during inference to improve output quality. The concept of scaling test-time compute, which gained significant traction in 2024 and 2025, traces part of its lineage to the large-scale sampling strategies that AlphaCode validated.
Open-source contributions: By releasing the CodeContests dataset and providing detailed descriptions of their methodology, DeepMind enabled the broader research community to build upon their work. The dataset has been downloaded thousands of times from Hugging Face and GitHub, and it has been used as a benchmark in dozens of subsequent papers on code generation and program synthesis.
| Date | Event |
|---|---|
| February 2, 2022 | DeepMind announces AlphaCode via blog post and preprint |
| March 2022 | Preprint posted to arXiv (2203.07814) |
| December 8, 2022 | Peer-reviewed paper published in Science (Vol. 378, pp. 1092-1097) |
| December 6, 2023 | AlphaCode 2 Technical Report released alongside Gemini launch |