# AlphaCode

> Source: https://aiwiki.ai/wiki/alphacode
> Updated: 2026-06-23
> Categories: AI Code Generation, Artificial Intelligence, Google DeepMind, Large Language Models
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**AlphaCode** is an [artificial intelligence](/wiki/artificial_intelligence) system developed by [Google DeepMind](/wiki/deepmind) that generates computer programs capable of solving competitive programming problems at a human-competitive level, and it was the first AI to rank at roughly the median level (top 54.3%) of human competitors on the [Codeforces](/wiki/codeforces) platform.[1] First announced on February 2, 2022 and formally published in *Science* on December 8, 2022, AlphaCode works by sampling up to one million candidate programs per problem and then filtering and clustering them down to just 10 submissions.[1][4] DeepMind described the milestone plainly: AlphaCode "placed at about the level of the median competitor, marking the first time an AI code generation system has reached a competitive level of performance in programming competitions."[4] An upgraded successor, **[AlphaCode 2](/wiki/alphacode_2)**, was released on December 6, 2023, using [Google](/wiki/google)'s [Gemini](/wiki/gemini) Pro model to reach the 85th percentile among competitive programmers.[3]

AlphaCode's approach differs fundamentally from typical code-completion tools, which return a single best-guess completion. By sampling at scale and discarding candidates based on how they behave when executed, this generate-and-test strategy demonstrated that [large language models](/wiki/large_language_model) could tackle problems requiring both algorithmic reasoning and creative problem-solving.[1]

## What problem was AlphaCode built to solve?

Competitive programming involves solving well-defined algorithmic problems under time constraints, where solutions must pass hidden test cases to be accepted. Platforms like Codeforces host regular contests in which thousands of participants compete to solve problems that test skills in areas such as [dynamic programming](/wiki/dynamic_programming), graph theory, number theory, and combinatorics. These problems are significantly harder than the code-completion tasks that earlier AI coding tools were designed for, because they require understanding a natural language problem description, devising an algorithmic strategy, and implementing a correct solution from scratch.[1]

Before AlphaCode, AI systems such as [OpenAI](/wiki/openai)'s [Codex](/wiki/openai_codex) (the model behind [GitHub Copilot](/wiki/github_copilot)) had demonstrated strong performance on code completion and simple programming tasks.[6] However, these systems performed poorly on competitive programming challenges. Codex and [GPT-3](/wiki/gpt-3), when evaluated on Codeforces-style problems, solved very few problems even with multiple attempts.[1] The gap between auto-completing code from context and independently solving novel algorithmic challenges remained wide.

DeepMind set out to close this gap by building a system specifically designed for competitive programming, combining large-scale [transformer](/wiki/transformer) models with a novel pipeline for generating, filtering, and selecting solutions.[1]

The Codeforces rating system provides a useful framework for understanding competitive programming skill levels. Ratings below 1,200 correspond to "Newbie" level; 1,200 to 1,399 is "Pupil"; 1,400 to 1,599 is "Specialist"; 1,600 to 1,899 is "Expert"; 1,900 to 2,099 is "Candidate Master"; 2,100 to 2,299 is "Master"; 2,300 to 2,399 is "International Master"; 2,400 to 2,599 is "Grandmaster"; 2,600 to 2,999 is "International Grandmaster"; and 3,000 and above is "Legendary Grandmaster." Only a handful of human competitors worldwide hold ratings above 3,000.
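The rating bands above can be expressed as a small lookup. This is a minimal sketch that uses only the thresholds quoted in this article; the function name `codeforces_rank` is illustrative and is not part of any Codeforces API.

```python
from bisect import bisect_right

# Lower bound of every band above "Newbie", as quoted in the paragraph above.
_BAND_FLOORS = [1200, 1400, 1600, 1900, 2100, 2300, 2400, 2600, 3000]
_BAND_NAMES = [
    "Newbie", "Pupil", "Specialist", "Expert", "Candidate Master",
    "Master", "International Master", "Grandmaster",
    "International Grandmaster", "Legendary Grandmaster",
]


def codeforces_rank(rating: int) -> str:
    """Return the rank title for a rating, using the bands listed in the text."""
    # bisect_right counts how many band floors are <= rating,
    # which is exactly the index of the matching title.
    return _BAND_NAMES[bisect_right(_BAND_FLOORS, rating)]
```

For example, `codeforces_rank(1238)` returns `"Pupil"` and `codeforces_rank(1650)` returns `"Expert"`.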

## AlphaCode (Version 1)

### When was AlphaCode released and published?

DeepMind published a blog post and preprint about AlphaCode on February 2, 2022.[4] The preprint appeared on arXiv in March 2022 (arXiv:2203.07814).[2] The peer-reviewed paper, titled "Competition-level code generation with AlphaCode," was published in *Science* on December 8, 2022 (Volume 378, Issue 6624, pages 1092-1097).[1] The lead authors were Yujia Li, David Choi, and Junyoung Chung, with more than 20 co-authors including Oriol Vinyals, Nando de Freitas, and [Koray Kavukcuoglu](/wiki/koray_kavukcuoglu).[1]

### How is AlphaCode's architecture designed?

AlphaCode uses an [encoder-decoder](/wiki/sequence-to-sequence_task) [transformer](/wiki/transformer) architecture.[1] Unlike decoder-only models such as [GPT](/wiki/gpt-3), which process input and output in a single left-to-right stream, AlphaCode's encoder-decoder design allows the encoder to build a bidirectional representation of the problem description while the decoder generates code autoregressively, one token at a time.[1]

The architecture has several distinctive features:

- **Asymmetric token lengths:** The encoder processes up to 1,536 tokens of the problem description, while the decoder generates sequences of up to 768 tokens. This asymmetry reflects the fact that competitive programming problem descriptions tend to be roughly twice as long as their solutions.[1]
- **Shallow encoder, deep decoder:** The encoder uses fewer transformer layers than the decoder. This design significantly improved training efficiency without reducing the solve rate, because the decoder performs the more computationally demanding task of generating code token by token.[1]
- **Multi-query attention:** Rather than maintaining separate key and value heads for every attention head, AlphaCode shares key and value heads within each attention block while keeping a full set of query heads. This dramatically reduces memory usage and speeds up the sampling process, which is critical when generating millions of candidate solutions.[1]
- **SentencePiece tokenizer:** A vocabulary of 8,000 tokens was used to handle both natural language problem descriptions and source code in multiple programming languages.[1]
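To make the memory argument for multi-query attention concrete, the sketch below compares the size of the key/value cache for one decoded sequence under standard multi-head attention and under multi-query attention. The layer count, head count, head dimension, and 2-byte values are hypothetical placeholders chosen for illustration, not AlphaCode's published configuration; only the 768-token decoder length comes from the text.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Leading factor of 2: one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical decoder shape (assumption, not from the paper).
N_LAYERS, N_QUERY_HEADS, HEAD_DIM = 24, 16, 128
SEQ_LEN = 768  # decoder length quoted in the text

# Multi-head: every query head has its own key/value head.
multi_head = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
# Multi-query: all query heads in a block share one key/value head.
multi_query = kv_cache_bytes(N_LAYERS, 1, HEAD_DIM, SEQ_LEN)

reduction = multi_head // multi_query  # equals the number of query heads
```

Under these assumptions the cache shrinks by a factor equal to the number of query heads, which matters most when many samples are decoded in parallel.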

The team trained models at five different scales:

| Model Size | Parameters |
|---|---|
| Small | 300M |
| Medium | 1.1B |
| Large | 2.8B |
| XL | 8.7B |
| XXL | 41.1B |

The final evaluation used an ensemble that pooled samples from the 41.1B and 8.7B parameter models.[1]

### How was AlphaCode trained?

AlphaCode's training proceeds in two stages: pre-training on a large corpus of general code, followed by fine-tuning on competitive programming data.[1]

**[Pre-training](/wiki/pre-training):** The models were pre-trained on a 715.1 GB snapshot of publicly available code from GitHub, spanning 12 programming languages.[1] The training objective combined standard cross-entropy next-token prediction loss for the decoder with masked language modeling loss for the encoder. The initial learning rate was set to 10^-4 and decayed following a cosine schedule to 10^-5, with global gradient norms clipped to 1.0.[1]
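The optimizer settings above can be sketched in a few lines. This is an illustrative reading of "cosine schedule" and "global gradient norm clipping", assuming no warmup phase and a caller-supplied total step count; neither of those details is stated in this article.

```python
import math

LR_MAX = 1e-4  # initial learning rate from the text
LR_MIN = 1e-5  # final learning rate from the text


def cosine_lr(step: int, total_steps: int) -> float:
    """Cosine decay from LR_MAX at step 0 to LR_MIN at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return LR_MIN + 0.5 * (LR_MAX - LR_MIN) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads: list[float], max_norm: float = 1.0) -> list[float]:
    """Rescale a flat gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]
```

Halfway through training this schedule sits at the midpoint of the two rates, and a gradient such as `[3.0, 4.0]` (norm 5) is scaled down to norm 1.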

**Fine-tuning:** After pre-training, the models were fine-tuned on the CodeContests dataset (described below).[1] The fine-tuning process incorporated three important techniques:

- **Tempering:** The output logits were divided by a temperature of T = 0.2 during training. Because T < 1 sharpens the distribution on which the loss is computed, the model's untempered distribution at sampling time is comparatively smoother; the authors used this as a regularizer to limit overfitting to the fine-tuning set.[1]
- **GOLD (Generation by Off-policy Learning from Demonstrations):** An offline [reinforcement learning](/wiki/reinforcement_learning) algorithm that shifts the training objective toward precision rather than recall. A competitive programming problem can have many valid solutions, but only one needs to be found within the submission budget. GOLD therefore up-weights tokens the model already assigns high likelihood and down-weights those far outside its distribution, so the model is not forced to cover every solution in the training set.[1]
- **Value conditioning:** Correctness labels were inserted into the training data so the model could learn to distinguish between correct and incorrect solution patterns. A value prediction auxiliary task further reinforced this signal.[1]
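The effect of tempering on a single token distribution can be seen in a small sketch. The logits below are made-up values for three candidate tokens; only T = 0.2 comes from the text.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]         # hypothetical scores for three tokens
plain = softmax(logits)          # T = 1
tempered = softmax(logits, 0.2)  # T = 0.2, the training-time value in the text
```

Dividing by T = 0.2 multiplies every logit gap by five, so `tempered` puts far more mass on the top token than `plain` does. This is the sharpened distribution that the fine-tuning loss is computed on.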

Ablation studies showed that each of these techniques contributed meaningfully to performance. Starting from a 1.1B base model achieving a 15.2% solve rate at 10 submissions from 100,000 samples, cumulative improvements brought the rate to 24.1%, a total gain of 8.9 percentage points: masked language modeling (+1.8), tempering (+1.7), metadata tags and ratings (+0.6), value conditioning (+0.9), GOLD (+1.3), and clustering (+2.6).[1]

### What is the CodeContests dataset?

A key contribution of the AlphaCode project was the creation and public release of the CodeContests dataset, a benchmark for competitive programming code generation.[1][5] The dataset aggregates problems from multiple competitive programming platforms, including Codeforces, AtCoder, CodeChef, Aizu, and HackerEarth, along with data from the Description2Code and CodeNet collections.[1]

| Split | Number of Problems | Avg. Submissions per Problem |
|---|---|---|
| Training | 13,328 | 922.4 |
| Validation | 117 | - |
| Test | 165 | - |

The dataset uses a temporal split to prevent data leakage: all training problems were published on or before July 14, 2021; validation problems date from July 15 to September 20, 2021; and test problems come from September 21, 2021 onward.[1]
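A temporal split of this kind reduces to a date comparison. The sketch below uses the cutoff dates quoted above; treating July 14, 2021 itself as part of the training range is an assumption of this sketch, and `assign_split` is an illustrative name rather than a function from the CodeContests repository.

```python
from datetime import date

TRAIN_END = date(2021, 7, 14)  # last training date (inclusive here, by assumption)
VALID_END = date(2021, 9, 20)  # last validation date


def assign_split(published: date) -> str:
    """Assign a problem to a split based only on its publication date."""
    if published <= TRAIN_END:
        return "train"
    if published <= VALID_END:
        return "valid"
    return "test"
```

Because the assignment depends only on the date, no test problem can predate anything the model saw during training, which is the property that guards against leakage.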

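A critical feature of CodeContests is its extensive set of generated test cases. The authors found that existing competitive programming datasets suffered from a high false positive rate: incorrect solutions could pass the provided tests because those tests were too few or too weak. To address this, they generated additional tests by mutating existing test inputs (for example, flipping bits, incrementing or decrementing integers, and swapping or changing characters in strings) and kept only the mutated inputs on which known-correct solutions agreed. With these extra tests, the false positive rate dropped from 62% to 4%.[1]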

The CodeContests dataset is publicly available on GitHub (google-deepmind/code_contests) and [Hugging Face](/wiki/hugging_face).[5]

### How does the generate-and-test pipeline work?

AlphaCode's core innovation lies in its generate-and-test pipeline, which compensates for the imperfect accuracy of any single generated solution by producing a very large number of candidates and then selecting the most promising ones.[1]

**Step 1: Large-scale sampling.** For each problem, AlphaCode generates up to one million candidate solutions, split evenly between Python and C++. During sampling, the model is conditioned on randomized metadata (such as problem tags and difficulty ratings ranging from 800 to 3,500) to encourage diversity. The sampling temperature is set to 0.25 when using GOLD-trained models and 0.12 for tempering-only models. Standard sampling (rather than top-k or nucleus sampling) is used to maintain maximum diversity.[1]

**Step 2: Filtering.** Each generated program is executed against the example test cases provided in the problem statement. Programs that fail any example test are discarded. This step alone removes approximately 99% of all generated samples, typically leaving tens of thousands of candidates.[1]

**Step 3: Test input generation.** A separate transformer model, trained specifically for this purpose, generates additional synthetic test inputs for the problem. These inputs are designed to exercise edge cases and unusual scenarios that the example tests might not cover.[1]

**Step 4: [Clustering](/wiki/clustering).** The surviving candidate programs are executed on the synthetic test inputs. Programs that produce identical outputs for all synthetic inputs are grouped into the same cluster, under the assumption that programs behaving identically are likely implementing the same underlying algorithm. This behavioral clustering reduces tens of thousands of programs into a smaller number of distinct solution strategies.[1]

**Step 5: Selection.** One representative program is selected from each of the 10 largest clusters, yielding the final set of 10 submissions. The rationale is that the largest clusters represent the most common solution strategies, which are statistically more likely to be correct.[1]
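Steps 2, 4, and 5 can be sketched on a toy problem. In this illustration a "program" is simply a Python callable from input text to output text, standing in for compiling and executing real source code. The function names and the toy candidates are hypothetical, and the real system additionally sandboxes execution and enforces time limits.

```python
from collections import defaultdict
from typing import Callable, Optional

# Stand-in for "compile and execute a candidate": input text -> output text.
Program = Callable[[str], str]


def run(program: Program, test_input: str) -> Optional[str]:
    """Run one candidate; any crash counts as a failed run (None)."""
    try:
        return program(test_input)
    except Exception:
        return None


def filter_by_examples(candidates, examples):
    """Step 2: keep candidates that pass every example test."""
    return [p for p in candidates if all(run(p, x) == y for x, y in examples)]


def cluster_by_outputs(candidates, synthetic_inputs):
    """Step 4: group candidates whose outputs agree on all synthetic inputs."""
    clusters = defaultdict(list)
    for p in candidates:
        clusters[tuple(run(p, x) for x in synthetic_inputs)].append(p)
    return list(clusters.values())


def select_submissions(clusters, budget=10):
    """Step 5: one program from each of the largest clusters, up to the budget."""
    ranked = sorted(clusters, key=len, reverse=True)
    return [cluster[0] for cluster in ranked[:budget]]


# Toy problem: "print twice the input integer".
candidates = [
    lambda s: str(2 * int(s)),       # correct
    lambda s: str(int(s) + int(s)),  # correct, written differently
    lambda s: str(int(s) ** 2),      # wrong, but passes the example 2 -> 4
    lambda s: "4",                   # wrong, hard-codes the example answer
    lambda s: str(int(s) + 1),       # wrong, fails the example
]
examples = [("2", "4")]
synthetic_inputs = ["3", "5"]

survivors = filter_by_examples(candidates, examples)
clusters = cluster_by_outputs(survivors, synthetic_inputs)
submissions = select_submissions(clusters)
```

In this toy run the example test removes one candidate, the synthetic inputs separate the two correct programs from the two lucky wrong ones, and the largest cluster (the correct behavior) is submitted first.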

### How well did AlphaCode perform on Codeforces?

AlphaCode was evaluated through simulated participation in 10 recent Codeforces contests, each with more than 5,000 human participants. The system was limited to 10 submissions per problem, a budget intended to approximate contest conditions, in which incorrect submissions are penalized.[1]

| Metric | Result |
|---|---|
| Average ranking among participants | Top 54.3% |
| Estimated Codeforces Elo rating | ~1,238 |
| Solve rate on CodeContests test set (10 submissions from 1M samples) | 34.2% |
| Solve rate (10 submissions from 100K samples) | 29.6% |
| Solve rate (10 submissions from 10K samples) | 23.2% |
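The "10 submissions from k samples" metric in the table counts a problem as solved if any of the selected submissions passes the hidden tests. A minimal sketch, assuming the verdicts for each problem are already available as a list of booleans in submission order (the data below is invented for illustration):

```python
def solve_rate(per_problem_verdicts: list[list[bool]], budget: int = 10) -> float:
    """Fraction of problems where any of the first `budget` submissions passed."""
    solved = sum(1 for verdicts in per_problem_verdicts if any(verdicts[:budget]))
    return solved / len(per_problem_verdicts)


# Hypothetical verdicts for four problems.
toy_verdicts = [
    [False, True],           # solved on the second submission
    [False] * 10,            # never solved
    [True],                  # solved immediately
    [False] * 11 + [True],   # a pass exists, but only beyond the budget of 10
]
toy_rate = solve_rate(toy_verdicts)
```

Here `toy_rate` is 0.5: the fourth problem does not count as solved because its passing submission falls outside the 10-submission budget.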

The estimated Elo rating of approximately 1,238 placed AlphaCode in the lower range of the "Pupil" category on Codeforces (ratings 1,200-1,399). While this is far from the level of top competitive programmers (whose ratings exceed 3,000), it represented the first time any AI system had reached a competitive level on this platform.[1]

The system showed varying performance across problem categories. It performed best on problems involving bitmasks (33.8% solve rate) and sorting (25.5%), while struggling more with dynamic programming problems (8.8%).[1]

### What does AlphaCode not do?

The AlphaCode team conducted analyses to verify that the system was not simply memorizing and regurgitating solutions from its training data. They found that AlphaCode's generated solutions shared substrings with training data at roughly the same rate as human-written solutions did, suggesting genuine generation rather than memorization. The proportion of dead code (unreachable or unused code) in AlphaCode's solutions was also comparable to that in human solutions.[1]
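One simple way to quantify substring sharing is the longest common substring between a generated solution and each training solution. The sketch below uses Python's standard `difflib` for this. It illustrates the general idea only and is not the measurement code used by the AlphaCode authors; both function names are hypothetical.

```python
from difflib import SequenceMatcher


def longest_shared_substring(generated: str, reference: str) -> str:
    """Longest contiguous run of characters that appears in both strings."""
    # autojunk=False disables difflib's heuristic that ignores very frequent
    # characters in long sequences, so source code is compared exactly.
    matcher = SequenceMatcher(None, generated, reference, autojunk=False)
    match = matcher.find_longest_match(0, len(generated), 0, len(reference))
    return generated[match.a:match.a + match.size]


def max_overlap(generated: str, corpus: list[str]) -> int:
    """Length of the longest substring shared with any reference in the corpus."""
    return max(
        (len(longest_shared_substring(generated, ref)) for ref in corpus),
        default=0,
    )
```

Running the same measurement over human-written solutions gives a baseline: if generated and human solutions show similar overlap distributions, long verbatim copying is an unlikely explanation for the system's performance.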

Sensitivity tests showed that AlphaCode was responsive to meaningful changes in problem descriptions (producing different solutions for modified problems) but robust to trivial variations in phrasing, indicating that it was processing the semantic content of the problem rather than matching surface patterns.[1]

The authors also noted an interesting finding about validation loss during fine-tuning. As fine-tuning progressed, validation loss actually increased, even as the solve rate improved. This means that validation loss is a poor proxy for competitive programming performance, and the team had to rely on solve rate evaluations rather than standard loss metrics to track progress during training.[1]

## What is AlphaCode 2 and how does it differ?

### Overview

AlphaCode 2 was announced on December 6, 2023, alongside the launch of Google's Gemini model family.[3] The AlphaCode 2 Technical Report was published by the AlphaCode Team at Google DeepMind.[3] The system represents a significant leap over the original AlphaCode, both in raw performance and in computational efficiency.[3]

Where AlphaCode 1 used a custom encoder-decoder transformer trained from scratch, AlphaCode 2 is built on top of [Gemini](/wiki/gemini) Pro, a large [multimodal](/wiki/multimodal_ai) language model developed by Google.[3] This foundation provides AlphaCode 2 with stronger reasoning capabilities and a richer understanding of both natural language and code.[3]

### Architecture and Training

AlphaCode 2 applies two consecutive rounds of [fine-tuning](/wiki/fine_tuning) to the Gemini Pro base model using the GOLD algorithm.[3] The first round fine-tunes on an updated version of the CodeContests dataset, which by this point contained approximately 15,000 problems and 30 million human code samples.[3] The second round continues fine-tuning on a different, higher-quality dataset.[3]

The system's pipeline retains the same broad structure as AlphaCode 1 (generate, filter, cluster, select) but adds a learned scoring model:[3]

- **Policy models:** A family of fine-tuned Gemini Pro models generates code samples for each problem.[3]
- **Diverse sampling:** Up to one million code samples are generated per problem, with a randomized temperature parameter assigned to each sample to encourage diversity in solution strategies.[3]
- **Filtering:** Samples are executed against example test cases. Approximately 95% of samples are removed at this stage, leaving an average of roughly 50,000 candidates per problem.[3]
- **Clustering:** A separate model generates synthetic test inputs for each problem. Surviving samples are executed on these inputs, and programs producing identical output signatures are grouped into clusters.[3]
- **Scoring model:** A second Gemini Pro model, fine-tuned specifically for this task, assigns each candidate sample a predicted correctness score between 0 and 1. The highest-scoring sample from each of the 10 largest clusters is selected for submission.[3]
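The selection rule in the last bullet can be sketched as follows. The learned scoring model is replaced here by a plain lookup table of made-up scores, and samples are represented by string identifiers; both are stand-ins for illustration only.

```python
from typing import Callable


def select_with_scores(clusters: list[list[str]],
                       score: Callable[[str], float],
                       budget: int = 10) -> list[str]:
    """Pick the highest-scoring sample from each of the largest clusters."""
    largest = sorted(clusters, key=len, reverse=True)[:budget]
    return [max(cluster, key=score) for cluster in largest]


# Hypothetical clusters of sample IDs and hypothetical correctness scores.
toy_clusters = [["a1", "a2", "a3"], ["b1"], ["c1", "c2"]]
toy_scores = {"a1": 0.2, "a2": 0.9, "a3": 0.5, "b1": 0.7, "c1": 0.4, "c2": 0.1}

picked = select_with_scores(toy_clusters, toy_scores.__getitem__, budget=2)
```

With a budget of 2, `picked` is `["a2", "c1"]`: cluster size decides which clusters are represented, and the score decides which member represents each one. Sample `b1` is skipped despite its high score because its cluster is the smallest.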

The addition of the learned scoring model is a notable improvement over AlphaCode 1, which simply picked an arbitrary representative from each cluster. By training a dedicated model to predict solution correctness, AlphaCode 2 can make more informed selections within each cluster.[3]

Google DeepMind noted that an even more powerful version could potentially be built using Gemini Ultra, a larger and more capable model than Gemini Pro. However, the AlphaCode 2 results were published using only the Gemini Pro foundation, leaving room for future improvements with stronger base models.[3]

### How well did AlphaCode 2 perform?

AlphaCode 2 was evaluated on 12 recent Codeforces contests, each with more than 8,000 participants, for a total of 77 problems.[3] Its performance represents a substantial improvement over the original system.

| Metric | AlphaCode 1 | AlphaCode 2 |
|---|---|---|
| Problems solved (relative) | Baseline | ~1.7x more |
| Problems solved within 10 attempts | 25% | 43% |
| Codeforces percentile | ~46th (top 54%) | ~85th (top 15%) |
| Estimated Codeforces rating | ~1,238 | ~1,650 |
| Codeforces rank category | Pupil | Expert to Candidate Master |

AlphaCode 2 solved 43% of competition problems within 10 attempts, compared to 25% for the original AlphaCode, and reached the 85th percentile on average, performing better than 85% of entrants.[3] The technical report describes this ranking as lying between the "Expert" and "Candidate Master" categories; the estimated rating of approximately 1,650 falls within the Expert band (1,600-1,899), below the Candidate Master threshold of 1,900.[3]

### Sample Efficiency

One of AlphaCode 2's most striking improvements is its sample efficiency. The system is reported to be more than 10,000 times more sample-efficient than the original AlphaCode, meaning it can achieve comparable performance with far fewer generated candidates.[3] In practice, AlphaCode 2 can match AlphaCode 1's performance using only about 100 generated samples, compared to the one million samples that AlphaCode 1 required.[3] This improvement stems from the stronger base model (Gemini Pro), better fine-tuning, and the learned scoring mechanism.[3]

## How does AlphaCode compare with other code generation systems?

AlphaCode occupies a distinct position in the landscape of AI code generation. The following table compares AlphaCode with other notable systems.

| System | Developer | Year | Approach | Codeforces Elo (est.) | Codeforces Percentile |
|---|---|---|---|---|---|
| AlphaCode | [DeepMind](/wiki/deepmind) | 2022 | Encoder-decoder, generate 1M + filter/cluster | ~1,238 | ~46th |
| AlphaCode 2 | [Google DeepMind](/wiki/deepmind) | 2023 | Gemini Pro fine-tuned, generate + score/cluster | ~1,650 | ~85th |
| [Codex](/wiki/openai_codex) | [OpenAI](/wiki/openai) | 2021 | Decoder-only (GPT-3 variant), single-pass generation | Not competitive | Very low |
| [GPT-4](/wiki/gpt-4) | [OpenAI](/wiki/openai) | 2023 | Decoder-only, direct generation | ~392 | Below 5th |
| o1 | [OpenAI](/wiki/openai) | 2024 | Reasoning model with chain-of-thought | ~1,673 | ~89th |
| o3 | [OpenAI](/wiki/openai) | 2024 | Reasoning model, self-verification | ~2,727 | ~99.8th |

Several key differences stand out:

**AlphaCode vs. Codex and GPT-4:** Codex and [GPT-4](/wiki/gpt-4) are general-purpose code generation models optimized for broad programming assistance, including code completion, debugging, and explanation. They generate solutions in a single pass without the massive sampling and filtering pipeline that AlphaCode employs. On competitive programming tasks, GPT-4 scored an estimated Codeforces rating of only 392 (Newbie level, below the 5th percentile), while AlphaCode reached roughly 1,238.[7] The difference highlights that competitive programming requires a fundamentally different approach than general code assistance.

**AlphaCode vs. [OpenAI o1](/wiki/o1) and o3:** OpenAI's o-series reasoning models, released in 2024 and 2025, represent a different paradigm. Rather than generating millions of candidates and filtering externally, o1 and o3 use extended chain-of-thought reasoning to work through problems step by step. The o1 model achieved an estimated Codeforces rating of 1,673 (89th percentile), comparable to AlphaCode 2.[7] The o3 model pushed further, reaching approximately 2,727 (99.8th percentile), roughly equivalent to the 175th best human competitor globally.[7] Notably, o3 achieved this with far fewer samples (around 1,000 per problem) and without hand-crafted domain-specific strategies, relying instead on self-verification by writing and running brute-force solutions to cross-check its optimized implementations.[7]
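The brute-force cross-check described above is a standard competitive programming technique and can be sketched independently of any particular model. The example below checks an optimized maximum-subarray routine against an obviously correct brute-force version on small random inputs. The problem choice, function names, and input ranges are illustrative assumptions, not details from the OpenAI paper.

```python
import random


def max_subarray_fast(values: list[int]) -> int:
    """Kadane's algorithm: best sum of a non-empty contiguous subarray, O(n)."""
    best = current = values[0]
    for v in values[1:]:
        current = max(v, current + v)
        best = max(best, current)
    return best


def max_subarray_brute(values: list[int]) -> int:
    """Try every non-empty contiguous subarray; slow but easy to trust."""
    n = len(values)
    return max(sum(values[i:j]) for i in range(n) for j in range(i + 1, n + 1))


def cross_check(fast, brute, trials: int = 200, seed: int = 0):
    """Return a counterexample input if the two disagree, otherwise None."""
    rng = random.Random(seed)
    for _ in range(trials):
        values = [rng.randint(-10, 10) for _ in range(rng.randint(1, 8))]
        if fast(values) != brute(values):
            return values
    return None
```

A return value of `None` from `cross_check(max_subarray_fast, max_subarray_brute)` means no disagreement was found on the sampled inputs. It raises confidence in the fast version but does not prove it correct.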

**AlphaCodium:** Inspired by AlphaCode's generate-and-test philosophy, CodiumAI released AlphaCodium in January 2024, an open-source tool that uses a "flow engineering" approach to iterative code generation and testing.[8] AlphaCodium showed that the core insight behind AlphaCode (generating multiple candidates and testing them) could be applied more broadly, even without the massive computational resources that DeepMind's system requires.[8]

## What are the limitations and criticisms of AlphaCode?

AlphaCode's generate-and-test approach, while effective, has drawn several criticisms:

**Computational cost:** Generating up to one million candidate solutions per problem is extremely resource-intensive. The sampling, execution, and clustering pipeline requires substantial compute infrastructure, making it impractical for real-time or interactive use. AlphaCode 2 improved efficiency significantly, but the approach remains far more expensive per problem than single-pass generation.[3]

**Scaling concerns:** Computer scientist Ernest Davis observed that the system has "a substantial component of monkeys typing Hamlet." AlphaCode needed one million samples to achieve a 34% solve rate on problems with solutions averaging around 20 lines.[1] Critics have noted that the number of required samples could grow exponentially for longer and more complex programs, potentially limiting the approach's applicability beyond competitive programming.

**Lack of genuine understanding:** Unlike human programmers who use intuition, debugging, and iterative refinement to converge on a solution, AlphaCode relies on statistical coverage of the solution space. It does not debug its own code or reason about why a particular approach fails. The system's success comes from generating enough diverse candidates that at least one is likely correct, rather than from a deep understanding of the problem.

**Dataset dependency:** The filtering pipeline relies on having high-quality test cases to evaluate candidate solutions. In competitive programming, problems come with example tests and hidden test suites, making this approach natural. For general software engineering tasks, where requirements are often ambiguous and test coverage is incomplete, the generate-and-test strategy may be less effective.

**Limited problem complexity:** AlphaCode's performance dropped sharply on harder problem categories. Its 8.8% solve rate on dynamic programming problems (compared to 33.8% on bitmask problems) suggests that the system struggles with problems requiring multi-step reasoning and complex state management.[1]

## Impact and Legacy

AlphaCode's contributions to the field extend beyond its competitive programming results:

**Demonstrating the generate-and-test paradigm:** AlphaCode showed that for hard problems where single-pass generation fails, generating many candidates and filtering them based on execution behavior can be highly effective.[1] This insight has influenced subsequent work on test-time compute scaling and code verification.

**The CodeContests benchmark:** The publicly released CodeContests dataset has become a standard benchmark for evaluating code generation systems on competitive programming tasks.[5] Its rigorous test suite and temporal splits address many weaknesses of earlier benchmarks.

**Advancing code generation research:** By demonstrating that AI could reach human-competitive levels on a challenging and well-defined task, AlphaCode helped catalyze a wave of research and commercial interest in [AI-assisted programming](/wiki/ai_code_generation). Systems like [GitHub Copilot](/wiki/github_copilot), [Cursor](/wiki/cursor), and [Devin](/wiki/devin) owe part of their momentum to the broader attention that AlphaCode brought to the field.

**Bridging to reasoning models:** AlphaCode 2's integration with Gemini Pro foreshadowed the trend of building specialized capabilities on top of large foundation models.[3] The subsequent success of reasoning-oriented models (such as OpenAI's o-series) on competitive programming tasks suggests that the field is moving toward systems that combine strong base models with structured problem-solving strategies, a direction that AlphaCode helped pioneer.[7]

**Influencing test-time compute research:** AlphaCode demonstrated that allocating more computation at inference time (by generating and evaluating many candidates) could substitute for improvements to the base model itself.[1] This trade-off between training-time and test-time compute has become a central theme in AI research, with subsequent systems exploring various ways to spend additional computation during inference to improve output quality. The concept of scaling test-time compute, which gained significant traction in 2024 and 2025, traces part of its lineage to the large-scale sampling strategies that AlphaCode validated.

**Open-source contributions:** By releasing the CodeContests dataset and providing detailed descriptions of their methodology, DeepMind enabled the broader research community to build upon their work.[5] The dataset has been downloaded thousands of times from Hugging Face and GitHub, and it has been used as a benchmark in dozens of subsequent papers on code generation and program synthesis.

## Timeline

| Date | Event |
|---|---|
| February 2, 2022 | DeepMind announces AlphaCode via blog post and preprint |
| March 2022 | Preprint posted to arXiv (2203.07814) |
| December 8, 2022 | Peer-reviewed paper published in *Science* (Vol. 378, pp. 1092-1097) |
| December 6, 2023 | AlphaCode 2 Technical Report released alongside Gemini launch |

## See Also

- [Codex](/wiki/openai_codex)
- [GitHub Copilot](/wiki/github_copilot)
- [GPT-4](/wiki/gpt-4)
- [Gemini](/wiki/gemini)
- [DeepMind](/wiki/deepmind)
- [Transformer](/wiki/transformer)
- [Code Generation](/wiki/ai_code_generation)
- [Large Language Models](/wiki/large_language_model)

## References

1. Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., ... & Vinyals, O. (2022). Competition-level code generation with AlphaCode. *Science*, 378(6624), 1092-1097. https://www.science.org/doi/10.1126/science.abq1158
2. Li, Y., et al. (2022). Competition-Level Code Generation with AlphaCode. *arXiv preprint arXiv:2203.07814*. https://arxiv.org/abs/2203.07814
3. AlphaCode Team, Google DeepMind. (2023). AlphaCode 2 Technical Report. https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf
4. Google DeepMind. (2022). Competitive programming with AlphaCode. Blog post. https://deepmind.google/blog/competitive-programming-with-alphacode/
5. Google DeepMind. (2023). Code Contests dataset. GitHub repository. https://github.com/google-deepmind/code_contests
6. Chen, M., et al. (2021). Evaluating Large Language Models Trained on Code. *arXiv preprint arXiv:2107.03374*.
7. OpenAI. (2025). Competitive Programming with Large Reasoning Models. *arXiv preprint arXiv:2502.06807*. https://arxiv.org/abs/2502.06807
8. Ridnik, T., et al. (2024). Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering. CodiumAI. https://www.codium.ai/blog/alphacodium/