CodeContests is a competitive programming dataset created by Google DeepMind for training and evaluating machine learning models on algorithmic problem-solving tasks. Released alongside the AlphaCode system in 2022, the dataset aggregates problems from multiple online judge platforms, including Codeforces, AtCoder, CodeChef, Aizu, and HackerEarth. CodeContests played a central role in demonstrating that large language models can generate code at a level competitive with human programmers in programming contests.
The dataset was introduced in the paper "Competition-Level Code Generation with AlphaCode" by Yujia Li, David Choi, and colleagues at DeepMind, first published as a preprint on arXiv in February 2022 and later in the journal Science in December 2022. CodeContests is publicly available on GitHub under the Apache 2.0 license (code) and Creative Commons Attribution 4.0 International license (data).
Before CodeContests, existing code generation benchmarks such as HumanEval and MBPP focused primarily on short, self-contained programming tasks: function-level completions, basic data manipulation, and introductory-level algorithmic challenges. While these benchmarks provided useful signals for measuring basic code synthesis ability, they did not capture the complexity of real-world algorithmic problem-solving that competitive programmers face.
Competitive programming problems require reading a natural language description that specifies input/output formats, constraints, and edge cases, and then writing a complete program that handles all possible inputs correctly within strict time and memory limits. These problems often demand knowledge of algorithms such as dynamic programming, graph traversal, greedy strategies, number theory, and combinatorics. The gap between simple function completion and full contest-level problem-solving motivated the creation of a more challenging evaluation framework.
A second motivation was test case quality. Existing benchmarks suffered from high false positive rates due to limited test coverage. HumanEval's handcrafted test cases resulted in approximately 30% false positives (where incorrect solutions passed all tests), and the APPS dataset showed false positive rates as high as 60%. The CodeContests authors aimed to address this by generating additional test cases through input mutation techniques and enforcing minimum coverage thresholds.
CodeContests draws problems from five competitive programming platforms, combining newly scraped data with two existing public datasets:
| Platform | Integration Source | Role |
|---|---|---|
| Codeforces | Direct scraping + Description2Code | Primary source for training, validation, and test splits |
| AtCoder | Project CodeNet (Puri et al., 2021) | Training data |
| CodeChef | Description2Code (Caballero et al., 2016) | Training data |
| Aizu | Project CodeNet (Puri et al., 2021) | Training data |
| HackerEarth | Description2Code (Caballero et al., 2016) | Training data |
The validation and test splits consist entirely of newly scraped Codeforces problems, while the training set combines Codeforces data with problems from Description2Code and CodeNet.
The dataset is divided into three partitions with a strict temporal split to prevent data leakage:
| Split | Number of Problems | Temporal Cutoff |
|---|---|---|
| Training | 13,328 | Published on or before July 14, 2021 |
| Validation | 117 | Published between July 15 and September 20, 2021 |
| Test | 165 | Published after September 21, 2021 |
This temporal ordering ensures that no information from validation or test problems could have leaked into the training data. All pre-training and fine-tuning data appeared online before any validation problem, and all validation problems appeared before any test problem.
Each problem in CodeContests is stored as a protocol buffer (ContestProblem) in Riegeli format and contains the following fields:
| Field | Type | Description |
|---|---|---|
| name | String | Problem name or identifier |
| description | String | Full natural language problem statement |
| source | Enum | Origin platform (Codeforces, CodeChef, HackerEarth, AtCoder, Aizu, CodeJam) |
| difficulty | Enum | Difficulty classification (Easy through Hardest, or contest letter A through V) |
| public_tests | Dict | Example input/output pairs visible before submission |
| private_tests | Dict | Hidden test cases used for judging |
| generated_tests | Dict | Automatically generated test cases via input mutation |
| solutions | Dict | Correct human solutions with language labels |
| incorrect_solutions | Dict | Incorrect human solutions with language labels |
| time_limit | Dict | Time constraint in seconds and nanoseconds |
| memory_limit_bytes | Integer | Memory constraint in bytes |
For Codeforces problems specifically, additional metadata is available:
| Field | Description |
|---|---|
| cf_contest_id | Codeforces contest identifier |
| cf_index | Problem letter within the contest (A, B, C, etc.) |
| cf_points | Point value for the problem |
| cf_rating | Difficulty rating (typically 800 to 3500) |
| cf_tags | Algorithm tags (e.g., "greedy", "dp", "graphs", "math") |
Each problem includes three categories of test cases, with the following average counts per problem:
| Test Case Type | Training | Validation | Test |
|---|---|---|---|
| Example (public) tests | 2.0 | 1.5 | 1.7 |
| Hidden (private) tests | 14.8 | 12.9 | 9.4 |
| Generated tests | 79.1 | 190.0 | 192.7 |
The generated test cases significantly outnumber the original hidden tests, particularly in the validation and test splits where rigorous evaluation is most important.
The dataset includes both correct and incorrect human submissions across multiple programming languages:
| Language | Avg. Solutions (Train) | % Correct (Train) | Avg. Solutions (Valid) | % Correct (Valid) | Avg. Solutions (Test) | % Correct (Test) |
|---|---|---|---|---|---|---|
| C++ | 493.4 | 27% | 231.6 | 47% | 196.0 | 45% |
| Python | 281.1 | 47% | 137.2 | 55% | 97.3 | 54% |
| Java | 147.9 | 46% | 131.1 | 54% | 105.2 | 51% |
The inclusion of incorrect solutions is a deliberate design choice. During fine-tuning, models can be conditioned on correctness labels (marking solutions as CORRECT or INCORRECT), which helps the model learn to distinguish between working and broken code.
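The conditioning described above can be sketched as a simple serialization step. The exact label string and layout below are assumptions for illustration; the paper specifies conditioning on correctness, not this precise format:

```python
# Sketch of correctness-label conditioning for fine-tuning: each human
# solution is serialized with a CORRECT/INCORRECT tag, and at sampling
# time the model is always conditioned on CORRECT. The exact string
# layout here is an assumption for illustration.

def format_example(description: str, code: str, is_correct: bool) -> str:
    label = "CORRECT" if is_correct else "INCORRECT"
    return f"{label} SOLUTION\n{description}\n# --- solution ---\n{code}"

example = format_example(
    "Print the sum of two integers a and b.",
    "a, b = map(int, input().split())\nprint(a + b)",
    is_correct=True,
)
```

Training on both labels lets the model absorb far more human data than correct solutions alone would provide, while sampling with the CORRECT prefix steers generation toward working code.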
One of CodeContests' key contributions is its approach to generating additional test cases to reduce false positive rates. The original hidden test cases from competitive programming platforms, while curated by problem setters, often provide insufficient coverage for evaluating machine-generated code.
The test generation process mutates existing test inputs and then verifies each mutated input before accepting it as a new test case.
For each mutated input, the system runs 30 known-correct human solutions and checks whether all produce identical output. If all correct solutions agree on the output for a given mutated input, that input-output pair is accepted as a valid generated test case. The process allocates up to 10 CPU hours per problem, with a target of 200 generated tests per problem. This procedure succeeded in generating a full test suite for 93.7% of problems.
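The mutate-then-verify loop can be sketched as follows, with the pool of known-correct solutions modeled as Python callables and a deliberately toy mutation strategy (the actual pipeline mutates raw input strings with richer strategies and executes real programs under resource limits):

```python
import random

# Sketch of mutation-based test generation: perturb an existing input,
# run it through known-correct solutions, and accept the pair only if
# every correct solution produces the same output.

def mutate_input(test_input: str, rng: random.Random) -> str:
    # Toy mutation: nudge one integer token in the input by +/-1.
    tokens = test_input.split()
    idx = rng.randrange(len(tokens))
    if tokens[idx].lstrip("-").isdigit():
        tokens[idx] = str(int(tokens[idx]) + rng.choice([-1, 1]))
    return " ".join(tokens)

def try_generate_test(test_input, correct_solutions, rng):
    candidate = mutate_input(test_input, rng)
    outputs = {sol(candidate) for sol in correct_solutions}
    if len(outputs) == 1:            # all correct solutions agree
        return candidate, outputs.pop()
    return None                      # ambiguous: discard the mutation

# Example: two behaviorally equivalent "sum a and b" solutions.
sols = [lambda s: str(sum(map(int, s.split()))),
        lambda s: str(int(s.split()[0]) + int(s.split()[1]))]
result = try_generate_test("3 4", sols, random.Random(0))
```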
For validation and test set problems, the dataset enforces a minimum quality threshold: each problem must have at least 5 hidden or generated test cases that produce at least 2 different outputs across the test suite. This requirement prevents trivially solvable problems where a model could pass all tests by always outputting a constant value.
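A minimal sketch of this quality filter, assuming tests are stored as (input, expected output) pairs:

```python
# Sketch of the validation/test quality threshold described above: keep a
# problem only if it has at least 5 hidden-or-generated tests whose
# expected outputs take at least 2 distinct values.

def passes_quality_filter(tests: list[tuple[str, str]]) -> bool:
    if len(tests) < 5:
        return False
    distinct_outputs = {expected for _inp, expected in tests}
    return len(distinct_outputs) >= 2

# A problem whose every test expects "YES" is rejected: a program that
# always prints "YES" would pass it trivially.
constant_tests = [(f"case {i}", "YES") for i in range(10)]
varied_tests = [("1 2", "3"), ("2 2", "4"), ("0 0", "0"),
                ("5 5", "10"), ("1 1", "2")]
```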
The combination of generated tests and filtering dramatically improved evaluation reliability:
| Dataset | False Positive Rate |
|---|---|
| APPS | ~60% |
| CodeContests (original hidden tests only) | ~62% |
| HumanEval | ~30% |
| CodeContests (with generated tests + filtering) | ~4% |
Reducing the false positive rate from 62% to 4% was a significant achievement, as it meant that solutions passing all test cases were far more likely to be genuinely correct rather than coincidentally producing the right output for an insufficient test suite.
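The false positive rate itself can be estimated using the dataset's incorrect_solutions field: run each known-incorrect human solution against a test suite and count how many pass everything. A toy sketch, with solutions again modeled as callables:

```python
# Sketch of estimating a test suite's false positive rate: the fraction
# of known-incorrect solutions that nonetheless pass every test.

def false_positive_rate(incorrect_solutions, tests) -> float:
    passed = 0
    for sol in incorrect_solutions:
        if all(sol(inp) == expected for inp, expected in tests):
            passed += 1   # a wrong solution slipped through the suite
    return passed / len(incorrect_solutions)

# Toy example: the task is "print a+b"; the wrong solution multiplies.
wrong = [lambda s: str(int(s.split()[0]) * int(s.split()[1]))]
tests_weak = [("2 2", "4")]                   # too weak: 2+2 == 2*2
tests_strong = [("2 2", "4"), ("1 2", "3")]   # catches the bug
```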
CodeContests uses a metric called n@k, a generalization of the pass@k metric from related literature (pass@k is the special case where all k samples may be submitted), to evaluate model performance. This metric measures the percentage of problems solved when a model is allowed to generate k candidate solutions and submit the n most promising ones.
The metric is formally defined as the fraction of problems where at least one of the n submitted solutions passes all private test cases, given that the n submissions are selected from k total generated samples. For example, 10@1000 means the model generates 1,000 candidate solutions per problem, selects the 10 best through filtering and clustering, and reports the fraction of problems where at least one of those 10 passes all tests.
This evaluation protocol reflects the real-world competitive programming setting on Codeforces, where contestants can make multiple submissions for each problem. The selection of n candidates from k samples involves two stages: filtering, which discards samples that fail the public example tests, and clustering, which groups the surviving programs by their behavior on model-generated inputs and submits one representative from each of the largest clusters.
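That two-stage selection can be sketched as follows, with candidate programs modeled as callables and the clustering inputs standing in for AlphaCode's model-generated test inputs:

```python
# Sketch of filter-then-cluster candidate selection: drop samples that
# fail the public tests, then group the rest by behavior and submit one
# representative from each of the n largest clusters.

def select_candidates(samples, example_tests, cluster_inputs, n=10):
    # Stage 1: filtering on the public example tests.
    filtered = [s for s in samples
                if all(s(inp) == out for inp, out in example_tests)]
    # Stage 2: programs with identical outputs on the extra inputs are
    # treated as semantically equivalent and share a cluster.
    clusters = {}
    for s in filtered:
        signature = tuple(s(inp) for inp in cluster_inputs)
        clusters.setdefault(signature, []).append(s)
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [cluster[0] for cluster in ranked[:n]]

# Two equivalent sum programs and one buggy product program.
sum1 = lambda s: str(sum(map(int, s.split())))
sum2 = lambda s: str(int(s.split()[0]) + int(s.split()[1]))
prod = lambda s: str(int(s.split()[0]) * int(s.split()[1]))
chosen = select_candidates([sum1, sum2, prod],
                           example_tests=[("1 2", "3")],
                           cluster_inputs=["3 4", "10 10"])
```

Here the product program fails the example test and is filtered out, while the two sum programs collapse into a single cluster, so only one submission is spent on that behavior.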
AlphaCode was the first AI system evaluated on the CodeContests benchmark, and the dataset was specifically designed as part of the AlphaCode project.
AlphaCode uses an encoder-decoder transformer architecture with an asymmetric design. The encoder processes up to 1,536 tokens (the problem description), while the decoder generates up to 768 tokens (the solution code). The model uses multi-query attention, where key-value heads are shared across attention heads within each block, and a SentencePiece tokenizer with a vocabulary of 8,000 tokens.
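Multi-query attention can be sketched in a few lines of numpy: each head has its own queries, but all heads read from one shared key/value projection, which shrinks the incremental decoding cache by a factor of the head count. The dimensions below are illustrative, not AlphaCode's, and the output projection is omitted:

```python
import numpy as np

# Sketch of multi-query attention: per-head queries share a single
# key/value head. Shapes here are illustrative only.

def multi_query_attention(x, wq, wk, wv, num_heads):
    seq, d_model = x.shape
    head_dim = d_model // num_heads
    q = (x @ wq).reshape(seq, num_heads, head_dim)  # one query per head
    k = x @ wk                                      # single shared key head
    v = x @ wv                                      # single shared value head
    scores = np.einsum("shd,td->hst", q, k) / np.sqrt(head_dim)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    out = np.einsum("hst,td->shd", weights, v)      # (seq, heads, head_dim)
    return out.reshape(seq, num_heads * head_dim)

rng = np.random.default_rng(0)
seq, d_model, heads = 6, 32, 4
x = rng.normal(size=(seq, d_model))
wq = rng.normal(size=(d_model, d_model))
wk = rng.normal(size=(d_model, d_model // heads))
wv = rng.normal(size=(d_model, d_model // heads))
y = multi_query_attention(x, wq, wk, wv, heads)
```

Because k and v have only one head, the per-token cache during sampling stores two head_dim vectors instead of two per attention head, which is what makes large-scale sampling cheaper.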
DeepMind trained five model variants of increasing scale:
| Model Size | Parameters | Hidden Dim | Encoder Blocks | Decoder Blocks | Training Tokens |
|---|---|---|---|---|---|
| 300M | 284M | 768 | 4 | 24 | 354B |
| 1B | 1.1B | 1,408 | 5 | 30 | 590B |
| 3B | 2.8B | 2,048 | 6 | 36 | 826B |
| 9B | 8.7B | 3,072 | 8 | 48 | 1,250B |
| 41B | 41.1B | 6,144 | 8 | 56 | 967B |
The models were pre-trained on 715.1 GB of code from a GitHub snapshot taken on July 14, 2021. The pre-training corpus covered 12 programming languages: C++, C#, Go, Java, JavaScript, Lua, PHP, Python, Ruby, Rust, Scala, and TypeScript. Files larger than 1 MB and lines longer than 1,000 characters were excluded.
Pre-training used standard cross-entropy next-token prediction for the decoder and masked language modeling for the encoder, with the AdamW optimizer and a learning rate schedule that warmed up to 10^-4 and then decayed to 10^-5.
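A schedule of that shape might look like the following sketch; the linear warmup length and cosine decay are assumptions for illustration, since only the endpoint learning rates are given above:

```python
import math

# Sketch of a warmup-then-decay schedule with peak 1e-4 and floor 1e-5.
# The linear warmup and cosine decay shapes are assumed, not taken from
# the paper; only the endpoints match the description.

def lr_schedule(step, warmup_steps=1000, total_steps=100_000,
                peak=1e-4, floor=1e-5):
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))
```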
Several specialized techniques were applied during fine-tuning on the CodeContests training set: masked language modeling as an auxiliary encoder loss, tempering, conditioning on problem tags and ratings, value conditioning on solution correctness, and GOLD, an off-policy objective that focuses training on the tokens the model already assigns high likelihood.
AlphaCode's performance on the CodeContests benchmark, using the 41B parameter model:
| Metric | Validation Set | Test Set |
|---|---|---|
| 10@1k | 21.0% | 16.4% |
| 10@10k | 26.2% | 25.4% |
| 10@100k | 31.8% | 29.6% |
| 10@1M | 34.2% | Not reported |
Performance varied by problem type (10@10k with the 41B model on the test set):
| Problem Tag | Solve Rate |
|---|---|
| Bitmasks | 33.8% |
| Math | 28.2% |
| Sortings | 25.5% |
| Greedy | 25.0% |
| Constructive | 14.9% |
| Graphs | 13.6% |
| Dynamic Programming | 8.8% |
AlphaCode was evaluated on 10 recent Codeforces contests held in December 2021, each with over 5,000 participants, where it achieved an average ranking in the top 54.3% of contestants, corresponding to an estimated Codeforces rating of 1238.
This was the first time an AI system achieved competitive performance in real programming contests against human participants.
Ablation studies on the validation set (10@100k) showed the incremental contribution of each training and inference technique:
| Configuration | Solve Rate |
|---|---|
| Baseline (standard fine-tuning) | 15.2% |
| + Masked language modeling | 17.0% |
| + Tempering | 18.7% |
| + Tags and ratings conditioning | 19.3% |
| + Value conditioning | 20.2% |
| + GOLD training | 21.5% |
| + Clustering-based selection | 24.1% |
In December 2023, Google DeepMind released a technical report on AlphaCode 2, a successor system built on top of the Gemini Pro model. AlphaCode 2 used an updated version of the CodeContests dataset containing approximately 15,000 problems and 30 million human code samples with higher-quality manually curated tests on the validation set.
AlphaCode 2 was tested on 77 problems across 12 Codeforces contests with more than 8,000 total participants, solving 43% of problems within 10 attempts and performing better than an estimated 85% of participants.
Like the original AlphaCode, the system was allowed up to 10 submissions per problem; human contestants may also resubmit, but incur a score penalty for each incorrect attempt.
CodeContests occupies a distinct position among code generation benchmarks, targeting a much higher difficulty level than most alternatives.
| Feature | HumanEval | MBPP | APPS | CodeContests |
|---|---|---|---|---|
| Number of problems | 164 | 974 | 10,000 | 13,610 |
| Problem source | Hand-written | Crowd-sourced | Coding platforms | Competitive programming platforms |
| Difficulty level | Basic function completion | Introductory programming | Intro to competition | Competition-level |
| Test case origin | Handcrafted | Handcrafted | Platform tests | Platform + generated |
| False positive rate | ~30% | High (limited tests) | ~60% | ~4% (with generated tests) |
| Includes incorrect solutions | No | No | No | Yes |
| Multiple languages | Python only | Python only | Python only | C++, Python, Java |
| Metadata (ratings, tags) | No | No | Limited | Yes (Codeforces) |
AlphaCode was also evaluated on the APPS benchmark, providing a cross-benchmark comparison with other models:
| Model | Parameters | Samples | Introductory | Interview | Competition |
|---|---|---|---|---|---|
| GPT-Neo 2.7B | 2.7B | 5 attempts | 5.50% | 0.80% | 0.00% |
| Codex 12B | 12B | 1,000 samples, 5 attempts | 24.52% | 3.23% | 3.08% |
| AlphaCode 1B | 1.1B | 10k samples, 5 attempts | 18.18% | 8.21% | 6.65% |
AlphaCode outperformed Codex on Interview and Competition-level problems despite using a smaller model, highlighting the effectiveness of the CodeContests training approach and the filtering/clustering inference strategy.
In January 2024, researchers from CodiumAI (now Qodo) introduced AlphaCodium, a test-based iterative flow engineering approach evaluated on CodeContests. Rather than generating massive numbers of samples, AlphaCodium uses a multi-stage pipeline: reasoning about the problem and its public tests, generating and ranking candidate solutions, creating additional AI-generated tests, and iteratively running and repairing the code against both the public and the AI-generated tests.
Key results on CodeContests:
| Model | Method | Validation (pass@5) | Test (pass@5) |
|---|---|---|---|
| GPT-3.5 | CodeChain | 17% | Not reported |
| GPT-3.5 | AlphaCodium | 25% | 17% |
| GPT-4 | Direct prompt | 19% | Not reported |
| GPT-4 | AlphaCodium | 44% | 29% |
AlphaCodium achieved comparable accuracy to AlphaCode while using four orders of magnitude fewer LLM calls, demonstrating that structured inference flows can substitute for brute-force sample generation.
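The iterate-and-repair core of such a flow can be sketched with the LLM call mocked out; `propose` and the repair policy below are illustrative stand-ins, not AlphaCodium's actual prompts or stages:

```python
# Sketch of test-driven iterative repair: generate a candidate solution,
# run it against tests, and feed the failing cases back into the next
# generation attempt, for a bounded number of rounds.

def iterative_repair(propose, tests, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        solution = propose(feedback)
        failures = [(inp, exp, solution(inp))
                    for inp, exp in tests if solution(inp) != exp]
        if not failures:
            return solution        # all tests pass
        feedback = failures        # failing cases guide the next attempt
    return None

# Mock "LLM": first proposes a buggy product, then the correct sum.
def mock_propose(feedback):
    if feedback is None:
        return lambda s: str(int(s.split()[0]) * int(s.split()[1]))
    return lambda s: str(sum(map(int, s.split())))

tests = [("1 2", "3"), ("2 2", "4")]
fixed = iterative_repair(mock_propose, tests)
```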
Published in the Findings of EMNLP 2025, CodeContests+ introduced an LLM-based agent system for generating higher-quality test cases for the CodeContests dataset.
CodeContests+ roughly doubled the number of "effective problems" (those with both True Positive Rate and True Negative Rate at or above 0.9) compared to the original CodeContests dataset. Even with only one-quarter of the test cases, CodeContests+ yielded over 80% more qualified problems than the original.
A further extension, CodeContests-O, introduced a feedback-driven iterative framework for test case generation. CodeContests-O test suites achieved a True Positive Rate of 89.37% and a True Negative Rate of 90.89%, outperforming both CodeContests and CodeContests+ by 4 to 9 percentage points in discriminating correct from incorrect solutions.
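The True Positive Rate / True Negative Rate evaluation used in these comparisons can be sketched directly: a suite's TPR is the fraction of known-correct solutions it accepts, and its TNR is the fraction of known-incorrect solutions it rejects. Solutions are again modeled as callables:

```python
# Sketch of TPR/TNR for a test suite: accept a solution only if it
# matches the expected output on every test.

def tpr_tnr(tests, correct_solutions, incorrect_solutions):
    accepts = lambda sol: all(sol(inp) == exp for inp, exp in tests)
    tpr = sum(accepts(s) for s in correct_solutions) / len(correct_solutions)
    tnr = sum(not accepts(s) for s in incorrect_solutions) / len(incorrect_solutions)
    return tpr, tnr

tests = [("1 2", "3"), ("2 2", "4")]
good = [lambda s: str(sum(map(int, s.split())))]
bad = [lambda s: str(int(s.split()[0]) * int(s.split()[1]))]
tpr, tnr = tpr_tnr(tests, good, bad)
```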
The raw dataset is distributed in Riegeli format (a Google record-storage format with built-in compression) using protocol buffer schemas. The training split is sharded across 128 files, and the full dataset is approximately 3 GB in size.
For easier access, the dataset is also available on Hugging Face in Parquet format (approximately 2.22 GB), where it can be loaded directly using the Hugging Face Datasets library:
```python
from datasets import load_dataset

dataset = load_dataset("deepmind/code_contests")
```
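Each record then exposes the schema fields listed earlier. The sketch below filters by cf_rating and cf_tags on mock records (kept offline to avoid the download), mirroring how one might slice the real dataset:

```python
# Sketch of slicing problems by Codeforces metadata. The records here are
# mocks shaped like the dataset's schema fields; field values are invented
# for illustration.

mock_records = [
    {"name": "A. Sum", "cf_rating": 800, "cf_tags": ["math"],
     "public_tests": {"input": ["1 2"], "output": ["3"]}},
    {"name": "D. Flows", "cf_rating": 2400, "cf_tags": ["graphs", "flows"],
     "public_tests": {"input": ["..."], "output": ["..."]}},
]

def easy_graph_free(records, max_rating=1200):
    # Keep low-rated problems that are not tagged as graph problems.
    return [r["name"] for r in records
            if r["cf_rating"] <= max_rating and "graphs" not in r["cf_tags"]]
```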
The official codebase requires Bazel for building and supports Linux with clang; code execution is sandboxed.
Solutions are not guaranteed to compile and execute identically to the original contest environments due to differences in compiler versions and available libraries. Some submissions may fail compilation or trigger sandbox violations.
CodeContests has had a meaningful impact on the field of AI code generation in several ways.
First, it established competitive programming as a rigorous evaluation paradigm for code-generating AI systems. Before CodeContests, most evaluations focused on isolated function synthesis. The dataset demonstrated that evaluating models on full contest problems, with strict correctness requirements and adversarial test cases, provides a much more reliable signal of genuine problem-solving ability.
Second, the dataset's approach to test case generation and false positive reduction set a new standard for code evaluation rigor. The mutation-based test generation pipeline and the reduction of false positive rates from 62% to 4% showed that evaluation reliability is just as important as problem difficulty.
Third, the dataset enabled direct comparison between AI systems and human competitive programmers using the Codeforces rating system. This provided an intuitive, widely understood benchmark for AI code generation capability that goes beyond abstract accuracy percentages.
Fourth, CodeContests has been widely adopted as a standard benchmark in the code generation research community. Systems from multiple research groups, including AlphaCodium, CodeChain, and various large language models, have reported results on the dataset, making it a common reference point for measuring progress.
Despite its contributions, CodeContests has several known limitations:
Limited problem count for evaluation: The validation and test sets contain only 117 and 165 problems respectively, which limits statistical confidence in reported results. Small differences in solve rates may not be meaningful.
Test case quality: Research in 2025 found that approximately one-third of the test cases in the original CodeContests dataset were incorrect, motivating the development of CodeContests+ and CodeContests-O. Incorrect test cases can lead to both false positives and false negatives.
Narrow problem domain: The dataset focuses exclusively on competitive programming, which represents a specific style of algorithmic problem-solving. It does not cover software engineering tasks like debugging, code review, refactoring, or building multi-file applications.
Platform bias: The validation and test sets draw exclusively from Codeforces, which may introduce stylistic biases specific to that platform's problem-setting conventions.
Language coverage: While the dataset includes solutions in C++, Python, and Java, competitive programming is heavily biased toward C++ due to its performance characteristics. The distribution of solutions does not reflect general software development language usage.
Evaluation conditions: The n@k evaluation metric, which allows models to generate thousands or millions of candidate solutions and submit only the best ones, does not directly map to how human programmers work. The computational cost of generating millions of samples is substantial.