CodeContests is a competitive programming dataset created by Google DeepMind for training and evaluating machine learning models on algorithmic problem-solving tasks. Released alongside the AlphaCode system in 2022, the dataset aggregates problems from multiple online judge platforms, including Codeforces, AtCoder, CodeChef, Aizu, and HackerEarth. CodeContests played a central role in demonstrating that large language models can generate code at a level competitive with human programmers in programming contests.
The dataset was introduced in the paper "Competition-Level Code Generation with AlphaCode" by Yujia Li, David Choi, and colleagues at DeepMind, first published as a preprint on arXiv in February 2022 and later in the journal Science in December 2022. CodeContests is publicly available on GitHub under the Apache 2.0 license (code) and Creative Commons Attribution 4.0 International license (data).
Before CodeContests, existing code generation benchmarks such as HumanEval and MBPP focused primarily on short, self-contained programming tasks: function-level completions, basic data manipulation, and introductory-level algorithmic challenges. While these benchmarks provided useful signals for measuring basic code synthesis ability, they did not capture the complexity of real-world algorithmic problem-solving that competitive programmers face.
Competitive programming problems require reading a natural language description that specifies input/output formats, constraints, and edge cases, and then writing a complete program that handles all possible inputs correctly within strict time and memory limits. These problems often demand knowledge of algorithms such as dynamic programming, graph traversal, greedy strategies, number theory, and combinatorics. The gap between simple function completion and full contest-level problem-solving motivated the creation of a more challenging evaluation framework.
A second motivation was test case quality. Existing benchmarks suffered from high false positive rates due to limited test coverage. HumanEval's handcrafted test cases resulted in approximately 30% false positives (where incorrect solutions passed all tests), and the APPS dataset showed false positive rates as high as 60%. The CodeContests authors aimed to address this by generating additional test cases through input mutation techniques and enforcing minimum coverage thresholds.
CodeContests draws problems from five competitive programming platforms, combining newly scraped data with two existing public datasets:
| Platform | Integration Source | Role |
|---|---|---|
| Codeforces | Direct scraping + Description2Code | Primary source for training, validation, and test splits |
| AtCoder | Project CodeNet (Puri et al., 2021) | Training data |
| CodeChef | Description2Code (Caballero et al., 2016) | Training data |
| Aizu | Project CodeNet (Puri et al., 2021) | Training data |
| HackerEarth | Description2Code (Caballero et al., 2016) | Training data |
The validation and test splits consist entirely of newly scraped Codeforces problems, while the training set combines Codeforces data with problems from Description2Code and CodeNet.
The dataset is divided into three partitions with a strict temporal split to prevent data leakage:
| Split | Number of Problems | Temporal Cutoff |
|---|---|---|
| Training | 13,328 | Published on or before July 14, 2021 |
| Validation | 117 | Published between July 15 and September 20, 2021 |
| Test | 165 | Published after September 21, 2021 |
This temporal ordering ensures that no information from validation or test problems could have leaked into the training data. All pre-training and fine-tuning data appeared online before any validation problem, and all validation problems appeared before any test problem.
Each problem in CodeContests is stored as a protocol buffer (ContestProblem) in Riegeli format and contains the following fields:
| Field | Type | Description |
|---|---|---|
| name | String | Problem name or identifier |
| description | String | Full natural language problem statement |
| source | Enum | Origin platform (Codeforces, CodeChef, HackerEarth, AtCoder, Aizu, CodeJam) |
| difficulty | Enum | Difficulty classification (Easy through Hardest, or contest letter A through V) |
| public_tests | Dict | Example input/output pairs visible before submission |
| private_tests | Dict | Hidden test cases used for judging |
| generated_tests | Dict | Automatically generated test cases via input mutation |
| solutions | Dict | Correct human solutions with language labels |
| incorrect_solutions | Dict | Incorrect human solutions with language labels |
| time_limit | Dict | Time constraint in seconds and nanoseconds |
| memory_limit_bytes | Integer | Memory constraint in bytes |
For Codeforces problems specifically, additional metadata is available:
| Field | Description |
|---|---|
| cf_contest_id | Codeforces contest identifier |
| cf_index | Problem letter within the contest (A, B, C, etc.) |
| cf_points | Point value for the problem |
| cf_rating | Difficulty rating (typically 800 to 3500) |
| cf_tags | Algorithm tags (e.g., "greedy", "dp", "graphs", "math") |
Each problem includes three categories of test cases, with the following average counts per problem:
| Test Case Type | Training | Validation | Test |
|---|---|---|---|
| Example (public) tests | 2.0 | 1.5 | 1.7 |
| Hidden (private) tests | 14.8 | 12.9 | 9.4 |
| Generated tests | 79.1 | 190.0 | 192.7 |
The generated test cases significantly outnumber the original hidden tests, particularly in the validation and test splits where rigorous evaluation is most important.
The dataset includes both correct and incorrect human submissions across multiple programming languages:
| Language | Avg. Solutions (Train) | % Correct (Train) | Avg. Solutions (Valid) | % Correct (Valid) | Avg. Solutions (Test) | % Correct (Test) |
|---|---|---|---|---|---|---|
| C++ | 493.4 | 27% | 231.6 | 47% | 196.0 | 45% |
| Python | 281.1 | 47% | 137.2 | 55% | 97.3 | 54% |
| Java | 147.9 | 46% | 131.1 | 54% | 105.2 | 51% |
The inclusion of incorrect solutions is a deliberate design choice. During fine-tuning, models can be conditioned on correctness labels (marking solutions as CORRECT or INCORRECT), which helps the model learn to distinguish between working and broken code.
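The conditioning described above can be sketched as a simple serialization step. The exact label string and layout below are assumptions for illustration; the paper specifies conditioning on correctness, not this precise format:

```python
# Sketch of correctness-label conditioning for fine-tuning: each human
# solution is serialized with a CORRECT/INCORRECT tag, and at sampling
# time the model is always conditioned on CORRECT. The exact string
# layout here is an assumption for illustration.

def format_example(description: str, code: str, is_correct: bool) -> str:
    label = "CORRECT" if is_correct else "INCORRECT"
    return f"{label} SOLUTION\n{description}\n# --- solution ---\n{code}"

example = format_example(
    "Print the sum of two integers a and b.",
    "a, b = map(int, input().split())\nprint(a + b)",
    is_correct=True,
)
```

Training on both labels lets the model absorb far more human data than correct solutions alone would provide, while sampling with the CORRECT prefix steers generation toward working code.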
One of CodeContests' key contributions is its approach to generating additional test cases to reduce false positive rates. The original hidden test cases from competitive programming platforms, while curated by problem setters, often provide insufficient coverage for evaluating machine-generated code.
The test generation process mutates existing test inputs and then verifies each mutated input before accepting it as a new test case.
For each mutated input, the system runs 30 known-correct human solutions and checks whether all produce identical output. If all correct solutions agree on the output for a given mutated input, that input-output pair is accepted as a valid generated test case. The process allocates up to 10 CPU hours per problem, with a target of 200 generated tests per problem. This procedure succeeded in generating a full test suite for 93.7% of problems.
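The mutate-then-verify loop can be sketched as follows, with the pool of known-correct solutions modeled as Python callables and a deliberately toy mutation strategy (the actual pipeline mutates raw input strings with richer strategies and executes real programs under resource limits):

```python
import random

# Sketch of mutation-based test generation: perturb an existing input,
# run it through known-correct solutions, and accept the pair only if
# every correct solution produces the same output.

def mutate_input(test_input: str, rng: random.Random) -> str:
    # Toy mutation: nudge one integer token in the input by +/-1.
    tokens = test_input.split()
    idx = rng.randrange(len(tokens))
    if tokens[idx].lstrip("-").isdigit():
        tokens[idx] = str(int(tokens[idx]) + rng.choice([-1, 1]))
    return " ".join(tokens)

def try_generate_test(test_input, correct_solutions, rng):
    candidate = mutate_input(test_input, rng)
    outputs = {sol(candidate) for sol in correct_solutions}
    if len(outputs) == 1:            # all correct solutions agree
        return candidate, outputs.pop()
    return None                      # ambiguous: discard the mutation

# Example: two behaviorally equivalent "sum a and b" solutions.
sols = [lambda s: str(sum(map(int, s.split()))),
        lambda s: str(int(s.split()[0]) + int(s.split()[1]))]
result = try_generate_test("3 4", sols, random.Random(0))
```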
For validation and test set problems, the dataset enforces a minimum quality threshold: each problem must have at least 5 hidden or generated test cases that produce at least 2 different outputs across the test suite. This requirement prevents trivially solvable problems where a model could pass all tests by always outputting a constant value.
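A minimal sketch of this quality filter, assuming tests are stored as (input, expected output) pairs:

```python
# Sketch of the validation/test quality threshold described above: keep a
# problem only if it has at least 5 hidden-or-generated tests whose
# expected outputs take at least 2 distinct values.

def passes_quality_filter(tests: list[tuple[str, str]]) -> bool:
    if len(tests) < 5:
        return False
    distinct_outputs = {expected for _inp, expected in tests}
    return len(distinct_outputs) >= 2

# A problem whose every test expects "YES" is rejected: a program that
# always prints "YES" would pass it trivially.
constant_tests = [(f"case {i}", "YES") for i in range(10)]
varied_tests = [("1 2", "3"), ("2 2", "4"), ("0 0", "0"),
                ("5 5", "10"), ("1 1", "2")]
```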
The combination of generated tests and filtering dramatically improved evaluation reliability:
| Dataset | False Positive Rate |
|---|---|
| APPS | ~60% |
| CodeContests (original hidden tests only) | ~62% |
| HumanEval | ~30% |
| CodeContests (with generated tests + filtering) | ~4% |
Reducing the false positive rate from 62% to 4% was a significant achievement, as it meant that solutions passing all test cases were far more likely to be genuinely correct rather than coincidentally producing the right output for an insufficient test suite.
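The false positive rate itself can be estimated using the dataset's incorrect_solutions field: run each known-incorrect human solution against a test suite and count how many pass everything. A toy sketch, with solutions again modeled as callables:

```python
# Sketch of estimating a test suite's false positive rate: the fraction
# of known-incorrect solutions that nonetheless pass every test.

def false_positive_rate(incorrect_solutions, tests) -> float:
    passed = 0
    for sol in incorrect_solutions:
        if all(sol(inp) == expected for inp, expected in tests):
            passed += 1   # a wrong solution slipped through the suite
    return passed / len(incorrect_solutions)

# Toy example: the task is "print a+b"; the wrong solution multiplies.
wrong = [lambda s: str(int(s.split()[0]) * int(s.split()[1]))]
tests_weak = [("2 2", "4")]                   # too weak: 2+2 == 2*2
tests_strong = [("2 2", "4"), ("1 2", "3")]   # catches the bug
```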
CodeContests uses a metric called n@k, a generalization of the pass@k metric from related literature (pass@k is the special case where all k samples may be submitted), to evaluate model performance. This metric measures the percentage of problems solved when a model is allowed to generate k candidate solutions and submit the n most promising ones.
The metric is formally defined as the fraction of problems where at least one of the n submitted solutions passes all private test cases, given that the n submissions are selected from k total generated samples. For example, 10@1000 means the model generates 1,000 candidate solutions per problem, selects the 10 best through filtering and clustering, and reports the fraction of problems where at least one of those 10 passes all tests.
This evaluation protocol reflects the real-world competitive programming setting on Codeforces, where contestants can make multiple submissions for each problem. The selection of n candidates from k samples involves two stages: filtering, which discards samples that fail the public example tests, and clustering, which groups the surviving programs by their behavior on model-generated inputs and submits one representative from each of the largest clusters.
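That two-stage selection can be sketched as follows, with candidate programs modeled as callables and the clustering inputs standing in for AlphaCode's model-generated test inputs:

```python
# Sketch of filter-then-cluster candidate selection: drop samples that
# fail the public tests, then group the rest by behavior and submit one
# representative from each of the n largest clusters.

def select_candidates(samples, example_tests, cluster_inputs, n=10):
    # Stage 1: filtering on the public example tests.
    filtered = [s for s in samples
                if all(s(inp) == out for inp, out in example_tests)]
    # Stage 2: programs with identical outputs on the extra inputs are
    # treated as semantically equivalent and share a cluster.
    clusters = {}
    for s in filtered:
        signature = tuple(s(inp) for inp in cluster_inputs)
        clusters.setdefault(signature, []).append(s)
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [cluster[0] for cluster in ranked[:n]]

# Two equivalent sum programs and one buggy product program.
sum1 = lambda s: str(sum(map(int, s.split())))
sum2 = lambda s: str(int(s.split()[0]) + int(s.split()[1]))
prod = lambda s: str(int(s.split()[0]) * int(s.split()[1]))
chosen = select_candidates([sum1, sum2, prod],
                           example_tests=[("1 2", "3")],
                           cluster_inputs=["3 4", "10 10"])
```

Here the product program fails the example test and is filtered out, while the two sum programs collapse into a single cluster, so only one submission is spent on that behavior.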
AlphaCode was the first AI system evaluated on the CodeContests benchmark, and the dataset was specifically designed as part of the AlphaCode project.
AlphaCode uses an encoder-decoder transformer architecture with an asymmetric design. The encoder processes up to 1,536 tokens (the problem description), while the decoder generates up to 768 tokens (the solution code). The model uses multi-query attention, where key-value heads are shared across attention heads within each block, and a SentencePiece tokenizer with a vocabulary of 8,000 tokens.
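Multi-query attention can be sketched in a few lines of numpy: each head has its own queries, but all heads read from one shared key/value projection, which shrinks the incremental decoding cache by a factor of the head count. The dimensions below are illustrative, not AlphaCode's, and the output projection is omitted:

```python
import numpy as np

# Sketch of multi-query attention: per-head queries share a single
# key/value head. Shapes here are illustrative only.

def multi_query_attention(x, wq, wk, wv, num_heads):
    seq, d_model = x.shape
    head_dim = d_model // num_heads
    q = (x @ wq).reshape(seq, num_heads, head_dim)  # one query per head
    k = x @ wk                                      # single shared key head
    v = x @ wv                                      # single shared value head
    scores = np.einsum("shd,td->hst", q, k) / np.sqrt(head_dim)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    out = np.einsum("hst,td->shd", weights, v)      # (seq, heads, head_dim)
    return out.reshape(seq, num_heads * head_dim)

rng = np.random.default_rng(0)
seq, d_model, heads = 6, 32, 4
x = rng.normal(size=(seq, d_model))
wq = rng.normal(size=(d_model, d_model))
wk = rng.normal(size=(d_model, d_model // heads))
wv = rng.normal(size=(d_model, d_model // heads))
y = multi_query_attention(x, wq, wk, wv, heads)
```

Because k and v have only one head, the per-token cache during sampling stores two head_dim vectors instead of two per attention head, which is what makes large-scale sampling cheaper.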
DeepMind trained five model variants of increasing scale:
| Model Size | Parameters | Hidden Dim | Encoder Blocks | Decoder Blocks | Training Tokens |
|---|---|---|---|---|---|
| 300M | 284M | 768 | 4 | 24 | 354B |
| 1B | 1.1B | 1,408 | 5 | 30 | 590B |
| 3B | 2.8B | 2,048 | 6 | 36 | 826B |
| 9B | 8.7B | 3,072 | 8 | 48 | 1,250B |
| 41B | 41.1B | 6,144 | 8 | 56 | 967B |
The models were pre-trained on 715.1 GB of code from a GitHub snapshot taken on July 14, 2021. The pre-training corpus covered 12 programming languages: C++, C#, Go, Java, JavaScript, Lua, PHP, Python, Ruby, Rust, Scala, and TypeScript. Files larger than 1 MB and lines longer than 1,000 characters were excluded.
Pre-training used standard cross-entropy next-token prediction for the decoder and masked language modeling for the encoder, with the AdamW optimizer and a learning rate schedule that warmed up to 10^-4 and then decayed to 10^-5.
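A schedule of that shape might look like the following sketch; the linear warmup length and cosine decay are assumptions for illustration, since only the endpoint learning rates are given above:

```python
import math

# Sketch of a warmup-then-decay schedule with peak 1e-4 and floor 1e-5.
# The linear warmup and cosine decay shapes are assumed, not taken from
# the paper; only the endpoints match the description.

def lr_schedule(step, warmup_steps=1000, total_steps=100_000,
                peak=1e-4, floor=1e-5):
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))
```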
Several specialized techniques were applied during fine-tuning on the CodeContests training set: masked language modeling as an auxiliary encoder loss, tempering, conditioning on problem tags and ratings, value conditioning on solution correctness, and GOLD, an off-policy objective that focuses training on the tokens the model already assigns high likelihood.
AlphaCode's performance on the CodeContests benchmark, using the 41B parameter model:
| Metric | Validation Set | Test Set |
|---|---|---|
| 10@1k | 21.0% | 16.4% |
| 10@10k | 26.2% | 25.4% |
| 10@100k | 31.8% | 29.6% |
| 10@1M | 34.2% | Not reported |
Performance varied by problem type (10@10k with the 41B model on the test set):
| Problem Tag | Solve Rate |
|---|---|
| Bitmasks | 33.8% |
| Math | 28.2% |
| Sortings | 25.5% |
| Greedy | 25.0% |
| Constructive | 14.9% |
| Graphs | 13.6% |
| Dynamic Programming | 8.8% |
AlphaCode was evaluated on 10 recent Codeforces contests held in December 2021, each with over 5,000 participants, where it achieved an average ranking in the top 54.3% of contestants, corresponding to an estimated Codeforces rating of 1238.
This was the first time an AI system achieved competitive performance in real programming contests against human participants.
Ablation studies on the validation set (10@100k) showed the incremental contribution of each training and inference technique:
| Configuration | Solve Rate |
|---|---|
| Baseline (standard fine-tuning) | 15.2% |
| + Masked language modeling | 17.0% |
| + Tempering | 18.7% |
| + Tags and ratings conditioning | 19.3% |
| + Value conditioning | 20.2% |
| + GOLD training | 21.5% |
| + Clustering-based selection | 24.1% |
In December 2023, Google DeepMind released a technical report on AlphaCode 2, a successor system built on top of the Gemini Pro model. AlphaCode 2 used an updated version of the CodeContests dataset containing approximately 15,000 problems and 30 million human code samples with higher-quality manually curated tests on the validation set.
AlphaCode 2 was tested on 77 problems across 12 Codeforces contests with more than 8,000 total participants, solving 43% of problems within 10 attempts and performing better than an estimated 85% of participants.
Like the original AlphaCode, the system was allowed up to 10 submissions per problem; human contestants may also resubmit, but incur a score penalty for each incorrect attempt.
CodeContests occupies a distinct position among code generation benchmarks, targeting a much higher difficulty level than most alternatives.
| Feature | HumanEval | MBPP | APPS | CodeContests |
|---|---|---|---|---|
| Number of problems | 164 | 974 | 10,000 | 13,610 |
| Problem source | Hand-written | Crowd-sourced | Coding platforms | Competitive programming platforms |
| Difficulty level | Basic function completion | Introductory programming | Intro to competition | Competition-level |
| Test case origin | Handcrafted | Handcrafted | Platform tests | Platform + generated |
| False positive rate | ~30% | High (limited tests) | ~60% | ~4% (with generated tests) |
| Includes incorrect solutions | No | No | No | Yes |
| Multiple languages | Python only | Python only | Python only | C++, Python, Java |
| Metadata (ratings, tags) | No | No | Limited | Yes (Codeforces) |
AlphaCode was also evaluated on the APPS benchmark, providing a cross-benchmark comparison with other models:
| Model | Parameters | Samples | Introductory | Interview | Competition |
|---|---|---|---|---|---|
| GPT-Neo 2.7B | 2.7B | 5 attempts | 5.50% | 0.80% | 0.00% |
| Codex 12B | 12B | 1,000 samples, 5 attempts | 24.52% | 3.23% | 3.08% |
| AlphaCode 1B | 1.1B | 10k samples, 5 attempts | 18.18% | 8.21% | 6.65% |
AlphaCode outperformed Codex on Interview and Competition-level problems despite using a smaller model, highlighting the effectiveness of the CodeContests training approach and the filtering/clustering inference strategy.
In January 2024, researchers from CodiumAI (now Qodo) introduced AlphaCodium, a test-based iterative flow engineering approach evaluated on CodeContests. Rather than generating massive numbers of samples, AlphaCodium uses a multi-stage pipeline: reasoning about the problem and its public tests, generating and ranking candidate solutions, creating additional AI-generated tests, and iteratively running and repairing the code against both the public and the AI-generated tests.
Key results on CodeContests:
| Model | Method | Validation (pass@5) | Test (pass@5) |
|---|---|---|---|
| GPT-3.5 | CodeChain | 17% | Not reported |
| GPT-3.5 | AlphaCodium | 25% | 17% |
| GPT-4 | Direct prompt | 19% | Not reported |
| GPT-4 | AlphaCodium | 44% | 29% |
AlphaCodium achieved comparable accuracy to AlphaCode while using four orders of magnitude fewer LLM calls, demonstrating that structured inference flows can substitute for brute-force sample generation.
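The iterate-and-repair core of such a flow can be sketched with the LLM call mocked out; `propose` and the repair policy below are illustrative stand-ins, not AlphaCodium's actual prompts or stages:

```python
# Sketch of test-driven iterative repair: generate a candidate solution,
# run it against tests, and feed the failing cases back into the next
# generation attempt, for a bounded number of rounds.

def iterative_repair(propose, tests, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        solution = propose(feedback)
        failures = [(inp, exp, solution(inp))
                    for inp, exp in tests if solution(inp) != exp]
        if not failures:
            return solution        # all tests pass
        feedback = failures        # failing cases guide the next attempt
    return None

# Mock "LLM": first proposes a buggy product, then the correct sum.
def mock_propose(feedback):
    if feedback is None:
        return lambda s: str(int(s.split()[0]) * int(s.split()[1]))
    return lambda s: str(sum(map(int, s.split())))

tests = [("1 2", "3"), ("2 2", "4")]
fixed = iterative_repair(mock_propose, tests)
```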
Published in the Findings of EMNLP 2025, CodeContests+ introduced an LLM-based agent system for generating higher-quality test cases for the CodeContests dataset.
CodeContests+ roughly doubled the number of "effective problems" (those with both True Positive Rate and True Negative Rate at or above 0.9) compared to the original CodeContests dataset. Even with only one-quarter of the test cases, CodeContests+ yielded over 80% more qualified problems than the original.
A further extension, CodeContests-O, introduced a feedback-driven iterative framework for test case generation. CodeContests-O test suites achieved a True Positive Rate of 89.37% and a True Negative Rate of 90.89%, outperforming both CodeContests and CodeContests+ by 4 to 9 percentage points in discriminating correct from incorrect solutions.
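The True Positive Rate / True Negative Rate evaluation used in these comparisons can be sketched directly: a suite's TPR is the fraction of known-correct solutions it accepts, and its TNR is the fraction of known-incorrect solutions it rejects. Solutions are again modeled as callables:

```python
# Sketch of TPR/TNR for a test suite: accept a solution only if it
# matches the expected output on every test.

def tpr_tnr(tests, correct_solutions, incorrect_solutions):
    accepts = lambda sol: all(sol(inp) == exp for inp, exp in tests)
    tpr = sum(accepts(s) for s in correct_solutions) / len(correct_solutions)
    tnr = sum(not accepts(s) for s in incorrect_solutions) / len(incorrect_solutions)
    return tpr, tnr

tests = [("1 2", "3"), ("2 2", "4")]
good = [lambda s: str(sum(map(int, s.split())))]
bad = [lambda s: str(int(s.split()[0]) * int(s.split()[1]))]
tpr, tnr = tpr_tnr(tests, good, bad)
```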
The raw dataset is distributed in Riegeli format (a Google record-storage format with built-in compression) using protocol buffer schemas. The training split is sharded across 128 files, and the full dataset is approximately 3 GB in size.
For easier access, the dataset is also available on Hugging Face in Parquet format (approximately 2.22 GB), where it can be loaded directly using the Hugging Face Datasets library:
```python
from datasets import load_dataset

dataset = load_dataset("deepmind/code_contests")
```
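Each record then exposes the schema fields listed earlier. The sketch below filters by cf_rating and cf_tags on mock records (kept offline to avoid the download), mirroring how one might slice the real dataset:

```python
# Sketch of slicing problems by Codeforces metadata. The records here are
# mocks shaped like the dataset's schema fields; field values are invented
# for illustration.

mock_records = [
    {"name": "A. Sum", "cf_rating": 800, "cf_tags": ["math"],
     "public_tests": {"input": ["1 2"], "output": ["3"]}},
    {"name": "D. Flows", "cf_rating": 2400, "cf_tags": ["graphs", "flows"],
     "public_tests": {"input": ["..."], "output": ["..."]}},
]

def easy_graph_free(records, max_rating=1200):
    # Keep low-rated problems that are not tagged as graph problems.
    return [r["name"] for r in records
            if r["cf_rating"] <= max_rating and "graphs" not in r["cf_tags"]]
```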
The official codebase requires Bazel for building and supports Linux with clang; code execution is sandboxed.
Solutions are not guaranteed to compile and execute identically to the original contest environments due to differences in compiler versions and available libraries. Some submissions may fail compilation or trigger sandbox violations.
CodeContests has had a meaningful impact on the field of AI code generation in several ways.
First, it established competitive programming as a rigorous evaluation paradigm for code-generating AI systems. Before CodeContests, most evaluations focused on isolated function synthesis. The dataset demonstrated that evaluating models on full contest problems, with strict correctness requirements and adversarial test cases, provides a much more reliable signal of genuine problem-solving ability.
Second, the dataset's approach to test case generation and false positive reduction set a new standard for code evaluation rigor. The mutation-based test generation pipeline and the reduction of false positive rates from 62% to 4% showed that evaluation reliability is just as important as problem difficulty.
Third, the dataset enabled direct comparison between AI systems and human competitive programmers using the Codeforces rating system. This provided an intuitive, widely understood benchmark for AI code generation capability that goes beyond abstract accuracy percentages.
Fourth, CodeContests has been widely adopted as a standard benchmark in the code generation research community. Systems from multiple research groups, including AlphaCodium, CodeChain, and various large language models, have reported results on the dataset, making it a common reference point for measuring progress.
Despite its contributions, CodeContests has several known limitations:
Limited problem count for evaluation: The validation and test sets contain only 117 and 165 problems respectively, which limits statistical confidence in reported results. Small differences in solve rates may not be meaningful.
Test case quality: Research in 2025 found that approximately one-third of the test cases in the original CodeContests dataset were incorrect, motivating the development of CodeContests+ and CodeContests-O. Incorrect test cases can lead to both false positives and false negatives.
Narrow problem domain: The dataset focuses exclusively on competitive programming, which represents a specific style of algorithmic problem-solving. It does not cover software engineering tasks like debugging, code review, refactoring, or building multi-file applications.
Platform bias: The validation and test sets draw exclusively from Codeforces, which may introduce stylistic biases specific to that platform's problem-setting conventions.
Language coverage: While the dataset includes solutions in C++, Python, and Java, competitive programming is heavily biased toward C++ due to its performance characteristics. The distribution of solutions does not reflect general software development language usage.
Evaluation conditions: The n@k evaluation metric, which allows models to generate thousands or millions of candidate solutions and submit only the best ones, does not directly map to how human programmers work. The computational cost of generating millions of samples is substantial.