| LiveCodeBench | |
|---|---|
| Overview | |
| Full name | Live Code Benchmark |
| Abbreviation | LCB |
| Description | A holistic and contamination-free evaluation benchmark for code LLMs with continuous updates |
| Release date | 2024-03 |
| Latest version | v6 |
| Benchmark updated | 2025-04 |
| Authors | Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, Ion Stoica |
| Organization | UC Berkeley, MIT, Cornell |
| Technical Details | |
| Type | Code Generation, Code Understanding, Multi-task |
| Modality | Text (Code) |
| Task format | Code generation, self-repair, test output prediction, code execution |
| Number of problems | 1,055 (as of v6) |
| Total examples | 1,055+ |
| Evaluation metric | Pass@1, Pass@5, Execution accuracy |
| Domains | Competitive programming |
| Languages | Python, C++, Java, and others |
| Performance | |
| Human performance | Variable by task and difficulty |
| Baseline | ~20-30% (smaller models) |
| SOTA score | 83.3% |
| SOTA model | DeepSeek-V3.2 (Thinking) |
| SOTA date | 2026-01 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Hugging Face |
| License | MIT |
| Venue | ICLR 2025 |
**LiveCodeBench** is a holistic and contamination-free artificial intelligence benchmark for evaluating large language models on code-related tasks. First released in March 2024 by researchers at UC Berkeley, MIT, and Cornell, the benchmark tackles one of the most persistent problems in code model evaluation: data contamination. Instead of relying on a fixed set of problems that models may have encountered during training, LiveCodeBench continuously collects fresh problems from competitive programming platforms, including LeetCode, AtCoder, and Codeforces. Each problem is tagged with its publication date, allowing evaluators to test models only on problems released after a given model's training cutoff. The benchmark was published as a conference paper at ICLR 2025 and has since become one of the most widely cited benchmarks for measuring coding ability in language models.
By early 2024, the field of code-generating language models had grown rapidly. Benchmarks such as HumanEval, MBPP, and APPS served as standard evaluation tools, but they suffered from a fundamental limitation: they were static. Once a benchmark's problems became publicly available, there was no way to prevent those problems from appearing in future training datasets. As a result, models trained on data collected after a benchmark's release could achieve inflated scores without genuinely improving at coding tasks. This phenomenon, known as data contamination or benchmark leakage, undermined the reliability of published results and made it difficult to compare models fairly.
Several observations highlighted the severity of the contamination problem. Fine-tuned open-source models sometimes achieved high marks on HumanEval while performing poorly on genuinely unseen problems, suggesting that their strong benchmark scores reflected memorization rather than generalization. Meanwhile, closed-source models like GPT-4 maintained more consistent performance across old and new problems, indicating that their capabilities were more robust.
Beyond contamination, existing benchmarks also had a narrow evaluation scope. Most focused exclusively on code generation: given a natural language description, produce a working program. Real-world software development, however, involves debugging, understanding existing code, predicting program behavior, and iterating on failed attempts. A benchmark that measured only generation captured just one facet of what it means to be a capable coding model.
These two problems, contamination and narrow scope, motivated the creation of LiveCodeBench. The researchers set out to build a benchmark that would remain fresh indefinitely through continuous updates and that would measure a broader set of coding capabilities through multiple evaluation scenarios.
LiveCodeBench was developed by a team spanning three universities and Meta AI:
| Author | Affiliation | Role |
|---|---|---|
| Naman Jain | UC Berkeley | Co-lead author |
| King Han | UC Berkeley | Co-lead author |
| Alex Gu | MIT | Contributing author |
| Wen-Ding Li | Cornell | Contributing author |
| Fanjia Yan | UC Berkeley | Contributing author |
| Tianjun Zhang | UC Berkeley | Contributing author |
| Sida Wang | Meta AI | Contributing author |
| Armando Solar-Lezama | MIT | Senior advisor |
| Koushik Sen | UC Berkeley | Senior advisor |
| Ion Stoica | UC Berkeley | Senior advisor |
The project emerged from the UC Berkeley Sky Computing Lab, a research group led by Ion Stoica that focuses on large-scale computing systems and AI infrastructure. The involvement of Armando Solar-Lezama, a leading figure in program synthesis research at MIT, brought additional expertise in automated code reasoning.
LiveCodeBench draws problems from three major competitive programming platforms, each contributing a different style of problem:
| Platform | Problem Style | Contest Frequency | Difficulty Rating System | Test Case Availability |
|---|---|---|---|---|
| LeetCode | Algorithm and data structure puzzles | Weekly and biweekly contests | Easy, Medium, Hard | Full test suites provided |
| AtCoder | Mathematical and algorithmic challenges | Regular rated contests | Beginner to Expert (numeric rating) | Full test suites provided |
| Codeforces | Competitive programming rounds | Bi-weekly rounds (Div. 1-4) | Numeric difficulty rating | Partial (long tests are truncated) |
These platforms host regular programming contests where thousands of participants solve problems under time pressure. The large number of participants ensures that each problem is thoroughly vetted for clarity, correctness, and appropriate difficulty before it enters the benchmark.
The LiveCodeBench team built custom HTML scrapers for each of the three platforms. These scrapers collect problem statements, example input-output pairs, hidden test cases (where available), difficulty ratings, and contest publication dates.
After scraping, the problems go through a filtering step. Problems that require image interpretation are excluded, since the benchmark focuses on text-based code evaluation. Problems that accept multiple valid outputs are also excluded to simplify automated evaluation. For Codeforces, where full test suites are not always provided (long test inputs may be truncated), the researchers semi-automatically construct test case generators: they use the problem specifications to write input generators and validate outputs against correct human solutions, as sketched below.
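A minimal sketch of that process, assuming a hypothetical constraint of the form "n integers, 1 <= n <= 100" and a hypothetical `accepted_solution.py` containing a known-correct human submission:

```python
import random
import subprocess

def generate_input() -> str:
    """Hypothetical input generator written from a problem's constraint
    section, e.g. 'n integers with 1 <= n <= 100, 1 <= a_i <= 10^9'."""
    n = random.randint(1, 100)
    values = [random.randint(1, 10**9) for _ in range(n)]
    return f"{n}\n{' '.join(map(str, values))}\n"

def reference_output(stdin_text: str) -> str:
    """Obtain the ground-truth output by running an accepted human
    solution (accepted_solution.py is a placeholder file name)."""
    proc = subprocess.run(
        ["python", "accepted_solution.py"],
        input=stdin_text, capture_output=True, text=True, timeout=10,
    )
    return proc.stdout

# Pair each generated input with the reference solution's output to
# form a reconstructed test suite for the problem.
tests = [(inp, reference_output(inp)) for inp in (generate_input() for _ in range(20))]
```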
Each platform has its own difficulty rating system. To create a unified difficulty scale, the researchers map platform-specific ratings into three tiers:
| Tier | LeetCode Rating | AtCoder Rating | Codeforces Rating | Typical Algorithmic Complexity |
|---|---|---|---|---|
| Easy | Easy | ABC A-C | Div 3-4 problems | Basic loops, sorting, simple data structures |
| Medium | Medium | ABC D-F | Div 2 A-C | Dynamic programming, graph traversal, binary search |
| Hard | Hard | ARC/AGC problems | Div 1-2 harder problems | Advanced DP, segment trees, complex combinatorics |
Problems rated above a certain difficulty threshold are excluded from the benchmark because they are too difficult for even the strongest models, which would introduce noise into the evaluation without providing useful signal.
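A sketch of the tier mapping under assumed numeric thresholds (the benchmark's actual platform-specific cutoffs may differ):

```python
def unified_tier(platform: str, rating) -> str:
    """Map a platform-specific difficulty rating onto the unified
    Easy/Medium/Hard scale. The numeric thresholds below are illustrative
    assumptions, not the benchmark's exact values."""
    if platform == "leetcode":
        return str(rating).lower()  # LeetCode already uses Easy/Medium/Hard
    if platform == "atcoder":
        return "easy" if rating < 800 else "medium" if rating < 1600 else "hard"
    if platform == "codeforces":
        return "easy" if rating < 1200 else "medium" if rating < 1800 else "hard"
    raise ValueError(f"unknown platform: {platform}")

print(unified_tier("codeforces", 1500))  # medium
```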
On average, each problem in the benchmark comes with approximately 17 test cases. The number varies by platform:
| Platform | Average Test Cases per Problem |
|---|---|
| LeetCode | 19.0 |
| AtCoder | 15.6 |
| CodeForces | 11.1 |
LiveCodeBench is designed to grow over time as new contest problems become available. The team periodically releases updated versions of the dataset:
| Version | Release Date | Problem Count | Coverage Period |
|---|---|---|---|
| v1 | March 2024 | 400 | May 2023 to March 2024 |
| v2 | May 2024 | 511 | May 2023 to May 2024 |
| v3 | July 2024 | 612 | May 2023 to July 2024 |
| v4 | September 2024 | 713 | May 2023 to September 2024 |
| v5 | January 2025 | 880 | May 2023 to January 2025 |
| v6 | April 2025 | 1,055 | May 2023 to April 2025 |
The initial v1 release contained 400 problems. In the accompanying paper, the researchers reported results on a dataset of 511 problems (later designated v2), with a difficulty distribution of 182 Easy, 206 Medium, and 123 Hard problems. Each subsequent release adds problems from contests that occurred after the previous version's cutoff date.
LiveCodeBench goes beyond simple code generation by evaluating models on four distinct tasks. These tasks were chosen because they represent useful components in real-world code LLM workflows and each has a clear, automated evaluation metric.
This is the primary task and the one most directly comparable to benchmarks like HumanEval. The model receives a natural language problem description, including example input-output pairs, and must produce a complete program that passes all hidden test cases.
For instruction-tuned models, the evaluation uses zero-shot prompts. For base models (those without instruction fine-tuning), a one-shot example is provided to demonstrate the expected output format.
The primary metric is Pass@1, defined as the fraction of problems for which the model generates a correct solution on its first attempt. The researchers also compute Pass@5, which measures whether at least one of five generated solutions is correct. To estimate these metrics, 10 candidate solutions are sampled per problem at temperature 0.2 with top_p set to 0.95.
The self-repair task evaluates a model's ability to debug and fix broken code. The evaluation proceeds in two stages: the model first generates a solution as in the code generation task; if that solution fails, the model is given the problem, its faulty program, and error feedback (such as the failing test case or exception), and must produce a corrected program.
The evaluation metric is the combined Pass@1 after the repair step: a problem counts as solved if the model either generated a correct solution on the first attempt or successfully repaired its initial attempt. This mirrors the real-world debugging workflow where a developer writes code, sees test failures, and iterates on the solution.
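As a small illustration, the combined metric over a set of problems might be computed as follows, where each problem contributes a pair of assumed boolean outcomes (initial attempt passed, repair passed):

```python
def combined_pass_at_1(outcomes) -> float:
    """Fraction of problems solved either on the first attempt or after
    the single repair step. `outcomes` is a list of
    (first_attempt_passed, repair_passed) pairs, one per problem."""
    return sum(first or repaired for first, repaired in outcomes) / len(outcomes)

# One problem solved outright, one fixed by repair, one unsolved.
print(combined_pass_at_1([(True, False), (False, True), (False, False)]))  # ~0.667
```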
The code execution task tests whether a model can mentally trace through code and predict its output. The model receives a program snippet along with a specific input and must predict the exact output.
The dataset for this task uses approximately 2,000 correct human-submitted LeetCode solutions. These solutions go through compile-time and runtime filters to ensure they have reasonable complexity and produce deterministic outputs. The final dataset consists of 479 samples drawn from 85 problems.
Two prompting strategies are used: direct prediction, where the model outputs the result immediately, and chain-of-thought (CoT) prompting, where the model reasons step by step before committing to an answer.
The paper found that closed-source models benefited from chain-of-thought prompting on this task, while open-source models sometimes performed worse with chain-of-thought, possibly due to difficulties maintaining coherent multi-step reasoning.
In the test output prediction task, the model receives a problem description and a test input, but not a solution. It must predict the correct output purely from its understanding of the problem.
This task is simpler than code generation in one sense (no code needs to be written) but requires deep comprehension of the problem logic. The dataset contains 442 instances drawn from 181 LeetCode problems. Evaluation uses a zero-shot prompt that asks the model to complete an assertion with the expected output value.
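Illustratively, an instance might look like the following; the function name and the benchmark's exact prompt wording are hypothetical, and scoring is exact match on the predicted value:

```python
# The model sees the problem description plus an incomplete assertion
# such as (count_distinct is a hypothetical problem function):
#
#     assert count_distinct([3, 1, 3, 2]) == ??
#
# and must fill in the expected value from the problem logic alone.

def exact_match(prediction: str, expected: str) -> bool:
    """Exact-match scoring on the predicted output value."""
    return prediction.strip() == expected.strip()

print(exact_match("3", "3"))  # True: [3, 1, 3, 2] has three distinct values
```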
| Task | Input Given to Model | Expected Output | Evaluation Metric | Dataset Size | Real-World Analogy |
|---|---|---|---|---|---|
| Code Generation | Problem description + examples | Working program | Pass@1, Pass@5 | 511 problems (v2) | Writing new code from a specification |
| Self-Repair | Problem + buggy code + error feedback | Corrected program | Combined Pass@1 | Same as code generation | Debugging a failing test |
| Code Execution | Program + input | Predicted output | Exact match accuracy | 479 samples from 85 problems | Code review and tracing |
| Test Output Prediction | Problem description + test input | Expected output | Exact match accuracy | 442 instances from 181 problems | Understanding requirements without coding |
The core contamination prevention mechanism in LiveCodeBench is temporal windowing. Every problem in the benchmark is tagged with the date it was published on its source platform. When evaluating a model, the evaluator can filter problems to include only those released after the model's known training data cutoff. This means that even if older problems have leaked into training data, the evaluation can still produce an uncontaminated score by restricting the evaluation window to newer problems.
For example, if a model's training data ends in September 2023, the evaluator can restrict the evaluation to problems released from October 2023 onward. This guarantees that the model has never seen any of the evaluation problems during training.
Beyond prevention, LiveCodeBench also serves as a contamination detection tool. By comparing a model's performance on problems from different time periods, researchers can identify suspicious performance patterns that suggest data leakage.
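A minimal sketch of both uses, assuming per-problem results are available as (release_date, first_sample_passed) pairs:

```python
from datetime import date

def pass_at_1_by_window(results, cutoff: date):
    """Compute Pass@1 separately for problems released before and after a
    training cutoff. Evaluating only on the post-cutoff window prevents
    contamination; a sharp drop between the two windows suggests the
    older problems leaked into training data."""
    before = [passed for d, passed in results if d <= cutoff]
    after = [passed for d, passed in results if d > cutoff]
    rate = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return rate(before), rate(after)

# Hypothetical model with a September 2023 training cutoff
results = [(date(2023, 6, 1), True), (date(2023, 8, 15), True),
           (date(2023, 11, 2), False), (date(2024, 2, 20), False)]
print(pass_at_1_by_window(results, date(2023, 9, 30)))  # (1.0, 0.0)
```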
The original paper identified several notable contamination patterns:
| Model | Observed Pattern | Likely Explanation |
|---|---|---|
| DeepSeek-Coder | Sharp performance drop on LeetCode problems released after August 2023 | Training data likely included LeetCode problems up to the model's release date |
| GPT-4o | Performance decline on problems released after November 2023 | Training cutoff alignment |
| Codestral | Pass@1 dropped from 36.5% on older problems to 28.3% on newer problems | Training data contamination on earlier problems |
| Claude 3 | Performance drop on problems after respective training date | Cutoff-aligned contamination |
These findings demonstrated that contamination is not limited to open-source models; even major closed-source models show signs of having been exposed to competitive programming problems that appeared online before their training cutoff dates.
Interestingly, the contamination effect was strongest for LeetCode problems and weaker for AtCoder and CodeForces problems. This likely reflects the fact that LeetCode problems and solutions are more widely shared on the open web (through blog posts, GitHub repositories, and discussion forums), making them more likely to appear in web-scraped training data.
All evaluations use a consistent set of generation parameters:
| Parameter | Value |
|---|---|
| Number of samples per problem (n) | 10 |
| Temperature | 0.2 |
| Top-p | 0.95 |
| Metric computation | Unbiased estimator for Pass@1 and Pass@5 |
The low temperature (0.2) ensures relatively deterministic outputs while still allowing some diversity across samples. The unbiased estimator for Pass@k, originally introduced by the Codex paper from OpenAI, computes the metric from the 10 samples without requiring expensive repeated sampling.
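The estimator itself is short; here is a sketch following the Codex paper's formulation, where n samples per problem yield c correct ones:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator from the Codex paper:
    1 - C(n-c, k) / C(n, k), computed as a numerically stable product.
    n: samples per problem (10 in LiveCodeBench), c: samples that pass."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Example: 3 of 10 samples correct -> Pass@1 = 0.30, Pass@5 ≈ 0.92
print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5))
```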
Generated code is executed in a sandboxed environment to validate functional correctness. The researchers use a modified version of the checker from the APPS benchmark, with identified edge cases fixed and the checker simplified for the LiveCodeBench dataset.
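A stripped-down sketch of such a correctness check (a production harness like the modified APPS checker adds process isolation, memory limits, and restricted system access):

```python
import subprocess

def passes_test(source_path: str, stdin_text: str, expected: str,
                time_limit: float = 6.0) -> bool:
    """Run a generated program on one test case with a time limit and
    compare its stdout against the expected output."""
    try:
        proc = subprocess.run(
            ["python", source_path],
            input=stdin_text, capture_output=True, text=True, timeout=time_limit,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0 and proc.stdout.strip() == expected.strip()
```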
LiveCodeBench is open-source under the MIT license. To use it:
```bash
# Clone the repository
git clone https://github.com/LiveCodeBench/LiveCodeBench
cd LiveCodeBench

# Install dependencies
pip install -e .

# Download a specific dataset version
python scripts/download_data.py --version v6
```
Evaluations can also be run programmatically. The snippet below is an illustrative sketch of the intended workflow; consult the repository for the current entry points and argument names:

```python
from livecodebench import LiveCodeBench

# Initialize the benchmark on a specific dataset version
lcb = LiveCodeBench(version='v6')

# Evaluate on problems released after a specific date
results = lcb.evaluate(
    model='gpt-4',
    tasks=['code_generation', 'self_repair'],
    date_range=('2024-01-01', '2025-04-01')
)

# Filter by difficulty
easy_results = lcb.filter_problems(difficulty='easy')
```
The original LiveCodeBench paper evaluated 18 base LLMs and 34 instruction-tuned LLMs, making it one of the largest evaluation studies of code models on competitive programming problems at the time of publication.
In the original evaluation on v2 data, the top-performing models were:
| Model | Pass@1 (Code Generation) | Category |
|---|---|---|
| GPT-4-Turbo | Highest among all models | Closed-source |
| Claude 3 Opus | Second highest, close to GPT-4-Turbo | Closed-source |
| GPT-4 | Strong but slightly below Turbo variant | Closed-source |
| DeepSeek-Coder-Instruct 33B | Best open-source model | Open-source |
| Llama 3 Instruct 70B | Strong open-source performer | Open-source |
The paper found a significant performance gap between closed-source and open-source models. Only the strongest instruction-tuned variants of models with more than 30 billion parameters (such as Llama 3 Instruct 70B, Mixtral, and DeepSeek-Instruct 33B) came close to bridging this gap.
A key finding was the moderate correlation (r = 0.72) between HumanEval scores and LiveCodeBench scores. The researchers identified two distinct clusters of models: those that perform consistently on both benchmarks, and those that score well on HumanEval but substantially worse on LiveCodeBench. The second cluster consisted largely of fine-tuned open-source models, suggesting overfitting to HumanEval-style problems.
For instance, DeepSeek-Instruct 33B trailed GPT-4-Turbo by only 4.3 percentage points on HumanEval+ but by 16.2 points on LiveCodeBench Easy problems. This disparity highlights how static benchmarks can paint a misleading picture of model capabilities.
As LiveCodeBench has continued to be updated, it has become a standard evaluation benchmark tracked by multiple leaderboard platforms. The following table shows top-performing models as of early 2026, evaluated on the code generation task:
| Rank | Model | Organization | Pass@1 Score |
|---|---|---|---|
| 1 | DeepSeek-V3.2 (Thinking) | DeepSeek | 83.3% |
| 2 | MiniMax M2 | MiniMax | 83.0% |
| 3 | LongCat-Flash-Thinking-2601 | Meituan | 82.8% |
| 4 | Nemotron 3 Super (120B A12B) | NVIDIA | 81.2% |
| 5 | Grok-3 Mini | xAI | 80.4% |
| 6 | Grok 4 Fast | xAI | 80.0% |
| 7 | Grok-3 | xAI | 79.4% |
| 8 | Grok-4 Heavy | xAI | 79.4% |
| 9 | LongCat-Flash-Thinking | Meituan | 79.4% |
| 10 | Grok-4 | xAI | 79.0% |
| 11 | MiniMax M2.1 | MiniMax | 78.0% |
| 12 | DeepSeek-V3.2-Exp | DeepSeek | 74.1% |
| 13 | DeepSeek-R1-0528 | DeepSeek | 73.3% |
| 14 | GLM-4.5 | Zhipu AI | 72.9% |
| 15 | Nemotron Nano 9B v2 | NVIDIA | 71.1% |
On third-party aggregators like Artificial Analysis, which evaluate models using their own testing infrastructure, the top scores are even higher, with Gemini 3 Pro Preview reaching 91.7% and Gemini 3 Flash Preview (Reasoning) reaching 90.8%. These differences likely reflect variations in evaluation methodology, prompting strategy, and the specific subset of problems used.
A significant trend visible in the 2025-2026 leaderboard is the dominance of reasoning-enabled models. Models with explicit chain-of-thought or "thinking" modes (such as DeepSeek-V3.2 Thinking, the o-series from OpenAI, and Gemini reasoning variants) consistently outperform their non-reasoning counterparts. The improvement is most pronounced on problems that require structured algorithmic thinking, such as combinatorics and dynamic programming. However, reasoning provides limited gains on problems that require careful observation of edge cases or complex pattern matching.
In June 2025, a related but distinct benchmark called LiveCodeBench Pro was introduced in a paper titled "LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?" (arXiv: 2506.11928). While the original LiveCodeBench targets a broad range of difficulty levels, LiveCodeBench Pro focuses specifically on challenging problems from elite competitive programming contests.
| Feature | LiveCodeBench | LiveCodeBench Pro |
|---|---|---|
| Problem sources | LeetCode, AtCoder, CodeForces | Codeforces, ICPC, IOI |
| Problem count | 1,055+ (v6) | 584 |
| Difficulty focus | Easy to Hard | Medium to Expert (Olympiad-level) |
| Rating system | Platform-specific difficulty tiers | Codeforces-style Elo ratings |
| Annotation | Automated with manual review | Olympiad medalist annotations |
| Error analysis | Automated | Line-by-line medalist review of failures |
LiveCodeBench Pro uses Bayesian Elo ratings to assess model performance, making scores directly comparable to human competitive programmers on platforms like Codeforces. Problems span three tiers based on Elo:
| Tier | Elo Range | Description |
|---|---|---|
| Easy | Up to 2000 | Standard competitive programming problems |
| Medium | 2000 to 3000 | Advanced algorithmic challenges |
| Hard | Above 3000 | Olympiad-level problems requiring deep mathematical reasoning |
LiveCodeBench Pro revealed stark limitations in current models. Without external tools, the best model achieved only 53% Pass@1 on medium-difficulty problems and 0% on hard problems. Allowing multiple attempts (Pass@10) substantially improved performance, with some models gaining over 500 Elo points. The benchmark also introduced a cognitive-focus taxonomy, categorizing problems as "knowledge-heavy" (requiring implementation of known algorithmic templates) or "logic-heavy" (requiring systematic mathematical reasoning). Models performed significantly better on knowledge-heavy problems than on logic-heavy ones.
In May 2025, the LiveCodeBench team also launched GSO (Global Software Optimization), a separate benchmark focused on software optimization rather than algorithmic problem-solving. GSO presents models with codebases and performance tests, tasking them to improve runtime efficiency. The benchmark includes 102 optimization tasks across 10 codebases and uses the Opt@1 metric, measuring the fraction of tasks where a single attempt achieves at least 95% of the speedup that a human expert achieved. GSO also introduced a "Hack Detector" system to identify and penalize deceptive optimizations such as memoization tricks or test harness hijacking.
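Under those definitions, Opt@1 reduces to a simple ratio; a sketch with assumed parallel lists of per-task speedup factors:

```python
def opt_at_1(model_speedups, human_speedups, threshold: float = 0.95) -> float:
    """Fraction of tasks where the model's single attempt reaches at
    least `threshold` (95%) of the human expert's speedup on that task."""
    hits = sum(m >= threshold * h for m, h in zip(model_speedups, human_speedups))
    return hits / len(model_speedups)

# Tasks 1 and 3 meet the 95% bar; task 2 falls short.
print(opt_at_1([2.0, 1.1, 3.0], [2.0, 2.0, 2.5]))  # ~0.667
```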
LiveCodeBench occupies a specific niche in the landscape of code evaluation benchmarks:
| Benchmark | Problem Type | Size | Dynamic Updates | Contamination Prevention | Tasks Evaluated |
|---|---|---|---|---|---|
| HumanEval | Hand-written Python puzzles | 164 | No | None | Code generation only |
| MBPP | Crowd-sourced Python problems | ~1,000 | No | None | Code generation only |
| APPS | Algorithmic problems | 10,000 | No | None | Code generation only |
| CodeContests | Competition problems | ~13,000 | No | None | Code generation only |
| SWE-bench | Real GitHub issues | 2,294 | Periodic | Limited | End-to-end software engineering |
| BigCodeBench | Complex function-level tasks | 1,140 | No | None | Code generation, function completion |
| LiveCodeBench | Competition problems | 1,055+ | Continuous | Temporal windowing | Generation, repair, execution, prediction |
LiveCodeBench's primary advantages are its continuous update mechanism and multi-task evaluation scope. Its primary limitation compared to benchmarks like SWE-bench is that it focuses on self-contained algorithmic problems rather than real-world software engineering tasks that require navigating large codebases, understanding project context, and writing tests.
Since its release, LiveCodeBench has seen broad adoption across both academia and industry: it is tracked by multiple leaderboard platforms and is routinely reported in model release announcements and technical reports.
Despite its contributions, LiveCodeBench has several recognized limitations:
| Limitation | Description | Impact |
|---|---|---|
| Platform dependency | Relies on external contest platforms for new problems | Data availability is tied to contest schedules |
| Algorithmic focus | Problems are primarily algorithmic puzzles from competitions | May not reflect practical software engineering tasks |
| Language bias | While multiple languages are supported, most evaluation focuses on Python | Limited assessment of language-specific capabilities |
| Execution cost | Sandboxed execution of generated code requires significant compute | Resource-intensive evaluation |
| Difficulty ceiling | Problems above a certain difficulty are excluded | Does not test the absolute upper bound of model capability |
| Update irregularity | New problems depend on contest schedules across three platforms | Uneven temporal coverage |
The LiveCodeBench team has outlined several directions for future development, including continued version releases as new contest problems become available and broader evaluation coverage beyond the current Python-centric focus.