| LiveCodeBench | |
|---|---|
| Overview | |
| Full name | Live Code Benchmark |
| Abbreviation | LCB |
| Description | A holistic and contamination-free evaluation benchmark for code LLMs with continuous updates |
| Release date | 2024-03 |
| Latest version | v6 |
| Benchmark updated | 2025-04 |
| Authors | Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, Ion Stoica |
| Organization | UC Berkeley, MIT, Cornell |
| Technical Details | |
| Type | Code Generation, Code Understanding, Multi-task |
| Modality | Text (Code) |
| Task format | Code generation, self-repair, test output prediction, code execution |
| Number of problems | 1,055 (as of v6) |
| Total examples | 1,055+ |
| Evaluation metric | Pass@1, Pass@5, Execution accuracy |
| Domains | Competitive programming |
| Languages | Python, C++, Java, and others |
| Performance | |
| Human performance | Variable by task and difficulty |
| Baseline | ~20-30% (smaller models) |
| SOTA score | 83.3% |
| SOTA model | DeepSeek-V3.2 (Thinking) |
| SOTA date | 2026-01 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Hugging Face |
| License | MIT |
| Venue | ICLR 2025 |
**LiveCodeBench** is a holistic and contamination-free artificial intelligence benchmark for evaluating large language models on code-related tasks. First released in March 2024 by researchers at UC Berkeley, MIT, and Cornell, the benchmark tackles one of the most persistent problems in code model evaluation: data contamination. Instead of relying on a fixed set of problems that models may have encountered during training, LiveCodeBench continuously collects fresh problems from competitive programming platforms, including LeetCode, AtCoder, and Codeforces. Each problem is tagged with its publication date, allowing evaluators to test models only on problems released after a given model's training cutoff. The benchmark was published as a conference paper at ICLR 2025 and has since become one of the most widely cited benchmarks for measuring coding ability in language models.
By early 2024, the field of code-generating language models had grown rapidly. Benchmarks such as HumanEval, MBPP, and APPS served as standard evaluation tools, but they suffered from a fundamental limitation: they were static. Once a benchmark's problems became publicly available, there was no way to prevent those problems from appearing in future training datasets. As a result, models trained on data collected after a benchmark's release could achieve inflated scores without genuinely improving at coding tasks. This phenomenon, known as data contamination or benchmark leakage, undermined the reliability of published results and made it difficult to compare models fairly.
Several observations highlighted the severity of the contamination problem. Fine-tuned open-source models sometimes achieved high marks on HumanEval while performing poorly on genuinely unseen problems, suggesting that their strong benchmark scores reflected memorization rather than generalization. Meanwhile, closed-source models like GPT-4 maintained more consistent performance across old and new problems, indicating that their capabilities were more robust.
Beyond contamination, existing benchmarks also had a narrow evaluation scope. Most focused exclusively on code generation: given a natural language description, produce a working program. Real-world software development, however, involves debugging, understanding existing code, predicting program behavior, and iterating on failed attempts. A benchmark that measured only generation captured just one facet of what it means to be a capable coding model.
These two problems, contamination and narrow scope, motivated the creation of LiveCodeBench. The researchers set out to build a benchmark that would remain fresh indefinitely through continuous updates and that would measure a broader set of coding capabilities through multiple evaluation scenarios.
LiveCodeBench was developed by a team spanning three universities and Meta AI:
| Author | Affiliation | Role |
|---|---|---|
| Naman Jain | UC Berkeley | Co-lead author |
| King Han | UC Berkeley | Co-lead author |
| Alex Gu | MIT | Contributing author |
| Wen-Ding Li | Cornell | Contributing author |
| Fanjia Yan | UC Berkeley | Contributing author |
| Tianjun Zhang | UC Berkeley | Contributing author |
| Sida Wang | Meta AI | Contributing author |
| Armando Solar-Lezama | MIT | Senior advisor |
| Koushik Sen | UC Berkeley | Senior advisor |
| Ion Stoica | UC Berkeley | Senior advisor |
The project emerged from the UC Berkeley Sky Computing Lab, a research group led by Ion Stoica that focuses on large-scale computing systems and AI infrastructure. The involvement of Armando Solar-Lezama, a leading figure in program synthesis research at MIT, brought additional expertise in automated code reasoning.
LiveCodeBench draws problems from three major competitive programming platforms, each contributing a different style of problem:
| Platform | Problem Style | Contest Frequency | Difficulty Rating System | Test Case Availability |
|---|---|---|---|---|
| LeetCode | Algorithm and data structure puzzles | Weekly and biweekly contests | Easy, Medium, Hard | Full test suites provided |
| AtCoder | Mathematical and algorithmic challenges | Regular rated contests | Beginner to Expert (numeric rating) | Full test suites provided |
| Codeforces | Competitive programming rounds | Bi-weekly rounds (Div. 1-4) | Numeric difficulty rating | Partial (long tests are truncated) |
These platforms host regular programming contests where thousands of participants solve problems under time pressure. The large number of participants ensures that each problem is thoroughly vetted for clarity, correctness, and appropriate difficulty before it enters the benchmark.
The LiveCodeBench team built custom HTML scrapers for each of the three platforms. These scrapers collect problem statements, example input-output pairs, hidden test cases (where available), difficulty ratings, and contest publication dates.
After scraping, the problems go through a filtering step. Problems that require image interpretation are excluded, since the benchmark focuses on text-based code evaluation. Problems that accept multiple valid outputs are also excluded to simplify automated evaluation. For Codeforces, where full test suites are not always provided (long test inputs may be truncated), the researchers semi-automatically construct test case generators: they use the problem specifications to write input generators and validate outputs against correct human solutions, as sketched below.
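A minimal sketch of that process, assuming a hypothetical constraint of the form "n integers, 1 <= n <= 100" and a hypothetical `accepted_solution.py` containing a known-correct human submission:

```python
import random
import subprocess

def generate_input() -> str:
    """Hypothetical input generator written from a problem's constraint
    section, e.g. 'n integers with 1 <= n <= 100, 1 <= a_i <= 10^9'."""
    n = random.randint(1, 100)
    values = [random.randint(1, 10**9) for _ in range(n)]
    return f"{n}\n{' '.join(map(str, values))}\n"

def reference_output(stdin_text: str) -> str:
    """Obtain the ground-truth output by running an accepted human
    solution (accepted_solution.py is a placeholder file name)."""
    proc = subprocess.run(
        ["python", "accepted_solution.py"],
        input=stdin_text, capture_output=True, text=True, timeout=10,
    )
    return proc.stdout

# Pair each generated input with the reference solution's output to
# form a reconstructed test suite for the problem.
tests = [(inp, reference_output(inp)) for inp in (generate_input() for _ in range(20))]
```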
Each platform has its own difficulty rating system. To create a unified difficulty scale, the researchers map platform-specific ratings into three tiers:
| Tier | LeetCode Rating | AtCoder Rating | Codeforces Rating | Typical Algorithmic Complexity |
|---|---|---|---|---|
| Easy | Easy | ABC A-C | Div 3-4 problems | Basic loops, sorting, simple data structures |
| Medium | Medium | ABC D-F | Div 2 A-C | Dynamic programming, graph traversal, binary search |
| Hard | Hard | ARC/AGC problems | Div 1-2 harder problems | Advanced DP, segment trees, complex combinatorics |
Problems rated above a certain difficulty threshold are excluded from the benchmark because they are too difficult for even the strongest models, which would introduce noise into the evaluation without providing useful signal.
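A sketch of the tier mapping under assumed numeric thresholds (the benchmark's actual platform-specific cutoffs may differ):

```python
def unified_tier(platform: str, rating) -> str:
    """Map a platform-specific difficulty rating onto the unified
    Easy/Medium/Hard scale. The numeric thresholds below are illustrative
    assumptions, not the benchmark's exact values."""
    if platform == "leetcode":
        return str(rating).lower()  # LeetCode already uses Easy/Medium/Hard
    if platform == "atcoder":
        return "easy" if rating < 800 else "medium" if rating < 1600 else "hard"
    if platform == "codeforces":
        return "easy" if rating < 1200 else "medium" if rating < 1800 else "hard"
    raise ValueError(f"unknown platform: {platform}")

print(unified_tier("codeforces", 1500))  # medium
```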
On average, each problem in the benchmark comes with approximately 17 test cases. The number varies by platform:
| Platform | Average Test Cases per Problem |
|---|---|
| LeetCode | 19.0 |
| AtCoder | 15.6 |
| CodeForces | 11.1 |
LiveCodeBench is designed to grow over time as new contest problems become available. The team periodically releases updated versions of the dataset:
| Version | Release Date | Problem Count | Coverage Period |
|---|---|---|---|
| v1 | March 2024 | 400 | May 2023 to March 2024 |
| v2 | May 2024 | 511 | May 2023 to May 2024 |
| v3 | July 2024 | 612 | May 2023 to July 2024 |
| v4 | September 2024 | 713 | May 2023 to September 2024 |
| v5 | January 2025 | 880 | May 2023 to January 2025 |
| v6 | April 2025 | 1,055 | May 2023 to April 2025 |
The initial v1 release contained 400 problems. In the accompanying paper, the researchers reported results on a dataset of 511 problems (later designated v2), with a difficulty distribution of 182 Easy, 206 Medium, and 123 Hard problems. Each subsequent release adds problems from contests that occurred after the previous version's cutoff date.
LiveCodeBench goes beyond simple code generation by evaluating models on four distinct tasks. These tasks were chosen because they represent useful components in real-world code LLM workflows and each has a clear, automated evaluation metric.
This is the primary task and the one most directly comparable to benchmarks like HumanEval. The model receives a natural language problem description, including example input-output pairs, and must produce a complete program that passes all hidden test cases.
For instruction-tuned models, the evaluation uses zero-shot prompts. For base models (those without instruction fine-tuning), a one-shot example is provided to demonstrate the expected output format.
The primary metric is Pass@1, defined as the fraction of problems for which the model generates a correct solution on its first attempt. The researchers also compute Pass@5, which measures whether at least one of five generated solutions is correct. To estimate these metrics, 10 candidate solutions are sampled per problem at temperature 0.2 with top_p set to 0.95.
The self-repair task evaluates a model's ability to debug and fix broken code. The evaluation proceeds in two stages: the model first generates a solution as in the code generation task; if that solution fails, the model is given the problem, its faulty program, and error feedback (such as the failing test case or exception), and must produce a corrected program.
The evaluation metric is the combined Pass@1 after the repair step: a problem counts as solved if the model either generated a correct solution on the first attempt or successfully repaired its initial attempt. This mirrors the real-world debugging workflow where a developer writes code, sees test failures, and iterates on the solution.
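As a small illustration, the combined metric over a set of problems might be computed as follows, where each problem contributes a pair of assumed boolean outcomes (initial attempt passed, repair passed):

```python
def combined_pass_at_1(outcomes) -> float:
    """Fraction of problems solved either on the first attempt or after
    the single repair step. `outcomes` is a list of
    (first_attempt_passed, repair_passed) pairs, one per problem."""
    return sum(first or repaired for first, repaired in outcomes) / len(outcomes)

# One problem solved outright, one fixed by repair, one unsolved.
print(combined_pass_at_1([(True, False), (False, True), (False, False)]))  # ~0.667
```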
The code execution task tests whether a model can mentally trace through code and predict its output. The model receives a program snippet along with a specific input and must predict the exact output.
The dataset for this task uses approximately 2,000 correct human-submitted LeetCode solutions. These solutions go through compile-time and runtime filters to ensure they have reasonable complexity and produce deterministic outputs. The final dataset consists of 479 samples drawn from 85 problems.
Two prompting strategies are used: direct prediction, where the model outputs the result immediately, and chain-of-thought (CoT) prompting, where the model reasons step by step before committing to an answer.
The paper found that closed-source models benefited from chain-of-thought prompting on this task, while open-source models sometimes performed worse with chain-of-thought, possibly due to difficulties maintaining coherent multi-step reasoning.
In the test output prediction task, the model receives a problem description and a test input, but not a solution. It must predict the correct output purely from its understanding of the problem.
This task is simpler than code generation in one sense (no code needs to be written) but requires deep comprehension of the problem logic. The dataset contains 442 instances drawn from 181 LeetCode problems. Evaluation uses a zero-shot prompt that asks the model to complete an assertion with the expected output value.
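Illustratively, an instance might look like the following; the function name and the benchmark's exact prompt wording are hypothetical, and scoring is exact match on the predicted value:

```python
# The model sees the problem description plus an incomplete assertion
# such as (count_distinct is a hypothetical problem function):
#
#     assert count_distinct([3, 1, 3, 2]) == ??
#
# and must fill in the expected value from the problem logic alone.

def exact_match(prediction: str, expected: str) -> bool:
    """Exact-match scoring on the predicted output value."""
    return prediction.strip() == expected.strip()

print(exact_match("3", "3"))  # True: [3, 1, 3, 2] has three distinct values
```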
| Task | Input Given to Model | Expected Output | Evaluation Metric | Dataset Size | Real-World Analogy |
|---|---|---|---|---|---|
| Code Generation | Problem description + examples | Working program | Pass@1, Pass@5 | 511 problems (v2) | Writing new code from a specification |
| Self-Repair | Problem + buggy code + error feedback | Corrected program | Combined Pass@1 | Same as code generation | Debugging a failing test |
| Code Execution | Program + input | Predicted output | Exact match accuracy | 479 samples from 85 problems | Code review and tracing |
| Test Output Prediction | Problem description + test input | Expected output | Exact match accuracy | 442 instances from 181 problems | Understanding requirements without coding |
The core contamination prevention mechanism in LiveCodeBench is temporal windowing. Every problem in the benchmark is tagged with the date it was published on its source platform. When evaluating a model, the evaluator can filter problems to include only those released after the model's known training data cutoff. This means that even if older problems have leaked into training data, the evaluation can still produce an uncontaminated score by restricting the evaluation window to newer problems.
For example, if a model's training data ends in September 2023, the evaluator can restrict the evaluation to problems released from October 2023 onward. This guarantees that the model has never seen any of the evaluation problems during training.
Beyond prevention, LiveCodeBench also serves as a contamination detection tool. By comparing a model's performance on problems from different time periods, researchers can identify suspicious performance patterns that suggest data leakage.
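A minimal sketch of both uses, assuming per-problem results are available as (release_date, first_sample_passed) pairs:

```python
from datetime import date

def pass_at_1_by_window(results, cutoff: date):
    """Compute Pass@1 separately for problems released before and after a
    training cutoff. Evaluating only on the post-cutoff window prevents
    contamination; a sharp drop between the two windows suggests the
    older problems leaked into training data."""
    before = [passed for d, passed in results if d <= cutoff]
    after = [passed for d, passed in results if d > cutoff]
    rate = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return rate(before), rate(after)

# Hypothetical model with a September 2023 training cutoff
results = [(date(2023, 6, 1), True), (date(2023, 8, 15), True),
           (date(2023, 11, 2), False), (date(2024, 2, 20), False)]
print(pass_at_1_by_window(results, date(2023, 9, 30)))  # (1.0, 0.0)
```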
The original paper identified several notable contamination patterns:
| Model | Observed Pattern | Likely Explanation |
|---|---|---|
| DeepSeek-Coder | Sharp performance drop on LeetCode problems released after August 2023 | Training data likely included LeetCode problems up to the model's release date |
| GPT-4o | Performance decline on problems released after November 2023 | Training cutoff alignment |
| Codestral | Pass@1 dropped from 36.5% on older problems to 28.3% on newer problems | Training data contamination on earlier problems |
| Claude 3 | Performance drop on problems after respective training date | Cutoff-aligned contamination |
These findings demonstrated that contamination is not limited to open-source models; even major closed-source models show signs of having been exposed to competitive programming problems that appeared online before their training cutoff dates.
Interestingly, the contamination effect was strongest for LeetCode problems and weaker for AtCoder and CodeForces problems. This likely reflects the fact that LeetCode problems and solutions are more widely shared on the open web (through blog posts, GitHub repositories, and discussion forums), making them more likely to appear in web-scraped training data.
All evaluations use a consistent set of generation parameters:
| Parameter | Value |
|---|---|
| Number of samples per problem (n) | 10 |
| Temperature | 0.2 |
| Top-p | 0.95 |
| Metric computation | Unbiased estimator for Pass@1 and Pass@5 |
The low temperature (0.2) ensures relatively deterministic outputs while still allowing some diversity across samples. The unbiased estimator for Pass@k, originally introduced by the Codex paper from OpenAI, computes the metric from the 10 samples without requiring expensive repeated sampling.
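The estimator itself is short; here is a sketch following the Codex paper's formulation, where n samples per problem yield c correct ones:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator from the Codex paper:
    1 - C(n-c, k) / C(n, k), computed as a numerically stable product.
    n: samples per problem (10 in LiveCodeBench), c: samples that pass."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Example: 3 of 10 samples correct -> Pass@1 = 0.30, Pass@5 ≈ 0.92
print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5))
```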
Generated code is executed in a sandboxed environment to validate functional correctness. The researchers use a modified version of the checker from the APPS benchmark, with identified edge cases fixed and the checker simplified for the LiveCodeBench dataset.
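A stripped-down sketch of such a correctness check (a production harness like the modified APPS checker adds process isolation, memory limits, and restricted system access):

```python
import subprocess

def passes_test(source_path: str, stdin_text: str, expected: str,
                time_limit: float = 6.0) -> bool:
    """Run a generated program on one test case with a time limit and
    compare its stdout against the expected output."""
    try:
        proc = subprocess.run(
            ["python", source_path],
            input=stdin_text, capture_output=True, text=True, timeout=time_limit,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0 and proc.stdout.strip() == expected.strip()
```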
LiveCodeBench is open-source under the MIT license. To use it:
```bash
# Clone the repository
git clone https://github.com/LiveCodeBench/LiveCodeBench
cd LiveCodeBench

# Install dependencies
pip install -e .

# Download a specific dataset version
python scripts/download_data.py --version v6
```
Evaluations can also be run programmatically. The snippet below is an illustrative sketch of the intended workflow; consult the repository for the current entry points and argument names:

```python
from livecodebench import LiveCodeBench

# Initialize the benchmark on a specific dataset version
lcb = LiveCodeBench(version='v6')

# Evaluate on problems released after a specific date
results = lcb.evaluate(
    model='gpt-4',
    tasks=['code_generation', 'self_repair'],
    date_range=('2024-01-01', '2025-04-01')
)

# Filter by difficulty
easy_results = lcb.filter_problems(difficulty='easy')
```
The original LiveCodeBench paper evaluated 18 base LLMs and 34 instruction-tuned LLMs, making it one of the largest evaluation studies of code models on competitive programming problems at the time of publication.
In the original evaluation on v2 data, the top-performing models were:
| Model | Pass@1 (Code Generation) | Category |
|---|---|---|
| GPT-4-Turbo | Highest among all models | Closed-source |
| Claude 3 Opus | Second highest, close to GPT-4-Turbo | Closed-source |
| GPT-4 | Strong but slightly below Turbo variant | Closed-source |
| DeepSeek-Coder-Instruct 33B | Best open-source model | Open-source |
| Llama 3 Instruct 70B | Strong open-source performer | Open-source |
The paper found a significant performance gap between closed-source and open-source models. Only the strongest instruction-tuned variants of models with more than 30 billion parameters (such as Llama 3 Instruct 70B, Mixtral, and DeepSeek-Instruct 33B) came close to bridging this gap.
A key finding was the moderate correlation (r = 0.72) between HumanEval scores and LiveCodeBench scores. The researchers identified two distinct clusters of models: those that perform consistently on both benchmarks, and those that score well on HumanEval but substantially worse on LiveCodeBench. The second cluster consisted largely of fine-tuned open-source models, suggesting overfitting to HumanEval-style problems.
For instance, DeepSeek-Instruct 33B trailed GPT-4-Turbo by only 4.3 percentage points on HumanEval+ but by 16.2 points on LiveCodeBench Easy problems. This disparity highlights how static benchmarks can paint a misleading picture of model capabilities.
As LiveCodeBench has continued to be updated, it has become a standard evaluation benchmark tracked by multiple leaderboard platforms. The following table shows top-performing models as of early 2026, evaluated on the code generation task:
| Rank | Model | Organization | Pass@1 Score |
|---|---|---|---|
| 1 | DeepSeek-V3.2 (Thinking) | DeepSeek | 83.3% |
| 2 | MiniMax M2 | MiniMax | 83.0% |
| 3 | LongCat-Flash-Thinking-2601 | Meituan | 82.8% |
| 4 | Nemotron 3 Super (120B A12B) | NVIDIA | 81.2% |
| 5 | Grok-3 Mini | xAI | 80.4% |
| 6 | Grok 4 Fast | xAI | 80.0% |
| 7 | Grok-3 | xAI | 79.4% |
| 8 | Grok-4 Heavy | xAI | 79.4% |
| 9 | LongCat-Flash-Thinking | Meituan | 79.4% |
| 10 | Grok-4 | xAI | 79.0% |
| 11 | MiniMax M2.1 | MiniMax | 78.0% |
| 12 | DeepSeek-V3.2-Exp | DeepSeek | 74.1% |
| 13 | DeepSeek-R1-0528 | DeepSeek | 73.3% |
| 14 | GLM-4.5 | Zhipu AI | 72.9% |
| 15 | Nemotron Nano 9B v2 | NVIDIA | 71.1% |
On third-party aggregators like Artificial Analysis, which evaluate models using their own testing infrastructure, the top scores are even higher, with Gemini 3 Pro Preview reaching 91.7% and Gemini 3 Flash Preview (Reasoning) reaching 90.8%. These differences likely reflect variations in evaluation methodology, prompting strategy, and the specific subset of problems used.
A significant trend visible in the 2025-2026 leaderboard is the dominance of reasoning-enabled models. Models with explicit chain-of-thought or "thinking" modes (such as DeepSeek-V3.2 Thinking, the o-series from OpenAI, and Gemini reasoning variants) consistently outperform their non-reasoning counterparts. The improvement is most pronounced on problems that require structured algorithmic thinking, such as combinatorics and dynamic programming. However, reasoning provides limited gains on problems that require careful observation of edge cases or complex pattern matching.
In June 2025, a related but distinct benchmark called LiveCodeBench Pro was introduced in a paper titled "LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?" (arXiv: 2506.11928). While the original LiveCodeBench targets a broad range of difficulty levels, LiveCodeBench Pro focuses specifically on challenging problems from elite competitive programming contests.
| Feature | LiveCodeBench | LiveCodeBench Pro |
|---|---|---|
| Problem sources | LeetCode, AtCoder, CodeForces | Codeforces, ICPC, IOI |
| Problem count | 1,055+ (v6) | 584 |
| Difficulty focus | Easy to Hard | Medium to Expert (Olympiad-level) |
| Rating system | Platform-specific difficulty tiers | Codeforces-style Elo ratings |
| Annotation | Automated with manual review | Olympiad medalist annotations |
| Error analysis | Automated | Line-by-line medalist review of failures |
LiveCodeBench Pro uses Bayesian Elo ratings to assess model performance, making scores directly comparable to human competitive programmers on platforms like Codeforces. Problems span three tiers based on Elo:
| Tier | Elo Range | Description |
|---|---|---|
| Easy | Up to 2000 | Standard competitive programming problems |
| Medium | 2000 to 3000 | Advanced algorithmic challenges |
| Hard | Above 3000 | Olympiad-level problems requiring deep mathematical reasoning |
LiveCodeBench Pro revealed stark limitations in current models. Without external tools, the best model achieved only 53% Pass@1 on medium-difficulty problems and 0% on hard problems. Allowing multiple attempts (Pass@10) substantially improved performance, with some models gaining over 500 Elo points. The benchmark also introduced a cognitive-focus taxonomy, categorizing problems as "knowledge-heavy" (requiring implementation of known algorithmic templates) or "logic-heavy" (requiring systematic mathematical reasoning). Models performed significantly better on knowledge-heavy problems than on logic-heavy ones.
In May 2025, the LiveCodeBench team also launched GSO (Global Software Optimization), a separate benchmark focused on software optimization rather than algorithmic problem-solving. GSO presents models with codebases and performance tests, tasking them to improve runtime efficiency. The benchmark includes 102 optimization tasks across 10 codebases and uses the Opt@1 metric, measuring the fraction of tasks where a single attempt achieves at least 95% of the speedup that a human expert achieved. GSO also introduced a "Hack Detector" system to identify and penalize deceptive optimizations such as memoization tricks or test harness hijacking.
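Under those definitions, Opt@1 reduces to a simple ratio; a sketch with assumed parallel lists of per-task speedup factors:

```python
def opt_at_1(model_speedups, human_speedups, threshold: float = 0.95) -> float:
    """Fraction of tasks where the model's single attempt reaches at
    least `threshold` (95%) of the human expert's speedup on that task."""
    hits = sum(m >= threshold * h for m, h in zip(model_speedups, human_speedups))
    return hits / len(model_speedups)

# Tasks 1 and 3 meet the 95% bar; task 2 falls short.
print(opt_at_1([2.0, 1.1, 3.0], [2.0, 2.0, 2.5]))  # ~0.667
```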
LiveCodeBench occupies a specific niche in the landscape of code evaluation benchmarks:
| Benchmark | Problem Type | Size | Dynamic Updates | Contamination Prevention | Tasks Evaluated |
|---|---|---|---|---|---|
| HumanEval | Hand-written Python puzzles | 164 | No | None | Code generation only |
| MBPP | Crowd-sourced Python problems | ~1,000 | No | None | Code generation only |
| APPS | Algorithmic problems | 10,000 | No | None | Code generation only |
| CodeContests | Competition problems | ~13,000 | No | None | Code generation only |
| SWE-bench | Real GitHub issues | 2,294 | Periodic | Limited | End-to-end software engineering |
| BigCodeBench | Complex function-level tasks | 1,140 | No | None | Code generation, function completion |
| LiveCodeBench | Competition problems | 1,055+ | Continuous | Temporal windowing | Generation, repair, execution, prediction |
LiveCodeBench's primary advantages are its continuous update mechanism and multi-task evaluation scope. Its primary limitation compared to benchmarks like SWE-bench is that it focuses on self-contained algorithmic problems rather than real-world software engineering tasks that require navigating large codebases, understanding project context, and writing tests.
Since its release, LiveCodeBench has seen broad adoption across both academia and industry: it is tracked by multiple leaderboard platforms and is routinely reported in model release announcements and technical reports.
Despite its contributions, LiveCodeBench has several recognized limitations:
| Limitation | Description | Impact |
|---|---|---|
| Platform dependency | Relies on external contest platforms for new problems | Data availability is tied to contest schedules |
| Algorithmic focus | Problems are primarily algorithmic puzzles from competitions | May not reflect practical software engineering tasks |
| Language bias | While multiple languages are supported, most evaluation focuses on Python | Limited assessment of language-specific capabilities |
| Execution cost | Sandboxed execution of generated code requires significant compute | Resource-intensive evaluation |
| Difficulty ceiling | Problems above a certain difficulty are excluded | Does not test the absolute upper bound of model capability |
| Update irregularity | New problems depend on contest schedules across three platforms | Uneven temporal coverage |
The LiveCodeBench team has outlined several directions for future development, including continued version releases as new contest problems become available and broader evaluation coverage beyond the current Python-centric focus.