LiveCodeBench
| LiveCodeBench | |
|---|---|
| Overview | |
| Full name | Live Code Benchmark |
| Abbreviation | LCB |
| Description | A holistic and contamination-free evaluation benchmark for code LLMs with continuous updates |
| Release date | 2024-03 |
| Latest version | v6 |
| Benchmark updated | 2025-04 |
| Authors | Naman Jain, King Han, Alex Gu, et al. |
| Organization | UC Berkeley, MIT, Ion Stoica Lab |
| Technical Details | |
| Type | Code Generation, Code Understanding, Multi-task |
| Modality | Text (Code) |
| Task format | Code generation, self-repair, test output prediction, execution |
| Number of tasks | 1055+ (as of v6) |
| Total examples | 1055+ |
| Evaluation metric | Pass@1, Pass@k, Execution accuracy |
| Domains | Competitive programming, Software engineering |
| Languages | Multiple programming languages |
| Performance | |
| Human performance | Variable by task |
| Baseline | ~20-30% (smaller models) |
| SOTA score | 73.3% |
| SOTA model | DeepSeek R1-0528 |
| SOTA date | 2025-05 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | MIT |
LiveCodeBench is a comprehensive artificial intelligence benchmark designed to provide holistic and contamination-free evaluation of large language models for code-related tasks. Released in March 2024 by researchers from UC Berkeley, MIT, and other institutions, LiveCodeBench addresses the critical issue of data contamination in code benchmarks by continuously collecting new problems from competitive programming platforms including LeetCode, AtCoder, and CodeForces, ensuring models are evaluated on problems they haven't encountered during training.
Overview
LiveCodeBench reworks code model evaluation as a dynamic, continuously updated benchmark designed to prevent test-set contamination. Unlike static benchmarks, where models may have seen test problems during training, LiveCodeBench maintains temporal integrity by tagging problems with release dates and evaluating models only on problems released after their training cutoff dates.
Motivation
The development of LiveCodeBench was driven by several critical challenges in code model evaluation:
- Widespread contamination in existing static benchmarks
- Limited scope of evaluation focusing only on code generation
- Lack of real-world software engineering task coverage
- Need for holistic assessment of coding capabilities
- Absence of temporal tracking for fair model comparison
The benchmark specifically addresses the need for comprehensive evaluation that mirrors real-world software development practices, including debugging, testing, and code understanding.
Technical Architecture
Core Components
| Component | Description | Function |
|---|---|---|
| Problem Collector | Automated system for gathering new problems | Maintains fresh evaluation data |
| Temporal Tagger | Date-based problem classification | Ensures contamination-free evaluation |
| Multi-task Evaluator | Comprehensive task assessment framework | Holistic capability measurement |
| Execution Environment | Sandboxed code execution system | Validates functional correctness |
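The execution component boils down to running a candidate solution against the problem's test cases and comparing outputs. The sketch below illustrates that idea under simple assumptions (a Python solution file judged on stdin/stdout pairs); the function names, timeout, and exact-match rule are illustrative, not LiveCodeBench's actual harness, and a bare subprocess call is not a real sandbox; the benchmark's environment adds isolation and resource limits.

```python
import subprocess

def run_solution(solution_path: str, stdin_data: str, timeout_s: int = 10) -> str | None:
    """Run a candidate Python solution in a subprocess and capture its stdout."""
    try:
        proc = subprocess.run(
            ["python3", solution_path],
            input=stdin_data,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None  # treat timeouts as failures
    if proc.returncode != 0:
        return None  # runtime error
    return proc.stdout

def passes_all_tests(solution_path: str, test_cases: list[tuple[str, str]]) -> bool:
    """A solution passes only if every test's output matches the expected output
    (compared here after stripping trailing whitespace)."""
    for stdin_data, expected in test_cases:
        out = run_solution(solution_path, stdin_data)
        if out is None or out.strip() != expected.strip():
            return False
    return True
```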
Problem Sources
LiveCodeBench aggregates problems from three major competitive programming platforms:
| Platform | Problem Types | Update Frequency | Difficulty Range |
|---|---|---|---|
| LeetCode | Algorithm, data structures | Weekly contests | Easy to Hard |
| AtCoder | Mathematical, algorithmic | Regular contests | Beginner to Expert |
| CodeForces | Competitive programming | Bi-weekly rounds | Div 3 to Div 1 |
Dataset Evolution
| Version | Release Date | Problem Count | Coverage Period |
|---|---|---|---|
| v1 | March 2024 | 400 | May 2023 - March 2024 |
| v4 | September 2024 | 713 | May 2023 - September 2024 |
| v5 | January 2025 | 880 | May 2023 - January 2025 |
| v6 | April 2025 | 1055 | May 2023 - April 2025 |
Evaluation Tasks
Task Categories
LiveCodeBench evaluates models across four primary task categories:
| Task | Description | Real-world Relevance | Evaluation Metric |
|---|---|---|---|
| Code Generation | Generate complete solutions from problem descriptions | Core programming skill | Pass@1, Pass@k |
| Self-Repair | Fix bugs in provided incorrect code | Debugging capability | Repair success rate |
| Test Output Prediction | Predict outputs for given test cases | Code understanding | Prediction accuracy |
| Code Execution | Trace and predict execution behavior | Runtime analysis | Execution accuracy |
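The Pass@1 and Pass@k metrics listed above are conventionally computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021): with n sampled solutions per problem, c of which pass all tests, pass@k = 1 - C(n-c, k) / C(n, k). A short sketch of that formula (not code taken from the LiveCodeBench repository):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from the n generations is correct, given c of the n passed all tests."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 3 of which are correct
print(pass_at_k(n=10, c=3, k=1))  # 0.3
print(pass_at_k(n=10, c=3, k=5))  # ~0.917
```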
Difficulty Levels
Problems are categorized into three difficulty tiers:
| Level | Description | Typical Complexity | Success Rate Range |
|---|---|---|---|
| Easy | Basic algorithms and data structures | O(n), O(n log n) | 60-80% |
| Medium | Intermediate algorithms, optimization | O(n²), dynamic programming | 30-60% |
| Hard | Advanced algorithms, complex logic | Complex DP, graphs | 10-30% |
Contamination Prevention
Temporal Windowing
LiveCodeBench implements a sophisticated contamination prevention system:
| Strategy | Implementation | Benefit |
|---|---|---|
| Release Dating | Each problem tagged with publication date | Temporal tracking |
| Model Cutoff Dates | Track training data cutoffs for all models | Fair comparison |
| Dynamic Filtering | Only evaluate on post-cutoff problems | Contamination-free |
| Red Flagging | Mark potentially contaminated results | Transparency |
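In practice, the release-date and cutoff-date bookkeeping above reduces to a simple filter: a problem counts toward a model's contamination-free score only if it was published after that model's training cutoff. A minimal sketch of that filter follows; the Problem record, field names, and example IDs are illustrative assumptions, not LiveCodeBench's internal schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    problem_id: str
    source: str        # e.g. "leetcode", "atcoder", "codeforces"
    difficulty: str    # "easy", "medium", or "hard"
    release_date: date

def contamination_free(problems: list[Problem], model_cutoff: date) -> list[Problem]:
    """Keep only problems released strictly after the model's training-data cutoff."""
    return [p for p in problems if p.release_date > model_cutoff]

# Example: a model trained on data up to 2023-09-30 is scored only on later problems.
problems = [
    Problem("lc-3001", "leetcode", "medium", date(2023, 8, 12)),   # hypothetical IDs
    Problem("abc-321-d", "atcoder", "easy", date(2023, 10, 7)),
]
eligible = contamination_free(problems, model_cutoff=date(2023, 9, 30))
# -> only the post-cutoff AtCoder problem remains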
Contamination Detection
The benchmark can identify potential contamination through:
- Performance drops on newer problems
- Anomalous accuracy patterns
- Temporal performance analysis
- Cross-reference with model training dates
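One simple version of this temporal analysis is to split a model's per-problem results at its training cutoff and flag an unusually large accuracy gap. The sketch below is a hypothetical illustration; the result format and the 10-point threshold are assumptions, not the benchmark's actual red-flagging rule.

```python
from datetime import date

def flag_contamination(results: list[dict], cutoff: date, max_gap: float = 0.10) -> bool:
    """results: [{'release_date': date, 'passed': bool}, ...]
    Return True if pre-cutoff accuracy exceeds post-cutoff accuracy by more than max_gap."""
    pre = [r["passed"] for r in results if r["release_date"] <= cutoff]
    post = [r["passed"] for r in results if r["release_date"] > cutoff]
    if not pre or not post:
        return False  # not enough problems on one side of the cutoff to compare
    pre_acc = sum(pre) / len(pre)
    post_acc = sum(post) / len(post)
    return (pre_acc - post_acc) > max_gap
```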
Performance Analysis
Current Leaderboard (2025)
| Rank | Model | Pass@1 Overall | Code Generation | Self-Repair | Test Prediction |
|---|---|---|---|---|---|
| 1 | DeepSeek R1-0528 | 73.3% | 75% | 68% | 72% |
| 2 | GPT-4 Turbo | 71.2% | 73% | 66% | 70% |
| 3 | Claude 3 Opus | 70.8% | 71% | 65% | 74% |
| 4 | GPT-4o | 69.5% | 72% | 64% | 68% |
| 5 | Claude 3.5 Sonnet | 67.3% | 69% | 62% | 66% |
| 6 | DeepSeek-Coder-33B | 63.5% | 65% | 58% | 62% |
| 7 | Phind-CodeLlama-34B | 61.2% | 63% | 56% | 60% |
| 8 | CodeLlama-70B | 58.7% | 60% | 54% | 57% |
Performance Insights
Task-Specific Observations
- **Code Generation**: Models excel at standard algorithmic problems
- **Self-Repair**: Significant performance drop compared to generation
- **Test Prediction**: Claude models show particular strength
- **Execution Tracing**: Most challenging task across all models
Contamination Effects
DeepSeek models exhibit a notable performance pattern:
- Pre-September 2023 problems: ~80% accuracy
- Post-September 2023 problems: ~63% accuracy
- A gap of this size suggests the earlier problems may have leaked into training data
Implementation
Installation and Setup
```bash
# Clone the repository
git clone https://github.com/LiveCodeBench/LiveCodeBench
cd LiveCodeBench

# Install dependencies
pip install -e .

# Download the latest dataset
python scripts/download_data.py --version v6
```
Running Evaluations
```python
# Basic evaluation
from livecodebench import LiveCodeBench

# Initialize benchmark
lcb = LiveCodeBench(version='v6')

# Evaluate model on recent problems
results = lcb.evaluate(
    model='gpt-4',
    tasks=['code_generation', 'self_repair'],
    date_range=('2024-01-01', '2025-04-01')
)

# Get contamination-free results
clean_results = lcb.evaluate_clean(
    model='gpt-4',
    cutoff_date='2023-09-01'
)
```
Custom Problem Filtering
```python
# Filter by difficulty
easy_problems = lcb.filter_problems(difficulty='easy')

# Filter by platform
leetcode_only = lcb.filter_problems(source='leetcode')

# Temporal filtering
recent_problems = lcb.filter_problems(
    start_date='2025-01-01',
    end_date='2025-04-01'
)
```
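Once a problem subset has been selected, reporting Pass@1 for each slice is just an average of per-problem outcomes. The sketch below assumes per-problem results are available as booleans; the dictionary layout and problem IDs are illustrative, not the library's return format.

```python
def pass_at_1(per_problem_passed: dict[str, bool]) -> float:
    """Fraction of problems whose first generated solution passed all tests."""
    if not per_problem_passed:
        return 0.0
    return sum(per_problem_passed.values()) / len(per_problem_passed)

# Example: report Pass@1 separately for each difficulty slice.
by_difficulty = {
    "easy":   {"p1": True, "p2": True, "p3": False},
    "medium": {"p4": True, "p5": False},
    "hard":   {"p6": False},
}
for level, results in by_difficulty.items():
    print(f"{level}: pass@1 = {pass_at_1(results):.2f}")
```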
Holistic Evaluation Framework
Multi-Scenario Assessment
| Scenario | Description | Metrics | Weight |
|---|---|---|---|
| Generation Only | Pure code synthesis | Pass@1, Pass@5 | 40% |
| Generation + Repair | Initial attempt + self-correction | Combined success | 25% |
| Understanding | Test prediction + execution | Accuracy | 20% |
| Full Pipeline | All tasks combined | Weighted average | 15% |
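Under the weights in the table above, an overall holistic score is a weighted average of the four scenario scores. A sketch, assuming per-scenario scores are fractions in [0, 1]; the dictionary keys are illustrative names for the table's rows:

```python
SCENARIO_WEIGHTS = {
    "generation_only": 0.40,
    "generation_plus_repair": 0.25,
    "understanding": 0.20,
    "full_pipeline": 0.15,
}

def holistic_score(scenario_scores: dict[str, float]) -> float:
    """Weighted average of per-scenario scores using the table's weights."""
    return sum(SCENARIO_WEIGHTS[name] * scenario_scores[name] for name in SCENARIO_WEIGHTS)

# Example
print(holistic_score({
    "generation_only": 0.72,
    "generation_plus_repair": 0.65,
    "understanding": 0.70,
    "full_pipeline": 0.60,
}))  # 0.288 + 0.1625 + 0.14 + 0.09 = 0.6805
```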
Real-World Alignment
LiveCodeBench's tasks mirror actual software development:
| Development Phase | LiveCodeBench Task | Skill Tested |
|---|---|---|
| Implementation | Code Generation | Algorithm design |
| Debugging | Self-Repair | Error identification |
| Testing | Test Output Prediction | Code comprehension |
| Review | Execution Tracing | Logic verification |
Significance and Impact
Research Applications
| Application | Purpose | Value |
|---|---|---|
| Model Development | Identifying capability gaps | Targeted improvement |
| Contamination Studies | Understanding data leakage | Evaluation integrity |
| Temporal Analysis | Tracking progress over time | Historical comparison |
| Task Transfer | Cross-task performance correlation | Capability understanding |
Industry Applications
- **Hiring Assessment**: Evaluating coding interview tools
- **IDE Integration**: Testing code completion systems
- **Educational Tools**: Assessing programming tutors
- **Code Review**: Evaluating automated review systems
- **DevOps**: Testing CI/CD automation capabilities
Challenges and Limitations
Current Limitations
| Limitation | Description | Impact |
|---|---|---|
| Platform Dependency | Relies on external contest platforms | Data availability |
| Language Coverage | Primarily Python, Java, C++ | Limited scope |
| Problem Types | Focus on algorithmic challenges | May miss practical tasks |
| Execution Cost | Requires sandboxed execution | Resource intensive |
| Update Frequency | Depends on contest schedules | Irregular additions |
Future Directions
1. **Expanded Problem Sources**: Integration with more platforms
2. **Enterprise Tasks**: Real-world software engineering problems
3. **Multi-language Support**: Broader programming language coverage
4. **Interactive Debugging**: Multi-turn problem-solving
5. **Team Collaboration**: Multi-agent coding scenarios
6. **Documentation Tasks**: Code documentation generation
Related Benchmarks
- HumanEval: Classic code generation benchmark
- MBPP: Python programming problems
- SWE-bench: Software engineering tasks
- CodeContests: Competitive programming dataset
- APPS: Algorithmic problem solving
- BigCodeBench: Large-scale code evaluation
- MultiPL-E: Multi-language code generation
Significance
LiveCodeBench represents a paradigm shift in code model evaluation, addressing the critical contamination problem that undermines many existing benchmarks. Its continuous update mechanism and holistic evaluation approach provide:
- Reliable contamination-free assessment
- Comprehensive capability evaluation beyond generation
- Temporal tracking for fair model comparison
- Real-world task alignment
- Sustainable evaluation framework for future models
The benchmark's ability to detect contamination and provide genuine performance metrics makes it essential for advancing code-capable AI systems.
See Also
- Code Generation
- Program Synthesis
- Automated Debugging
- Competitive Programming
- Software Engineering AI
- Benchmark Contamination
- Temporal Evaluation