| LiveCodeBench | |
|---|---|
| Overview | |
| Full name | Live Code Benchmark |
| Abbreviation | LCB |
| Description | A holistic and contamination-free evaluation benchmark for code LLMs with continuous updates |
| Release date | 2024-03 |
| Latest version | v6 |
| Benchmark updated | 2025-04 |
| Authors | Naman Jain, King Han, Alex Gu, et al. |
| Organization | UC Berkeley, MIT, Cornell |
| Technical Details | |
| Type | Code Generation, Code Understanding, Multi-task |
| Modality | Text (Code) |
| Task format | Code generation, self-repair, test output prediction, execution |
| Number of tasks | 1055+ (as of v6) |
| Total examples | 1055+ |
| Evaluation metric | Pass@1, Pass@k, Execution accuracy |
| Domains | Competitive programming, Software engineering |
| Languages | Multiple programming languages |
| Performance | |
| Human performance | Variable by task |
| Baseline | ~20-30% (smaller models) |
| SOTA score | 73.3% |
| SOTA model | DeepSeek R1-0528 |
| SOTA date | 2025-05 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | MIT |
LiveCodeBench is a comprehensive artificial intelligence benchmark designed to provide holistic and contamination-free evaluation of large language models for code-related tasks. Released in March 2024 by researchers from UC Berkeley, MIT, and other institutions, LiveCodeBench addresses the critical issue of data contamination in code benchmarks by continuously collecting new problems from competitive programming platforms including LeetCode, AtCoder, and CodeForces, ensuring models are evaluated on problems they haven't encountered during training.
LiveCodeBench takes a dynamic approach to code model evaluation. Unlike static benchmarks, whose test problems models may have seen during training, LiveCodeBench maintains temporal integrity by tagging each problem with its release date and evaluating a model only on problems released after that model's training cutoff.
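The cutoff-date filtering described above can be sketched as a simple date comparison. This is an illustrative sketch, not LiveCodeBench's actual implementation; the problem records and IDs below are hypothetical.

```python
from datetime import date

# Hypothetical problem records; LiveCodeBench tags each problem with a release date.
problems = [
    {"id": "lc-3012", "released": date(2024, 1, 14)},
    {"id": "cf-1923A", "released": date(2024, 2, 19)},
    {"id": "ac-abc340", "released": date(2024, 2, 10)},
]

def contamination_free(problems, model_cutoff):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p["released"] > model_cutoff]

# A model with a 2024-02-01 cutoff is evaluated only on later problems.
fresh = contamination_free(problems, model_cutoff=date(2024, 2, 1))
print([p["id"] for p in fresh])
```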
The development of LiveCodeBench was driven by several critical challenges in code model evaluation, chief among them the need for comprehensive assessment that mirrors real-world software development practices, including debugging, testing, and code understanding.
| Component | Description | Function |
|---|---|---|
| Problem Collector | Automated system for gathering new problems | Maintains fresh evaluation data |
| Temporal Tagger | Date-based problem classification | Ensures contamination-free evaluation |
| Multi-task Evaluator | Comprehensive task assessment framework | Holistic capability measurement |
| Execution Environment | Sandboxed code execution system | Validates functional correctness |
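The execution environment in the table above validates functional correctness by running generated code against test cases. A minimal sketch of such a harness, using a separate process with a timeout (a production system would add memory limits and filesystem isolation; `run_candidate` is a hypothetical helper, not part of LiveCodeBench):

```python
import subprocess
import sys
import tempfile

def run_candidate(code: str, stdin_data: str, timeout: float = 5.0) -> str:
    """Execute candidate code in a separate process and return its stdout.
    A real sandbox would also restrict memory, filesystem, and network access."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, path],
        input=stdin_data,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout

# Functional correctness: compare program output against the expected answer.
solution = "print(sum(int(x) for x in input().split()))"
assert run_candidate(solution, "1 2 3\n").strip() == "6"
```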
LiveCodeBench aggregates problems from three major competitive programming platforms:
| Platform | Problem Types | Update Frequency | Difficulty Range |
|---|---|---|---|
| LeetCode | Algorithm, data structures | Weekly contests | Easy to Hard |
| AtCoder | Mathematical, algorithmic | Regular contests | Beginner to Expert |
| CodeForces | Competitive programming | Bi-weekly rounds | Div 3 to Div 1 |
| Version | Release Date | Problem Count | Coverage Period |
|---|---|---|---|
| v1 | March 2024 | 400 | May 2023 - March 2024 |
| v4 | September 2024 | 713 | May 2023 - September 2024 |
| v5 | January 2025 | 880 | May 2023 - January 2025 |
| v6 | April 2025 | 1055 | May 2023 - April 2025 |
LiveCodeBench evaluates models across four primary task categories:
| Task | Description | Real-world Relevance | Evaluation Metric |
|---|---|---|---|
| Code Generation | Generate complete solutions from problem descriptions | Core programming skill | Pass@1, Pass@k |
| Self-Repair | Fix bugs in provided incorrect code | Debugging capability | Repair success rate |
| Test Output Prediction | Predict outputs for given test cases | Code understanding | Prediction accuracy |
| Code Execution | Trace and predict execution behavior | Runtime analysis | Execution accuracy |
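The Pass@k metric in the table above is usually computed with the unbiased estimator popularized by the HumanEval paper: given n generated samples of which c pass the tests, estimate the probability that at least one of k drawn samples passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    i.e. one minus the probability that all k drawn samples fail."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so some drawn sample must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 3 correct: pass@1 equals the raw success rate; pass@5 is far higher.
print(round(pass_at_k(10, 3, 1), 3))  # 0.3
print(round(pass_at_k(10, 3, 5), 3))
```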
Problems are categorized into three difficulty tiers:
| Level | Description | Typical Complexity | Success Rate Range |
|---|---|---|---|
| Easy | Basic algorithms and data structures | O(n), O(n log n) | 60-80% |
| Medium | Intermediate algorithms, optimization | O(n²), dynamic programming | 30-60% |
| Hard | Advanced algorithms, complex logic | Complex DP, graphs | 10-30% |
LiveCodeBench implements a sophisticated contamination prevention system:
| Strategy | Implementation | Benefit |
|---|---|---|
| Release Dating | Each problem tagged with publication date | Temporal tracking |
| Model Cutoff Dates | Track training data cutoffs for all models | Fair comparison |
| Dynamic Filtering | Only evaluate on post-cutoff problems | Contamination-free |
| Red Flagging | Mark potentially contaminated results | Transparency |
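The red-flagging strategy in the table above can be illustrated as a lookup of the model's documented training cutoff. The model names and cutoff dates below are hypothetical placeholders; real values come from each model's documentation.

```python
from datetime import date

# Hypothetical training cutoffs; real values come from model cards.
MODEL_CUTOFFS = {
    "model-a": date(2023, 9, 1),
    "model-b": date(2024, 4, 1),
}

def flag_result(model: str, problem_release: date) -> str:
    """Red-flag any score earned on a problem the model may have seen in training."""
    cutoff = MODEL_CUTOFFS[model]
    return "clean" if problem_release > cutoff else "potentially-contaminated"

print(flag_result("model-a", date(2024, 1, 5)))  # clean
print(flag_result("model-b", date(2024, 1, 5)))  # potentially-contaminated
```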
The benchmark identifies potential contamination by comparing each problem's release date against a model's documented training cutoff. Representative leaderboard results:
| Rank | Model | Pass@1 Overall | Code Generation | Self-Repair | Test Prediction |
|---|---|---|---|---|---|
| 1 | DeepSeek R1-0528 | 73.3% | 75% | 68% | 72% |
| 2 | GPT-4 Turbo | 71.2% | 73% | 66% | 70% |
| 3 | Claude 3 Opus | 70.8% | 71% | 65% | 74% |
| 4 | GPT-4o | 69.5% | 72% | 64% | 68% |
| 5 | Claude 3.5 Sonnet | 67.3% | 69% | 62% | 66% |
| 6 | DeepSeek-Coder-33B | 63.5% | 65% | 58% | 62% |
| 7 | Phind-CodeLlama-34B | 61.2% | 63% | 56% | 60% |
| 8 | CodeLlama-70B | 58.7% | 60% | 54% | 57% |
DeepSeek models currently lead the leaderboard across task categories. To run LiveCodeBench locally, clone the repository, install the package, and download the benchmark data:
```bash
git clone https://github.com/LiveCodeBench/LiveCodeBench
cd LiveCodeBench
pip install -e .
python scripts/download_data.py --version v6
```
```python
from livecodebench import LiveCodeBench

# Load the v6 release of the benchmark.
lcb = LiveCodeBench(version='v6')

# Evaluate a model on selected tasks within a date window.
results = lcb.evaluate(
    model='gpt-4',
    tasks=['code_generation', 'self_repair'],
    date_range=('2024-01-01', '2025-04-01')
)

# Contamination-free evaluation: only problems released after the model's cutoff.
clean_results = lcb.evaluate_clean(
    model='gpt-4',
    cutoff_date='2023-09-01'
)
```
```python
# Filter the problem set by difficulty, source platform, or release window.
easy_problems = lcb.filter_problems(difficulty='easy')
leetcode_only = lcb.filter_problems(source='leetcode')
recent_problems = lcb.filter_problems(
    start_date='2025-01-01',
    end_date='2025-04-01'
)
```
| Scenario | Description | Metrics | Weight |
|---|---|---|---|
| Generation Only | Pure code synthesis | Pass@1, Pass@5 | 40% |
| Generation + Repair | Initial attempt + self-correction | Combined success | 25% |
| Understanding | Test prediction + execution | Accuracy | 20% |
| Full Pipeline | All tasks combined | Weighted average | 15% |
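Under the weights in the table above, an overall score is a weighted average of per-scenario results. A minimal sketch (the per-scenario scores below are illustrative, not real leaderboard numbers):

```python
# Scenario weights from the evaluation table; they must sum to 1.
WEIGHTS = {
    "generation_only": 0.40,
    "generation_repair": 0.25,
    "understanding": 0.20,
    "full_pipeline": 0.15,
}

def weighted_score(scores: dict) -> float:
    """Combine per-scenario scores into a single weighted average."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Illustrative per-scenario results for one model.
scores = {
    "generation_only": 0.75,
    "generation_repair": 0.68,
    "understanding": 0.72,
    "full_pipeline": 0.70,
}
print(round(weighted_score(scores), 3))  # 0.719
```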
LiveCodeBench's tasks mirror actual software development:
| Development Phase | LiveCodeBench Task | Skill Tested |
|---|---|---|
| Implementation | Code Generation | Algorithm design |
| Debugging | Self-Repair | Error identification |
| Testing | Test Output Prediction | Code comprehension |
| Review | Execution Tracing | Logic verification |
| Application | Purpose | Value |
|---|---|---|
| Model Development | Identifying capability gaps | Targeted improvement |
| Contamination Studies | Understanding data leakage | Evaluation integrity |
| Temporal Analysis | Tracking progress over time | Historical comparison |
| Task Transfer | Cross-task performance correlation | Capability understanding |
| Limitation | Description | Impact |
|---|---|---|
| Platform Dependency | Relies on external contest platforms | Data availability |
| Language Coverage | Primarily Python, Java, C++ | Limited scope |
| Problem Types | Focus on algorithmic challenges | May miss practical tasks |
| Execution Cost | Requires sandboxed execution | Resource intensive |
| Update Frequency | Depends on contest schedules | Irregular additions |
1. **Expanded Problem Sources**: Integration with more platforms
2. **Enterprise Tasks**: Real-world software engineering problems
3. **Multi-language Support**: Broader programming language coverage
4. **Interactive Debugging**: Multi-turn problem-solving
5. **Team Collaboration**: Multi-agent coding scenarios
6. **Documentation Tasks**: Code documentation generation
LiveCodeBench represents a significant shift in code model evaluation, addressing the contamination problem that undermines many existing benchmarks through its continuous update mechanism and holistic, multi-task assessment.
The benchmark's ability to detect contamination and provide genuine performance metrics makes it essential for advancing code-capable AI systems.