LiveCodeBench
| LiveCodeBench | |
|---|---|
| Overview | |
| Full name | Live Code Benchmark |
| Abbreviation | LCB |
| Description | A holistic and contamination-free evaluation benchmark for code LLMs with continuous updates |
| Release date | 2024-03 |
| Latest version | v6 |
| Benchmark updated | 2025-04 |
| Authors | Naman Jain, King Han, Alex Gu, et al. |
| Organization | UC Berkeley, MIT, Ion Stoica Lab |
| Technical Details | |
| Type | Code Generation, Code Understanding, Multi-task |
| Modality | Text (Code) |
| Task format | Code generation, self-repair, test output prediction, execution |
| Number of tasks | 1055+ (as of v6) |
| Total examples | 1055+ |
| Evaluation metric | Pass@1, Pass@k, Execution accuracy |
| Domains | Competitive programming, Software engineering |
| Languages | Multiple programming languages |
| Performance | |
| Human performance | Variable by task |
| Baseline | ~20-30% (smaller models) |
| SOTA score | 73.3% |
| SOTA model | DeepSeek R1-0528 |
| SOTA date | 2025-05 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | MIT |
LiveCodeBench is a comprehensive artificial intelligence benchmark designed to provide holistic and contamination-free evaluation of large language models for code-related tasks. Released in March 2024 by researchers from UC Berkeley, MIT, and other institutions, LiveCodeBench addresses the critical issue of data contamination in code benchmarks by continuously collecting new problems from competitive programming platforms including LeetCode, AtCoder, and CodeForces, ensuring models are evaluated on problems they haven't encountered during training.
Overview
LiveCodeBench reworks code model evaluation as a dynamic, continuously updated benchmark designed to prevent test-set contamination. Unlike static benchmarks, where models may have seen test problems during training, LiveCodeBench maintains temporal integrity by tagging problems with release dates and evaluating models only on problems released after their training cutoff dates.
Motivation
The development of LiveCodeBench was driven by several critical challenges in code model evaluation:
- Widespread contamination in existing static benchmarks
- Limited scope of evaluation focusing only on code generation
- Lack of real-world software engineering task coverage
- Need for holistic assessment of coding capabilities
- Absence of temporal tracking for fair model comparison
The benchmark specifically addresses the need for comprehensive evaluation that mirrors real-world software development practices, including debugging, testing, and code understanding.
Technical Architecture
Core Components
| Component | Description | Function |
|---|---|---|
| Problem Collector | Automated system for gathering new problems | Maintains fresh evaluation data |
| Temporal Tagger | Date-based problem classification | Ensures contamination-free evaluation |
| Multi-task Evaluator | Comprehensive task assessment framework | Holistic capability measurement |
| Execution Environment | Sandboxed code execution system | Validates functional correctness |
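The execution component boils down to running a candidate solution against the problem's test cases and comparing outputs. The sketch below illustrates that idea under simple assumptions (a Python solution file judged on stdin/stdout pairs); the function names, timeout, and exact-match rule are illustrative, not LiveCodeBench's actual harness, and a bare subprocess call is not a real sandbox; the benchmark's environment adds isolation and resource limits.

```python
import subprocess

def run_solution(solution_path: str, stdin_data: str, timeout_s: int = 10) -> str | None:
    """Run a candidate Python solution in a subprocess and capture its stdout."""
    try:
        proc = subprocess.run(
            ["python3", solution_path],
            input=stdin_data,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None  # treat timeouts as failures
    if proc.returncode != 0:
        return None  # runtime error
    return proc.stdout

def passes_all_tests(solution_path: str, test_cases: list[tuple[str, str]]) -> bool:
    """A solution passes only if every test's output matches the expected output
    (compared here after stripping trailing whitespace)."""
    for stdin_data, expected in test_cases:
        out = run_solution(solution_path, stdin_data)
        if out is None or out.strip() != expected.strip():
            return False
    return True
```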
Problem Sources
LiveCodeBench aggregates problems from three major competitive programming platforms:
| Platform | Problem Types | Update Frequency | Difficulty Range |
|---|---|---|---|
| LeetCode | Algorithm, data structures | Weekly contests | Easy to Hard |
| AtCoder | Mathematical, algorithmic | Regular contests | Beginner to Expert |
| CodeForces | Competitive programming | Bi-weekly rounds | Div 3 to Div 1 |
Dataset Evolution
| Version | Release Date | Problem Count | Coverage Period |
|---|---|---|---|
| v1 | March 2024 | 400 | May 2023 - March 2024 |
| v4 | September 2024 | 713 | May 2023 - September 2024 |
| v5 | January 2025 | 880 | May 2023 - January 2025 |
| v6 | April 2025 | 1055 | May 2023 - April 2025 |
Evaluation Tasks
Task Categories
LiveCodeBench evaluates models across four primary task categories:
| Task | Description | Real-world Relevance | Evaluation Metric |
|---|---|---|---|
| Code Generation | Generate complete solutions from problem descriptions | Core programming skill | Pass@1, Pass@k |
| Self-Repair | Fix bugs in provided incorrect code | Debugging capability | Repair success rate |
| Test Output Prediction | Predict outputs for given test cases | Code understanding | Prediction accuracy |
| Code Execution | Trace and predict execution behavior | Runtime analysis | Execution accuracy |
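The Pass@1 and Pass@k metrics listed above are conventionally computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021): with n sampled solutions per problem, c of which pass all tests, pass@k = 1 - C(n-c, k) / C(n, k). A short sketch of that formula (not code taken from the LiveCodeBench repository):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from the n generations is correct, given c of the n passed all tests."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 3 of which are correct
print(pass_at_k(n=10, c=3, k=1))  # 0.3
print(pass_at_k(n=10, c=3, k=5))  # ~0.917
```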
Difficulty Levels
Problems are categorized into three difficulty tiers:
| Level | Description | Typical Complexity | Success Rate Range |
|---|---|---|---|
| Easy | Basic algorithms and data structures | O(n), O(n log n) | 60-80% |
| Medium | Intermediate algorithms, optimization | O(n²), dynamic programming | 30-60% |
| Hard | Advanced algorithms, complex logic | Complex DP, graphs | 10-30% |
Contamination Prevention
Temporal Windowing
LiveCodeBench implements a sophisticated contamination prevention system:
| Strategy | Implementation | Benefit |
|---|---|---|
| Release Dating | Each problem tagged with publication date | Temporal tracking |
| Model Cutoff Dates | Track training data cutoffs for all models | Fair comparison |
| Dynamic Filtering | Only evaluate on post-cutoff problems | Contamination-free |
| Red Flagging | Mark potentially contaminated results | Transparency |
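In practice, the release-date and cutoff-date bookkeeping above reduces to a simple filter: a problem counts toward a model's contamination-free score only if it was published after that model's training cutoff. A minimal sketch of that filter follows; the Problem record, field names, and example IDs are illustrative assumptions, not LiveCodeBench's internal schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    problem_id: str
    source: str        # e.g. "leetcode", "atcoder", "codeforces"
    difficulty: str    # "easy", "medium", or "hard"
    release_date: date

def contamination_free(problems: list[Problem], model_cutoff: date) -> list[Problem]:
    """Keep only problems released strictly after the model's training-data cutoff."""
    return [p for p in problems if p.release_date > model_cutoff]

# Example: a model trained on data up to 2023-09-30 is scored only on later problems.
problems = [
    Problem("lc-3001", "leetcode", "medium", date(2023, 8, 12)),   # hypothetical IDs
    Problem("abc-321-d", "atcoder", "easy", date(2023, 10, 7)),
]
eligible = contamination_free(problems, model_cutoff=date(2023, 9, 30))
# -> only the post-cutoff AtCoder problem remains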
Contamination Detection
The benchmark can identify potential contamination through:
- Performance drops on newer problems
- Anomalous accuracy patterns
- Temporal performance analysis
- Cross-reference with model training dates
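One simple version of this temporal analysis is to split a model's per-problem results at its training cutoff and flag an unusually large accuracy gap. The sketch below is a hypothetical illustration; the result format and the 10-point threshold are assumptions, not the benchmark's actual red-flagging rule.

```python
from datetime import date

def flag_contamination(results: list[dict], cutoff: date, max_gap: float = 0.10) -> bool:
    """results: [{'release_date': date, 'passed': bool}, ...]
    Return True if pre-cutoff accuracy exceeds post-cutoff accuracy by more than max_gap."""
    pre = [r["passed"] for r in results if r["release_date"] <= cutoff]
    post = [r["passed"] for r in results if r["release_date"] > cutoff]
    if not pre or not post:
        return False  # not enough problems on one side of the cutoff to compare
    pre_acc = sum(pre) / len(pre)
    post_acc = sum(post) / len(post)
    return (pre_acc - post_acc) > max_gap
```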
Performance Analysis
Current Leaderboard (2025)
| Rank | Model | Pass@1 Overall | Code Generation | Self-Repair | Test Prediction |
|---|---|---|---|---|---|
| 1 | DeepSeek R1-0528 | 73.3% | 75% | 68% | 72% |
| 2 | GPT-4 Turbo | 71.2% | 73% | 66% | 70% |
| 3 | Claude 3 Opus | 70.8% | 71% | 65% | 74% |
| 4 | GPT-4o | 69.5% | 72% | 64% | 68% |
| 5 | Claude 3.5 Sonnet | 67.3% | 69% | 62% | 66% |
| 6 | DeepSeek-Coder-33B | 63.5% | 65% | 58% | 62% |
| 7 | Phind-CodeLlama-34B | 61.2% | 63% | 56% | 60% |
| 8 | CodeLlama-70B | 58.7% | 60% | 54% | 57% |
Performance Insights
Task-Specific Observations
- **Code Generation**: Models excel at standard algorithmic problems
- **Self-Repair**: Significant performance drop compared to generation
- **Test Prediction**: Claude models show particular strength
- **Execution Tracing**: Most challenging task across all models
Contamination Effects
DeepSeek models exhibit a notable performance pattern:
- Pre-September 2023 problems: ~80% accuracy
- Post-September 2023 problems: ~63% accuracy
- A gap of this size suggests the earlier problems may have leaked into training data
Implementation
Installation and Setup
```bash
# Clone the repository
git clone https://github.com/LiveCodeBench/LiveCodeBench
cd LiveCodeBench

# Install dependencies
pip install -e .

# Download the latest dataset
python scripts/download_data.py --version v6
```
Running Evaluations
```python
# Basic evaluation
from livecodebench import LiveCodeBench

# Initialize benchmark
lcb = LiveCodeBench(version='v6')

# Evaluate model on recent problems
results = lcb.evaluate(
    model='gpt-4',
    tasks=['code_generation', 'self_repair'],
    date_range=('2024-01-01', '2025-04-01')
)

# Get contamination-free results
clean_results = lcb.evaluate_clean(
    model='gpt-4',
    cutoff_date='2023-09-01'
)
```
Custom Problem Filtering
```python
# Filter by difficulty
easy_problems = lcb.filter_problems(difficulty='easy')

# Filter by platform
leetcode_only = lcb.filter_problems(source='leetcode')

# Temporal filtering
recent_problems = lcb.filter_problems(
    start_date='2025-01-01',
    end_date='2025-04-01'
)
```
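Once a problem subset has been selected, reporting Pass@1 for each slice is just an average of per-problem outcomes. The sketch below assumes per-problem results are available as booleans; the dictionary layout and problem IDs are illustrative, not the library's return format.

```python
def pass_at_1(per_problem_passed: dict[str, bool]) -> float:
    """Fraction of problems whose first generated solution passed all tests."""
    if not per_problem_passed:
        return 0.0
    return sum(per_problem_passed.values()) / len(per_problem_passed)

# Example: report Pass@1 separately for each difficulty slice.
by_difficulty = {
    "easy":   {"p1": True, "p2": True, "p3": False},
    "medium": {"p4": True, "p5": False},
    "hard":   {"p6": False},
}
for level, results in by_difficulty.items():
    print(f"{level}: pass@1 = {pass_at_1(results):.2f}")
```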
Holistic Evaluation Framework
Multi-Scenario Assessment
| Scenario | Description | Metrics | Weight |
|---|---|---|---|
| Generation Only | Pure code synthesis | Pass@1, Pass@5 | 40% |
| Generation + Repair | Initial attempt + self-correction | Combined success | 25% |
| Understanding | Test prediction + execution | Accuracy | 20% |
| Full Pipeline | All tasks combined | Weighted average | 15% |
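Under the weights in the table above, an overall holistic score is a weighted average of the four scenario scores. A sketch, assuming per-scenario scores are fractions in [0, 1]; the dictionary keys are illustrative names for the table's rows:

```python
SCENARIO_WEIGHTS = {
    "generation_only": 0.40,
    "generation_plus_repair": 0.25,
    "understanding": 0.20,
    "full_pipeline": 0.15,
}

def holistic_score(scenario_scores: dict[str, float]) -> float:
    """Weighted average of per-scenario scores using the table's weights."""
    return sum(SCENARIO_WEIGHTS[name] * scenario_scores[name] for name in SCENARIO_WEIGHTS)

# Example
print(holistic_score({
    "generation_only": 0.72,
    "generation_plus_repair": 0.65,
    "understanding": 0.70,
    "full_pipeline": 0.60,
}))  # 0.288 + 0.1625 + 0.14 + 0.09 = 0.6805
```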
Real-World Alignment
LiveCodeBench's tasks mirror actual software development:
| Development Phase | LiveCodeBench Task | Skill Tested |
|---|---|---|
| Implementation | Code Generation | Algorithm design |
| Debugging | Self-Repair | Error identification |
| Testing | Test Output Prediction | Code comprehension |
| Review | Execution Tracing | Logic verification |
Significance and Impact
Research Applications
| Application | Purpose | Value |
|---|---|---|
| Model Development | Identifying capability gaps | Targeted improvement |
| Contamination Studies | Understanding data leakage | Evaluation integrity |
| Temporal Analysis | Tracking progress over time | Historical comparison |
| Task Transfer | Cross-task performance correlation | Capability understanding |
Industry Applications
- **Hiring Assessment**: Evaluating coding interview tools
- **IDE Integration**: Testing code completion systems
- **Educational Tools**: Assessing programming tutors
- **Code Review**: Evaluating automated review systems
- **DevOps**: Testing CI/CD automation capabilities
Challenges and Limitations
Current Limitations
| Limitation | Description | Impact |
|---|---|---|
| Platform Dependency | Relies on external contest platforms | Data availability |
| Language Coverage | Primarily Python, Java, C++ | Limited scope |
| Problem Types | Focus on algorithmic challenges | May miss practical tasks |
| Execution Cost | Requires sandboxed execution | Resource intensive |
| Update Frequency | Depends on contest schedules | Irregular additions |
Future Directions
1. **Expanded Problem Sources**: Integration with more platforms
2. **Enterprise Tasks**: Real-world software engineering problems
3. **Multi-language Support**: Broader programming language coverage
4. **Interactive Debugging**: Multi-turn problem-solving
5. **Team Collaboration**: Multi-agent coding scenarios
6. **Documentation Tasks**: Code documentation generation
Related Benchmarks
- HumanEval: Classic code generation benchmark
- MBPP: Python programming problems
- SWE-bench: Software engineering tasks
- CodeContests: Competitive programming dataset
- APPS: Algorithmic problem solving
- BigCodeBench: Large-scale code evaluation
- MultiPL-E: Multi-language code generation
Significance
LiveCodeBench represents a paradigm shift in code model evaluation, addressing the critical contamination problem that undermines many existing benchmarks. Its continuous update mechanism and holistic evaluation approach provide:
- Reliable contamination-free assessment
- Comprehensive capability evaluation beyond generation
- Temporal tracking for fair model comparison
- Real-world task alignment
- Sustainable evaluation framework for future models
The benchmark's ability to detect contamination and provide genuine performance metrics makes it essential for advancing code-capable AI systems.
See Also
- Code Generation
- Program Synthesis
- Automated Debugging
- Competitive Programming
- Software Engineering AI
- Benchmark Contamination
- Temporal Evaluation