SWE-bench Verified
| SWE-bench Verified | |
|---|---|
| Overview | |
| Full name | Software Engineering Benchmark - Verified |
| Abbreviation | SWE-bench V |
| Description | A human-validated subset of real-world GitHub issues for evaluating AI models' autonomous software engineering capabilities |
| Release date | 2024-08-13 |
| Latest version | 1.0 |
| Benchmark updated | 2024-08 |
| Authors | Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan |
| Organization | Princeton University, OpenAI |
| Technical Details | |
| Type | Code Generation, Bug Fixing, Software Engineering |
| Modality | Text, Code |
| Task format | GitHub issue resolution |
| Number of tasks | 500 |
| Total examples | 500 verified issues |
| Evaluation metric | Resolve rate, Test pass rate |
| Domains | Web frameworks, Scientific computing, Documentation, Machine learning |
| Languages | Python |
| Performance | |
| Human performance | 100% (verified solvable) |
| Baseline | 33.2% (GPT-4o at launch) |
| SOTA score | 74.5% |
| SOTA model | Claude Opus 4.1 |
| SOTA date | 2025 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | MIT |
| Predecessor | SWE-bench |
SWE-bench Verified is a rigorously validated software engineering benchmark designed to evaluate artificial intelligence models' ability to autonomously resolve real-world GitHub issues. Released on August 13, 2024, through a collaboration between Princeton University and OpenAI Preparedness, SWE-bench Verified consists of 500 carefully selected and human-verified problems from the original SWE-bench dataset. Each problem requires an AI model to understand a bug report or feature request, navigate a complex codebase, write solution code, and ensure all tests pass, mirroring the complete workflow of a professional software engineer.
Overview
SWE-bench Verified addresses critical limitations in the original SWE-bench dataset by ensuring all included problems are solvable and fairly evaluated. The benchmark tests whether AI systems can perform end-to-end software engineering tasks: from understanding issue descriptions to implementing working solutions that pass comprehensive test suites[1].
Motivation
The creation of SWE-bench Verified was motivated by several factors:
- **Quality Issues**: 68.3% of original SWE-bench samples had problems with underspecified descriptions or unfair test criteria
- **Evaluation Reliability**: Need for accurate measurement of AI coding capabilities
- **Real-World Relevance**: Focus on actual software engineering tasks rather than synthetic problems
- **Autonomous Capability**: Testing complete problem-solving rather than code completion
Problem Characteristics
Repository Distribution
SWE-bench Verified draws from 12 popular Python open-source projects:
| Repository | Description | Approximate % of Dataset |
|---|---|---|
| Django | Web framework | ~45% |
| SymPy | Symbolic mathematics | ~15% |
| Sphinx | Documentation generator | ~10% |
| Matplotlib | Plotting library | ~8% |
| Scikit-learn | Machine learning library | ~7% |
| Flask | Micro web framework | ~5% |
| Requests | HTTP library | ~3% |
| Pytest | Testing framework | ~2% |
| Astropy | Astronomy tools | ~2% |
| Xarray | N-D labeled arrays | ~1% |
| Seaborn | Statistical visualization | ~1% |
| Pylint | Code analysis | ~1% |
The five largest repositories (Django, SymPy, Sphinx, Matplotlib, Scikit-learn) account for over 80% of the benchmark[2].
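The distribution above can be checked directly against the public dataset. The following sketch assumes the Hugging Face release with a `test` split and a `repo` field on each instance:
```python
# Count how many verified instances come from each repository.
from collections import Counter

from datasets import load_dataset

dataset = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
repo_counts = Counter(instance["repo"] for instance in dataset)  # e.g. "django/django"

for repo, count in repo_counts.most_common():
    print(f"{repo:25s} {count:4d} ({100 * count / len(dataset):.1f}%)")
```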
Problem Categories
| Category | Description | Example Tasks |
|---|---|---|
| Bug Fixes | Resolving reported bugs | Fixing edge cases, correcting logic errors |
| Feature Implementation | Adding new functionality | Implementing requested features |
| Performance Issues | Optimization problems | Improving efficiency, reducing memory usage |
| Documentation | Documentation-related issues | Updating docstrings, fixing examples |
| Compatibility | Cross-version compatibility | Python version compatibility fixes |
| Testing | Test-related issues | Fixing test failures, adding test coverage |
Difficulty Distribution
| Difficulty | Number of Problems | Time to Solve | Description |
|---|---|---|---|
| Easy | 196 | <15 minutes | Simple bug fixes, minor changes |
| Medium | 259 | 15-60 minutes | Moderate complexity, multiple file changes |
| Hard | 45 | >1 hour | Complex issues, architectural changes |
Human Validation Process
Validation Methodology
The validation process involved rigorous human review:
| Step | Process | Outcome |
|---|---|---|
| Annotator Selection | 93 Python-experienced developers recruited | Expert review team assembled |
| Review Protocol | Detailed rubric for evaluation | Consistent assessment criteria |
| Triple Review | Each problem reviewed by 3 independent annotators | Multiple perspectives |
| Quality Criteria | Assessed clarity, test appropriateness, solvability | Comprehensive evaluation |
| Filtering | Problems failing criteria removed | 68.3% filtered out |
Validation Criteria
| Criterion | Description | Failure Rate |
|---|---|---|
| Problem Clarity | Issue description must be unambiguous | 38.3% |
| Test Fairness | Tests must not incorrectly fail valid solutions | 61.1% |
| Solvability | Problem must be solvable with provided information | Variable |
| Reproducibility | Issue must be reproducible in test environment | Variable |
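The filtering step can be pictured as a threshold over ensembled annotator labels. The sketch below is illustrative only: the severity scale, median ensembling, and cut-off are assumptions, not the official annotation tooling.
```python
# Illustrative filter: keep a sample only if neither quality criterion is
# flagged as a serious problem by the ensembled annotator labels (0-3 scale,
# 0 = no issue, 3 = severe issue). Scale, median ensembling, and threshold
# are assumptions for illustration.
from statistics import median

def keep_sample(clarity_labels: list[int], test_fairness_labels: list[int],
                threshold: int = 2) -> bool:
    return (median(clarity_labels) < threshold
            and median(test_fairness_labels) < threshold)

# Two of three annotators flag the tests as unfair, so the sample is dropped.
print(keep_sample(clarity_labels=[0, 1, 0], test_fairness_labels=[2, 3, 1]))  # False
```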
Evaluation Methodology
Task Structure
For each problem, an AI agent receives:
- **Codebase**: Complete repository at specific commit
- **Issue Description**: Original GitHub issue text
- **Test Environment**: Docker container with dependencies
The agent must:
1. Understand the issue description
2. Explore and understand the codebase
3. Identify relevant files and functions
4. Implement a solution
5. Verify the solution passes all tests
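In practice, these inputs are exposed as fields on each dataset instance. A minimal look at one task, assuming the published Hugging Face schema (field names may differ between dataset versions):
```python
# Inspect the fields an agent receives for a single verified task.
from datasets import load_dataset

dataset = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
task = dataset[0]

print(task["instance_id"])        # identifier, e.g. "<owner>__<repo>-<issue number>"
print(task["repo"])               # repository to check out
print(task["base_commit"])        # commit the agent starts from
print(task["problem_statement"])  # the original issue text
print(task["FAIL_TO_PASS"])       # tests that must go from failing to passing
print(task["PASS_TO_PASS"])       # tests that must keep passing
```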
Test Categories
| Test Type | Description | Purpose |
|---|---|---|
| FAIL_TO_PASS | Tests that should fail before fix, pass after | Verify issue is resolved |
| PASS_TO_PASS | Tests that should remain passing | Ensure no regression |
| Additional Tests | Other repository tests | Comprehensive validation |
Both FAIL_TO_PASS and PASS_TO_PASS tests must succeed for a solution to be considered correct.
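That acceptance rule can be expressed directly in code; the sketch below assumes a mapping from test identifiers to pass/fail outcomes produced by some test harness:
```python
# A patch resolves an issue only if every FAIL_TO_PASS test now passes and
# every PASS_TO_PASS test still passes.
def is_resolved(test_outcomes: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    return (all(test_outcomes.get(t, False) for t in fail_to_pass)
            and all(test_outcomes.get(t, False) for t in pass_to_pass))

outcomes = {
    "tests/test_fix.py::test_edge_case": True,            # previously failing test
    "tests/test_core.py::test_existing_behavior": True,   # regression check
}
print(is_resolved(outcomes,
                  fail_to_pass=["tests/test_fix.py::test_edge_case"],
                  pass_to_pass=["tests/test_core.py::test_existing_behavior"]))  # True
```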
Scoring Metrics
| Metric | Description | Calculation |
|---|---|---|
| Resolve Rate | Percentage of issues fully resolved | (Resolved issues / Total issues) × 100% |
| Pass@k | Success rate within k attempts | Percentage resolved in k tries |
| Test Pass Rate | Individual test success rate | (Passing tests / Total tests) × 100% |
| Partial Credit | Credit for partial solutions | Based on test subsets passed |
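For reference, the two headline metrics can be computed as follows. The pass@k formula is the standard unbiased estimator, 1 - C(n-c, k)/C(n, k); its use here is illustrative rather than prescribed by the benchmark:
```python
from math import comb

def resolve_rate(resolved: list[bool]) -> float:
    """Percentage of issues whose patches pass all required tests."""
    return 100.0 * sum(resolved) / len(resolved)

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n attempts per problem, c of which resolved it."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(resolve_rate([True, False, True, True]))  # 75.0
print(pass_at_k(n=10, c=3, k=1))                # 0.3
```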
Performance Analysis
Current Leaderboard (2025)
| Rank | Model | Resolve Rate | Organization | Cost/Test | Time (avg) |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.1 | 74.5% | Anthropic | ~$2.50 | ~600s |
| 2 | Claude Sonnet 4 | 72.7% (80.2% with parallel compute) | Anthropic | $1.24 | 426s |
| 3 | Claude Opus 4 | 72.5% (79.4% with parallel compute) | Anthropic | ~$2.50 | ~600s |
| 4 | Claude 3.7 Sonnet (scaffolded) | 70.2% | Anthropic | ~$1.50 | ~500s |
| 5 | Augment Agent (Claude 3.7 + o1) | 65.4% | Augment Code | Variable | Variable |
| 6 | Gemini 2.5 Pro | 63.8% | Google DeepMind | ~$2.00 | ~500s |
| 7 | Grok 4 | 58.6% | xAI | ~$2.00 | ~550s |
| 8 | DeepSeek R1-0528 | 57.6% | DeepSeek | ~$0.50 | ~400s |
| 9 | GPT-4.1 | 54.6% | OpenAI | ~$2.00 | ~400s |
| 10 | OpenAI o3 | 49.8% | OpenAI | ~$3.00 | ~700s |
| 11 | Claude 3.5 Sonnet (upgraded) | 49.0% | Anthropic | ~$1.00 | ~450s |
| 12 | GPT-4o | 33.2% | OpenAI | ~$1.50 | ~350s |
Historical Progress
| Date | Best Score | Model | Key Innovation |
|---|---|---|---|
| Aug 2024 | 33.2% | GPT-4o | Baseline at launch |
| Oct 2024 | 45% | Previous SOTA | Improved scaffolding |
| Dec 2024 | 49% | Claude 3.5 Sonnet | Better code understanding |
| Feb 2025 | 57.6% | DeepSeek R1 | Reasoning improvements |
| May 2025 | 72.7% | Claude Sonnet 4 | Advanced problem solving |
Agent Architectures
Successful Approaches
| Approach | Description | Performance Impact |
|---|---|---|
| Agentless | Direct patch generation without complex agents | Baseline approach |
| SWE-agent | Interactive agent with specialized tools | +15-20% over baseline |
| Hybrid Models | Combining multiple models (for example Claude + o1) | +20-25% improvement |
| Custom Scaffolding | Task-specific prompting and tooling | +8-10% improvement |
| Multi-attempt | Multiple solution attempts with verification | +5-10% improvement |
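The multi-attempt row above follows a simple generate-and-verify loop. In the sketch below, `generate_patch` and `run_fail_to_pass_tests` are hypothetical stand-ins for a model call and a sandboxed test run:
```python
# Generate several candidate patches and return the first one that makes the
# task's FAIL_TO_PASS tests pass locally; otherwise give up.
def solve_with_retries(task, generate_patch, run_fail_to_pass_tests,
                       max_attempts: int = 3):
    for attempt in range(max_attempts):
        patch = generate_patch(task, attempt=attempt)   # hypothetical model call
        if run_fail_to_pass_tests(task, patch):         # hypothetical sandboxed run
            return patch
    return None
```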
Tool Usage
Successful agents typically use:
| Tool | Function | Usage Frequency |
|---|---|---|
| File Browser | Navigate repository structure | High |
| Code Search | Find relevant code sections | High |
| File Editor | Modify source files | High |
| Test Runner | Execute tests | Medium |
| Debugger | Debug failing tests | Low |
| Documentation Reader | Access project docs | Medium |
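A scaffold typically exposes such tools to the model as plain functions. The interface below is a hypothetical illustration, not the SWE-agent API:
```python
import subprocess
from pathlib import Path

def search_code(root: str, query: str) -> list[str]:
    """Code search: return Python files whose text contains the query."""
    return [str(p) for p in Path(root).rglob("*.py")
            if query in p.read_text(errors="ignore")]

def edit_file(path: str, old: str, new: str) -> None:
    """File editor: replace the first occurrence of a snippet in a file."""
    text = Path(path).read_text()
    Path(path).write_text(text.replace(old, new, 1))

def run_test(repo_dir: str, test_id: str) -> bool:
    """Test runner: execute a single pytest node and report success."""
    result = subprocess.run(["python", "-m", "pytest", test_id, "-x", "-q"],
                            cwd=repo_dir, capture_output=True)
    return result.returncode == 0
```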
Skills Evaluated
Core Competencies
| Skill | Description | Importance |
|---|---|---|
| Code Comprehension | Understanding existing codebases | Critical |
| Debugging | Identifying root causes of issues | Critical |
| Implementation | Writing correct solution code | Critical |
| Testing | Understanding and passing test suites | High |
| Navigation | Finding relevant files and functions | High |
| Problem Analysis | Understanding issue descriptions | High |
Technical Skills
| Area | Specific Skills | Frequency |
|---|---|---|
| Python Proficiency | Syntax, idioms, standard library | Very High |
| Framework Knowledge | Django, Flask, etc. | High |
| Algorithm Design | Efficient problem solving | Medium |
| API Understanding | Library interfaces | High |
| Error Handling | Exception management | Medium |
| Performance Optimization | Efficiency improvements | Low |
Limitations and Criticisms
Current Limitations
| Limitation | Description | Impact |
|---|---|---|
| Python Only | Limited to Python repositories | Reduced generalizability |
| Bug Fix Focus | ~80% bug fixes vs. feature development | Skewed task distribution |
| Repository Bias | Django dominates dataset | Overrepresentation |
| Short Tasks | Most solvable in <1 hour | Limited complexity |
| Structured Descriptions | More LLM-friendly than typical issues | Easier than real-world |
Methodological Concerns
1. **Data Contamination Risk**: Public GitHub issues may appear in training data
2. **Test Suite Quality**: Some test suites may not fully specify requirements
3. **Limited Scope**: Doesn't test the full software development lifecycle
4. **Single Language**: Python-only limits broader applicability
Related Benchmarks
Comparison with Other Benchmarks
| Benchmark | Focus | Size | Key Difference |
|---|---|---|---|
| SWE-bench (original) | Unfiltered GitHub issues | 2,294 | Includes unsolvable problems |
| SWE-bench Verified | Validated GitHub issues | 500 | Human-verified quality |
| HumanEval | Function implementation | 164 | Synthetic problems |
| MBPP | Basic programming | 974 | Simple tasks |
| CodeContests | Competitive programming | 10,000+ | Algorithm focus |
| Multi-SWE-bench | Multilingual issues | Variable | Multiple languages |
Complementary Evaluations
- **SWE-bench Multimodal**: Includes visual elements in issues
- **SWE-Lancer**: Focus on freelance-style tasks
- **SWE-rebench**: Continuous evaluation with new issues
- **CodeXGLUE**: Broader code intelligence tasks
Implementation and Usage
Setup and Installation
```bash
# Install SWE-bench and the Hugging Face datasets library
pip install swebench datasets

# Download the verified dataset (runs a one-line Python snippet)
python -c "from datasets import load_dataset; load_dataset('princeton-nlp/SWE-bench_Verified')"

# Set up the evaluation environment
docker pull swebench/evaluation:latest
```
Evaluation Example
```python
# Example evaluation loop (illustrative): `model.solve_issue`, `run_tests`,
# and `calculate_metrics` are placeholders for a model call, a sandboxed
# test run, and metric aggregation, not part of the swebench package.

def run_evaluation(model, dataset):
    results = []
    for problem in dataset:
        # Model attempts to solve the issue
        solution = model.solve_issue(
            repo=problem['repo'],
            issue=problem['problem_statement'],
            base_commit=problem['base_commit'],
        )
        # Run the FAIL_TO_PASS and PASS_TO_PASS tests to verify the solution
        test_results = run_tests(solution, problem['FAIL_TO_PASS'], problem['PASS_TO_PASS'])
        results.append(test_results)
    return calculate_metrics(results)
```
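The loop above leaves open how patches are scored. In the official workflow, a model's patches are written to a predictions file and passed to the harness in the SWE-bench repository; the sketch below shows the commonly used record format (key names follow SWE-bench conventions at the time of writing and may change between versions):
```python
# Write model patches in the prediction format consumed by the evaluation harness.
import json

predictions = [
    {
        "instance_id": "django__django-11099",        # illustrative instance ID
        "model_name_or_path": "my-model",              # free-form model label
        "model_patch": "diff --git a/... b/...\n...",  # unified diff produced by the model
    }
]

with open("preds.json", "w") as f:
    json.dump(predictions, f)
```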
Impact and Applications
Research Contributions
| Area | Contribution | Impact |
|---|---|---|
| Agent Development | Driving sophisticated coding agents | Rapid progress in AI coding |
| Evaluation Standards | Setting rigorous benchmarking standards | Improved measurement |
| Tool Development | Inspiring new developer tools | Practical applications |
| Training Methods | New approaches to code generation | Better models |
Industry Applications
1. **Automated Bug Fixing**: Production systems for automatic issue resolution
2. **Code Review Assistance**: AI-powered code review tools
3. **Developer Productivity**: IDE integrations for issue solving
4. **Training Data**: Improving code generation models
5. **Hiring Assessment**: Evaluating human developer skills
Future Directions
Planned Improvements
| Enhancement | Description | Timeline |
|---|---|---|
| Language Expansion | Support for Java, JavaScript, Go | 2025-2026 |
| Feature Development | More feature implementation tasks | 2025 |
| Longer Tasks | Multi-day software projects | 2026 |
| Multi-file Edits | Complex refactoring tasks | 2025 |
| Real-time Updates | Continuous benchmark updates | Ongoing |
Research Opportunities
1. **Multi-Agent Collaboration**: Teams of AI agents working together
2. **Human-AI Collaboration**: Hybrid approaches to problem solving
3. **Cross-Repository Learning**: Transfer learning across codebases
4. **Explanation Generation**: Understanding AI problem-solving strategies
5. **Error Prevention**: Proactive bug detection before deployment
Significance
SWE-bench Verified represents a crucial milestone in evaluating AI's practical software engineering capabilities. By focusing on real-world problems from production codebases, it provides a realistic assessment of whether AI systems can perform the day-to-day tasks of professional developers. The benchmark's rapid adoption and the intense competition to improve scores demonstrate its value to both research and industry.
The progression from 33% to over 70% resolution rates in less than a year shows remarkable progress in AI coding capabilities. However, the remaining gap to human performance (100% by definition, as problems are human-verified as solvable) indicates substantial room for improvement. As models continue to advance, SWE-bench Verified serves as a critical measure of progress toward truly autonomous software engineering.
See Also
- SWE-bench
- Software Engineering
- GitHub
- Code Generation
- AI Benchmarks
- Claude
- OpenAI
- Automated Programming
References
- ↑ OpenAI. (2024). "Introducing SWE-bench Verified". August 13, 2024. Retrieved from https://openai.com/index/introducing-swe-bench-verified/
- ↑ Epoch AI. (2024). "What skills does SWE-bench Verified evaluate?". Retrieved from https://epoch.ai/blog/what-skills-does-swe-bench-verified-evaluate