SWE-bench Verified

SWE-bench Verified
Overview
Full name Software Engineering Benchmark - Verified
Abbreviation SWE-bench V
Description A human-validated subset of real-world GitHub issues for evaluating AI models' autonomous software engineering capabilities
Release date 2024-08-13
Latest version 1.0
Benchmark updated 2024-08
Authors Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan
Organization Princeton University, OpenAI
Technical Details
Type Code Generation, Bug Fixing, Software Engineering
Modality Text, Code
Task format GitHub issue resolution
Number of tasks 500
Total examples 500 verified issues
Evaluation metric Resolve rate, Test pass rate
Domains Web frameworks, Scientific computing, Documentation, Machine learning
Languages Python
Performance
Human performance 100% (verified solvable)
Baseline 33.2% (GPT-4o at launch)
SOTA score 74.5%
SOTA model Claude Opus 4.1
SOTA date 2025
Saturated No
Resources
Website Official website
Paper Paper
GitHub Repository
Dataset Download
License MIT
Predecessor SWE-bench


SWE-bench Verified is a rigorously validated software engineering benchmark designed to evaluate artificial intelligence models' ability to autonomously resolve real-world GitHub issues. Released on August 13, 2024, through a collaboration between Princeton University and OpenAI Preparedness, SWE-bench Verified consists of 500 carefully selected and human-verified problems from the original SWE-bench dataset. Each problem requires an AI model to understand a bug report or feature request, navigate a complex codebase, write solution code, and ensure all tests pass, mirroring the complete workflow of a professional software engineer.

Overview

SWE-bench Verified addresses critical limitations in the original SWE-bench dataset by ensuring all included problems are solvable and fairly evaluated. The benchmark tests whether AI systems can perform end-to-end software engineering tasks: from understanding issue descriptions to implementing working solutions that pass comprehensive test suites[1].

Motivation

The creation of SWE-bench Verified was motivated by several factors:

  • **Quality Issues**: 68.3% of original SWE-bench samples had problems with underspecified descriptions or unfair test criteria
  • **Evaluation Reliability**: Need for accurate measurement of AI coding capabilities
  • **Real-World Relevance**: Focus on actual software engineering tasks rather than synthetic problems
  • **Autonomous Capability**: Testing complete problem-solving rather than code completion

Problem Characteristics

Repository Distribution

SWE-bench Verified draws from 12 popular Python open-source projects:

| Repository | Description | Approximate % of Dataset |
|---|---|---|
| Django | Web framework | ~45% |
| SymPy | Symbolic mathematics | ~15% |
| Sphinx | Documentation generator | ~10% |
| Matplotlib | Plotting library | ~8% |
| Scikit-learn | Machine learning library | ~7% |
| Flask | Micro web framework | ~5% |
| Requests | HTTP library | ~3% |
| Pytest | Testing framework | ~2% |
| Astropy | Astronomy tools | ~2% |
| Xarray | N-D labeled arrays | ~1% |
| Seaborn | Statistical visualization | ~1% |
| Pylint | Code analysis | ~1% |

The five largest repositories (Django, SymPy, Sphinx, Matplotlib, Scikit-learn) account for over 80% of the benchmark[2].
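
These percentages can be checked directly against the published dataset. The sketch below is a rough check, assuming the Hugging Face release exposes a `repo` field with values such as "django/django" (field names may differ between dataset versions):

```python
# Minimal sketch: count issues per repository in the SWE-bench Verified release.
# Assumes a "repo" field exists; verify against the dataset version you download.
from collections import Counter

from datasets import load_dataset

dataset = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
counts = Counter(example["repo"] for example in dataset)

for repo, n in counts.most_common():
    print(f"{repo}: {n} issues ({100 * n / len(dataset):.1f}%)")
```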

Problem Categories

| Category | Description | Example Tasks |
|---|---|---|
| Bug Fixes | Resolving reported bugs | Fixing edge cases, correcting logic errors |
| Feature Implementation | Adding new functionality | Implementing requested features |
| Performance Issues | Optimization problems | Improving efficiency, reducing memory usage |
| Documentation | Documentation-related issues | Updating docstrings, fixing examples |
| Compatibility | Cross-version compatibility | Python version compatibility fixes |
| Testing | Test-related issues | Fixing test failures, adding test coverage |

Difficulty Distribution

| Difficulty | Number of Problems | Time to Solve | Description |
|---|---|---|---|
| Easy | 196 | <15 minutes | Simple bug fixes, minor changes |
| Medium | 259 | 15-60 minutes | Moderate complexity, multiple file changes |
| Hard | 45 | >1 hour | Complex issues, architectural changes |

Human Validation Process

Validation Methodology

The validation process involved rigorous human review:

| Step | Process | Outcome |
|---|---|---|
| Annotator Selection | 93 Python-experienced developers recruited | Expert review team assembled |
| Review Protocol | Detailed rubric for evaluation | Consistent assessment criteria |
| Triple Review | Each problem reviewed by 3 independent annotators | Multiple perspectives |
| Quality Criteria | Assessed clarity, test appropriateness, solvability | Comprehensive evaluation |
| Filtering | Problems failing criteria removed | 68.3% filtered out |

Validation Criteria

| Criterion | Description | Failure Rate |
|---|---|---|
| Problem Clarity | Issue description must be unambiguous | 38.3% |
| Test Fairness | Tests must not incorrectly fail valid solutions | 61.1% |
| Solvability | Problem must be solvable with provided information | Variable |
| Reproducibility | Issue must be reproducible in test environment | Variable |

Evaluation Methodology

Task Structure

For each problem, an AI agent receives:

  • **Codebase**: Complete repository at specific commit
  • **Issue Description**: Original GitHub issue text
  • **Test Environment**: Docker container with dependencies

The agent must:

  1. Understand the issue description
  2. Explore and understand the codebase
  3. Identify relevant files and functions
  4. Implement a solution
  5. Verify the solution passes all tests
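
For illustration, the sketch below loads one task instance from the Hugging Face release and prints the inputs described above. The field names (`problem_statement`, `FAIL_TO_PASS`, and so on) follow the published dataset and are assumptions here; they may change between versions.

```python
# Sketch: inspect the inputs that define a single SWE-bench Verified task.
# Field names are taken from the published Hugging Face dataset; verify them
# against the dataset version you actually download.
from datasets import load_dataset

dataset = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
task = dataset[0]

print(task["instance_id"])        # unique identifier for the issue
print(task["repo"])               # repository to check out
print(task["base_commit"])        # commit the agent starts from
print(task["problem_statement"])  # original GitHub issue text
print(task["FAIL_TO_PASS"])       # tests expected to flip from failing to passing
print(task["PASS_TO_PASS"])       # tests expected to keep passing
```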

Test Categories

| Test Type | Description | Purpose |
|---|---|---|
| FAIL_TO_PASS | Tests that should fail before fix, pass after | Verify issue is resolved |
| PASS_TO_PASS | Tests that should remain passing | Ensure no regression |
| Additional Tests | Other repository tests | Comprehensive validation |

Both FAIL_TO_PASS and PASS_TO_PASS tests must succeed for a solution to be considered correct.
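
Stated in code, this criterion looks roughly like the sketch below, where `test_outcomes` is a hypothetical mapping from test identifier to whether it passed after the candidate patch was applied (not a real swebench API).

```python
# Minimal sketch of the resolution criterion: an instance counts as resolved
# only if every FAIL_TO_PASS test now passes and every PASS_TO_PASS test still
# passes. `test_outcomes` is a hypothetical harness result.
def is_resolved(
    fail_to_pass: list[str],
    pass_to_pass: list[str],
    test_outcomes: dict[str, bool],
) -> bool:
    required = fail_to_pass + pass_to_pass
    return all(test_outcomes.get(test, False) for test in required)
```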

Scoring Metrics

| Metric | Description | Calculation |
|---|---|---|
| Resolve Rate | Percentage of issues fully resolved | (Resolved issues / Total issues) × 100% |
| Pass@k | Success rate within k attempts | Percentage resolved in k tries |
| Test Pass Rate | Individual test success rate | (Passing tests / Total tests) × 100% |
| Partial Credit | Credit for partial solutions | Based on test subsets passed |
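
As a rough illustration of the first two metrics, the sketch below computes the resolve rate over per-instance outcomes and a simple pass@k in the table's informal sense (solved within the first k attempts), rather than the unbiased combinatorial estimator used in some papers.

```python
# Sketch of the headline metrics over hypothetical per-instance results.
def resolve_rate(resolved: list[bool]) -> float:
    # Share of instances fully resolved, as a percentage.
    return 100.0 * sum(resolved) / len(resolved)

def pass_at_k(attempts: list[list[bool]], k: int) -> float:
    # attempts[i] holds the ordered outcomes of each attempt on instance i.
    solved = [any(outcomes[:k]) for outcomes in attempts]
    return 100.0 * sum(solved) / len(solved)

# Example: 2 of 3 instances resolved -> 66.7% resolve rate.
print(resolve_rate([True, True, False]))
print(pass_at_k([[False, True], [True, False], [False, False]], k=2))
```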

Performance Analysis

Current Leaderboard (2025)

| Rank | Model | Resolve Rate | Organization | Cost/Test | Time (avg) |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.1 | 74.5% | Anthropic | ~$2.50 | ~600s |
| 2 | Claude Sonnet 4 | 72.7% (80.2% with parallel compute) | Anthropic | $1.24 | 426s |
| 3 | Claude Opus 4 | 72.5% (79.4% with parallel compute) | Anthropic | ~$2.50 | ~600s |
| 4 | Claude 3.7 Sonnet (scaffolded) | 70.2% | Anthropic | ~$1.50 | ~500s |
| 5 | Augment Agent (Claude 3.7 + o1) | 65.4% | Augment Code | Variable | Variable |
| 6 | Gemini 2.5 Pro | 63.8% | Google | ~$2.00 | ~500s |
| 7 | Grok 4 | 58.6% | xAI | ~$2.00 | ~550s |
| 8 | DeepSeek R1-0528 | 57.6% | DeepSeek | ~$0.50 | ~400s |
| 9 | GPT-4.1 | 54.6% | OpenAI | ~$2.00 | ~400s |
| 10 | OpenAI o3 | 49.8% | OpenAI | ~$3.00 | ~700s |
| 11 | Claude 3.5 Sonnet (upgraded) | 49.0% | Anthropic | ~$1.00 | ~450s |
| 12 | GPT-4o | 33.2% | OpenAI | ~$1.50 | ~350s |

Historical Progress

| Date | Best Score | Model | Key Innovation |
|---|---|---|---|
| Aug 2024 | 33.2% | GPT-4o | Baseline at launch |
| Oct 2024 | 45% | Previous SOTA | Improved scaffolding |
| Dec 2024 | 49% | Claude 3.5 Sonnet | Better code understanding |
| Feb 2025 | 57.6% | DeepSeek R1 | Reasoning improvements |
| May 2025 | 72.5% | Claude Opus 4 | Advanced problem solving |

Agent Architectures

Successful Approaches

| Approach | Description | Performance Impact |
|---|---|---|
| Agentless | Direct patch generation without complex agents | Baseline approach |
| SWE-agent | Interactive agent with specialized tools | +15-20% over baseline |
| Hybrid Models | Combining multiple models (for example Claude + o1) | +20-25% improvement |
| Custom Scaffolding | Task-specific prompting and tooling | +8-10% improvement |
| Multi-attempt | Multiple solution attempts with verification | +5-10% improvement |
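
The multi-attempt approach in the table above can be read as a simple retry loop. The sketch below is a hypothetical outline, not a real SWE-agent or swebench interface: `generate_patch` and `run_tests` stand in for the model call and the test harness.

```python
# Hypothetical multi-attempt loop: propose a patch, verify it against the
# instance's tests, and retry on failure. Both callables are placeholders.
def solve_with_retries(task, generate_patch, run_tests, max_attempts: int = 3):
    for attempt in range(max_attempts):
        patch = generate_patch(task, attempt=attempt)
        if run_tests(task, patch):  # True when FAIL_TO_PASS and PASS_TO_PASS all pass
            return patch
    return None  # unresolved after max_attempts
```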

Tool Usage

Successful agents typically use:

| Tool | Function | Usage Frequency |
|---|---|---|
| File Browser | Navigate repository structure | High |
| Code Search | Find relevant code sections | High |
| File Editor | Modify source files | High |
| Test Runner | Execute tests | Medium |
| Debugger | Debug failing tests | Low |
| Documentation Reader | Access project docs | Medium |

Skills Evaluated

Core Competencies

| Skill | Description | Importance |
|---|---|---|
| Code Comprehension | Understanding existing codebases | Critical |
| Debugging | Identifying root causes of issues | Critical |
| Implementation | Writing correct solution code | Critical |
| Testing | Understanding and passing test suites | High |
| Navigation | Finding relevant files and functions | High |
| Problem Analysis | Understanding issue descriptions | High |

Technical Skills

| Area | Specific Skills | Frequency |
|---|---|---|
| Python Proficiency | Syntax, idioms, standard library | Very High |
| Framework Knowledge | Django, Flask, etc. | High |
| Algorithm Design | Efficient problem solving | Medium |
| API Understanding | Library interfaces | High |
| Error Handling | Exception management | Medium |
| Performance Optimization | Efficiency improvements | Low |

Limitations and Criticisms

Current Limitations

| Limitation | Description | Impact |
|---|---|---|
| Python Only | Limited to Python repositories | Reduced generalizability |
| Bug Fix Focus | ~80% bug fixes vs. feature development | Skewed task distribution |
| Repository Bias | Django dominates dataset | Overrepresentation |
| Short Tasks | Most solvable in <1 hour | Limited complexity |
| Structured Descriptions | More LLM-friendly than typical issues | Easier than real-world |

Methodological Concerns

  1. **Data Contamination Risk**: Public GitHub issues may appear in training data
  2. **Test Suite Quality**: Some test suites may not fully specify requirements
  3. **Limited Scope**: Doesn't test the full software development lifecycle
  4. **Single Language**: Python-only scope limits broader applicability

Related Benchmarks

Comparison with Other Benchmarks

| Benchmark | Focus | Size | Key Difference |
|---|---|---|---|
| SWE-bench (original) | Unfiltered GitHub issues | 2,294 | Includes unsolvable problems |
| SWE-bench Verified | Validated GitHub issues | 500 | Human-verified quality |
| HumanEval | Function implementation | 164 | Synthetic problems |
| MBPP | Basic programming | 974 | Simple tasks |
| CodeContests | Competitive programming | 10,000+ | Algorithm focus |
| Multi-SWE-bench | Multilingual issues | Variable | Multiple languages |

Implementation and Usage

Setup and Installation

```bash
# Install SWE-bench
pip install swebench

# Set up evaluation environment
docker pull swebench/evaluation:latest
```

```python
# Download the verified dataset
from datasets import load_dataset

dataset = load_dataset("princeton-nlp/SWE-bench_Verified")
```

Evaluation Example

```python
# Example evaluation loop (illustrative sketch: `model.solve_issue`, `run_tests`,
# and `calculate_metrics` are placeholder hooks, not functions provided by the
# swebench package).

def run_evaluation(model, dataset):
    results = []
    for problem in dataset:
        # Model attempts to solve the issue
        solution = model.solve_issue(
            repo=problem['repo'],
            issue=problem['issue_description'],
            base_commit=problem['base_commit'],
        )

        # Run tests to verify the solution
        test_results = run_tests(solution, problem['tests'])
        results.append(test_results)

    return calculate_metrics(results)
```

Impact and Applications

Research Contributions

| Area | Contribution | Impact |
|---|---|---|
| Agent Development | Driving sophisticated coding agents | Rapid progress in AI coding |
| Evaluation Standards | Setting rigorous benchmarking standards | Improved measurement |
| Tool Development | Inspiring new developer tools | Practical applications |
| Training Methods | New approaches to code generation | Better models |

Industry Applications

  1. **Automated Bug Fixing**: Production systems for automatic issue resolution
  2. **Code Review Assistance**: AI-powered code review tools
  3. **Developer Productivity**: IDE integrations for issue solving
  4. **Training Data**: Improving code generation models
  5. **Hiring Assessment**: Evaluating human developer skills

Future Directions

Planned Improvements

| Enhancement | Description | Timeline |
|---|---|---|
| Language Expansion | Support for Java, JavaScript, Go | 2025-2026 |
| Feature Development | More feature implementation tasks | 2025 |
| Longer Tasks | Multi-day software projects | 2026 |
| Multi-file Edits | Complex refactoring tasks | 2025 |
| Real-time Updates | Continuous benchmark updates | Ongoing |

Research Opportunities

  1. **Multi-Agent Collaboration**: Teams of AI agents working together
  2. **Human-AI Collaboration**: Hybrid approaches to problem solving
  3. **Cross-Repository Learning**: Transfer learning across codebases
  4. **Explanation Generation**: Understanding AI problem-solving strategies
  5. **Error Prevention**: Proactive bug detection before deployment

Significance

SWE-bench Verified represents a crucial milestone in evaluating AI's practical software engineering capabilities. By focusing on real-world problems from production codebases, it provides a realistic assessment of whether AI systems can perform the day-to-day tasks of professional developers. The benchmark's rapid adoption and the intense competition to improve scores demonstrate its value to both research and industry.

The progression from 33% to over 70% resolution rates in less than a year shows remarkable progress in AI coding capabilities. However, the remaining gap to human performance (100% by definition, as problems are human-verified as solvable) indicates substantial room for improvement. As models continue to advance, SWE-bench Verified serves as a critical measure of progress toward truly autonomous software engineering.

References

  1. OpenAI. (2024). "Introducing SWE-bench Verified". August 13, 2024. Retrieved from https://openai.com/index/introducing-swe-bench-verified/
  2. Epoch AI. (2024). "What skills does SWE-bench Verified evaluate?". Retrieved from https://epoch.ai/blog/what-skills-does-swe-bench-verified-evaluate
