SWE-bench
| SWE-bench | |
|---|---|
| Overview | |
| Full name | Software Engineering Benchmark |
| Abbreviation | SWE-bench |
| Description | A benchmark for evaluating large language models and AI agents on real-world software engineering tasks from GitHub |
Property "Description" (as page type) with input value "A benchmark for evaluating large language models and AI agents on real-world software engineering tasks from GitHub" contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process. |
| Release date | 2023-10-10 |
| Latest version | SWE-bench Live |
| Benchmark updated | 2025-08-19 |
| Authors | Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan |
| Organization | Princeton University, University of Chicago, Stanford University |
| Technical Details | |
| Type | Software Engineering, Code Generation, Bug Fixing |
| Modality | Text, Code |
| Task format | Issue resolution, Code editing |
| Number of tasks | 2294 |
| Total examples | 2294 (Full), 500 (Verified), 300 (Lite), 517 (Multimodal), 1319 (Live) |
| Evaluation metric | % Resolved, Test Pass Rate |
| Domains | Software Engineering, Python Programming, Open Source Development |
| Languages | Python, English |
| Performance | |
| Baseline | 1.96% |
| SOTA score | 74.5% |
| SOTA model | Claude 4.1 Opus |
| SOTA date | 2025-08-02 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | MIT License |
SWE-bench (Software Engineering Benchmark) is a comprehensive benchmark designed to evaluate large language models and AI agents on their ability to solve real-world software engineering tasks. Released on October 10, 2023, by researchers at Princeton University, SWE-bench tests whether AI systems can autonomously resolve genuine GitHub issues from popular open-source repositories. The benchmark has become the de facto standard for evaluating AI-powered software engineering capabilities, with over 2 million downloads and adoption by leading AI research organizations worldwide.[1][2]
Overview
SWE-bench represents a paradigm shift in evaluating code generation capabilities of AI systems. Unlike traditional benchmarks that focus on isolated coding problems, SWE-bench presents AI agents with complete codebases and actual bug reports or feature requests from real software projects. This approach tests not just code writing abilities, but also code comprehension, debugging, testing, and the ability to navigate complex software architectures.[1]
The benchmark addresses a critical gap in AI evaluation by measuring performance on tasks that professional software engineers encounter daily. Each task in SWE-bench requires understanding issue descriptions, identifying relevant files in large codebases, implementing appropriate fixes, and ensuring that all tests pass - mirroring the complete software development workflow.
Key Characteristics
SWE-bench distinguishes itself through several unique features:
- Real-world authenticity: All tasks are derived from actual GitHub issues and their corresponding pull requests
- Execution-based evaluation: Solutions are validated using the repository's own test suites, not just code similarity metrics
- Multi-file coordination: Tasks often require changes across multiple files, classes, and functions
- Large context handling: AI agents must process repositories with thousands of files and millions of lines of code
- Continuous updates: New instances can be added to prevent training data contamination
Methodology
Task Construction
SWE-bench tasks are constructed through a systematic process (a sketch of the resulting task record follows this list):[1]
- Issue Selection: Real issues from popular Python repositories are identified along with their corresponding pull requests that resolved them
- Test Identification: The benchmark identifies tests that transition from failing to passing when the fix is applied (FAIL_TO_PASS tests)
- Environment Setup: Each task includes the exact repository state before the fix was applied
- Validation: Solutions are verified using both the specific fix tests and regression tests (PASS_TO_PASS tests)
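A task instance therefore bundles the issue text, the pre-fix repository state, and the two test sets. The sketch below shows how such a record might be represented; the field names mirror those in the released dataset, but treat it as an illustration rather than the authoritative schema, and the example instance ID is hypothetical.

```python
# Illustrative record for a single SWE-bench task; field names follow the
# released dataset, but this is a sketch, not the official schema.
from dataclasses import dataclass, field

@dataclass
class SWEBenchInstance:
    instance_id: str                 # e.g. "django__django-12345" (hypothetical ID)
    repo: str                        # source repository, e.g. "django/django"
    base_commit: str                 # repository state before the fix was applied
    problem_statement: str           # GitHub issue text presented to the agent
    patch: str                       # gold patch from the resolving pull request
    test_patch: str                  # patch that adds or updates the reference tests
    FAIL_TO_PASS: list[str] = field(default_factory=list)  # tests that must flip from failing to passing
    PASS_TO_PASS: list[str] = field(default_factory=list)  # regression tests that must keep passing
```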
Evaluation Framework
The evaluation process employs a sophisticated infrastructure:[3]
- Containerized Environments: Each evaluation runs in an isolated Docker container with the appropriate dependencies
- Automated Testing: Solutions are automatically tested using the repository's test suite
- Time Limits: Agents typically have 45 minutes to complete each task
- Reset Mechanism: All files are reset to their original state after each agent run
The primary evaluation metric is the % Resolved rate: the percentage of tasks where the agent successfully implements a solution that passes all required tests.
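As a rough sketch, the metric can be computed from per-instance outcomes as below; the result dictionaries and their keys are hypothetical and do not reflect the harness's actual report format.

```python
# Minimal sketch of the % Resolved metric: an instance counts as resolved only
# if every FAIL_TO_PASS test now passes and no PASS_TO_PASS test regresses.
def percent_resolved(results: list[dict]) -> float:
    """results: hypothetical per-instance outcomes, e.g.
    {"fail_to_pass_ok": True, "pass_to_pass_ok": True}."""
    if not results:
        return 0.0
    resolved = sum(1 for r in results if r["fail_to_pass_ok"] and r["pass_to_pass_ok"])
    return 100.0 * resolved / len(results)


print(percent_resolved([
    {"fail_to_pass_ok": True, "pass_to_pass_ok": True},
    {"fail_to_pass_ok": True, "pass_to_pass_ok": False},  # regression: not resolved
]))  # -> 50.0
```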
Benchmark Variants
SWE-bench has evolved into multiple specialized variants to address different evaluation needs:
SWE-bench Full
The original benchmark containing 2,294 task instances across 12 popular Python repositories. This represents the most comprehensive and challenging evaluation set, requiring agents to handle the full complexity of real-world software development.[1]
SWE-bench Verified
Released in collaboration with OpenAI, this variant contains 500 human-validated instances verified by 93 experienced Python developers. Each task was carefully reviewed to ensure:[4]
- Clear problem descriptions
- Unambiguous solutions
- Proper test coverage
- Reasonable difficulty levels
SWE-bench Lite
A curated subset of 300 instances designed for more efficient evaluation. These tasks focus on self-contained functional bug fixes that can be resolved with targeted code changes. SWE-bench Lite has become popular for rapid prototyping and frequent evaluation cycles.[5]
SWE-bench Multimodal
Introduced in October 2024, this variant contains 517 instances that include visual elements such as:[6]
- Diagrams and charts
- UI screenshots
- Error visualizations
- Architecture diagrams
This variant tests whether AI systems can integrate visual information when solving software engineering tasks.
SWE-bench Live
The most recent variant containing 1,319 task instances created after 2024, covering 93 repositories. This ensures evaluation on problems that couldn't have been in any model's training data.[7]
Task Categories and Complexity
SWE-bench tasks span various categories of software engineering challenges:
Bug Fixes
The majority of tasks involve fixing bugs in existing code. These range from simple logic errors to complex multi-component issues requiring deep understanding of system architecture.
Feature Implementation
Some tasks require implementing new features based on user requests, testing the ability to extend existing codebases while maintaining compatibility.
Performance Optimization
Tasks may involve improving code efficiency, reducing memory usage, or optimizing algorithms while maintaining correctness.
Test Writing
Certain instances require writing or improving test cases, evaluating understanding of test-driven development practices.
Documentation Updates
Some tasks involve updating documentation to match code changes, testing comprehensive software maintenance abilities.
Performance Results
Current Leaderboard (August 2025)
Performance on SWE-bench has improved dramatically since its release:
SWE-bench Verified Top 10
| Rank | Model | Organization | % Resolved | Date |
|---|---|---|---|---|
| 1 | Claude 4.1 Opus | Anthropic | 74.5% | 2025-08-02 |
| 2 | GPT-5 (medium reasoning) | OpenAI | 65.00% | 2025-08-07 |
| 3 | Claude 4 Sonnet | Anthropic | 64.93% | 2025-05-21 |
| 4 | GPT-5 mini (medium reasoning) | OpenAI | 59.80% | 2025-08-07 |
| 5 | o3 | OpenAI | 58.40% | 2025-05-21 |
| 6 | Qwen3-Coder 480B | Alibaba | 55.40% | 2025-08-02 |
| 7 | Gemini 2.5 Pro | Google | 53.60% | 2025-05-21 |
| 8 | Claude 3.7 Sonnet | Anthropic | 52.80% | 2025-05-21 |
| 9 | o4-mini | OpenAI | 45.00% | 2025-05-21 |
| 10 | DeepSeek-Coder V2.5 | DeepSeek | 43.20% | 2025-03-15 |
Historical Progress
The improvement in SWE-bench performance demonstrates rapid advancement in AI capabilities:
| Time Period | Best Performance | Leading Model | Key Milestone |
|---|---|---|---|
| October 2023 | 1.96% | Claude 2 | Initial benchmark release |
| March 2024 | 12.47% | SWE-agent + GPT-4 | First system above 10% |
| June 2024 | 18.00% | Devin | Commercial agent breakthrough |
| December 2024 | 43.00% | Amazon Q Developer | Enterprise adoption |
| August 2025 | 74.5% | Claude 4.1 Opus | Current state-of-the-art |
Technical Implementation
Infrastructure Requirements
Running SWE-bench evaluations requires the following (a data-loading sketch appears after this list):[3]
- Python Environment: Python 3.8+ with conda package manager
- Docker: For containerized evaluation environments
- Compute Resources: Minimum 16GB RAM, recommended 32GB+ for parallel evaluation
- Storage: Approximately 50GB for full dataset and evaluation artifacts
- API Access: For testing commercial models (OpenAI, Anthropic, etc.)
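For reference, the task data itself is lightweight to obtain. The sketch below assumes the datasets are still published on the Hugging Face Hub under the princeton-nlp organization and that the `datasets` library is installed.

```python
# Sketch: pull the benchmark splits with the Hugging Face `datasets` library.
# Assumes the dataset ids princeton-nlp/SWE-bench and princeton-nlp/SWE-bench_Lite
# remain available under those names.
from datasets import load_dataset

full = load_dataset("princeton-nlp/SWE-bench", split="test")
lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

print(len(full), len(lite))                # expected: 2294 and 300 instances
print(full[0]["problem_statement"][:200])  # issue text shown to the agent
```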
Evaluation Pipeline
The standard evaluation pipeline consists of the following steps, sketched in code after this list:
- Environment Setup: Creating isolated Docker containers for each task
- Repository Initialization: Checking out the appropriate repository version
- Issue Presentation: Providing the issue description to the AI agent
- Code Generation: Agent produces proposed changes
- Application: Applying changes to the codebase
- Testing: Running FAIL_TO_PASS and PASS_TO_PASS tests
- Scoring: Computing resolution rate and other metrics
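The control flow can be summarized in a few lines. The sketch below reuses the task-record sketch from earlier and injects the container, agent, and test machinery as callables; none of these names come from the official harness, they are placeholders.

```python
# Conceptual sketch of the pipeline above; container setup and repository
# checkout (steps 1-2) are assumed to happen inside the injected callables.
from typing import Callable, Mapping

def evaluate_instance(
    instance: "SWEBenchInstance",
    generate_patch: Callable[[str], str],                  # steps 3-4: issue text -> unified diff
    apply_patch: Callable[[str], bool],                    # step 5: apply diff in the container
    run_tests: Callable[[list[str]], Mapping[str, str]],   # step 6: test ids -> "PASSED"/"FAILED"
) -> bool:
    """Return True if the instance counts as resolved."""
    diff = generate_patch(instance.problem_statement)
    if not apply_patch(diff):
        return False
    required = list(instance.FAIL_TO_PASS) + list(instance.PASS_TO_PASS)
    report = run_tests(required)
    return all(report.get(test) == "PASSED" for test in required)  # step 7: scoring
```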
Metrics and Scoring
SWE-bench employs several evaluation metrics (a pass@k sketch follows this list):
- % Resolved: Primary metric measuring the percentage of successfully resolved tasks
- Pass@k: Success rate when allowing k attempts
- Test Pass Rate: Percentage of individual tests passed
- Regression Rate: Frequency of breaking existing functionality
- Efficiency Metrics: Token usage, API calls, and runtime
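Pass@k is usually reported with the unbiased estimator from the code-generation literature (Chen et al., 2021); the sketch below shows that calculation, not necessarily the exact routine used by the SWE-bench harness.

```python
# Unbiased pass@k estimator: probability that at least one of k sampled
# attempts (out of n total, c of which resolved the task) succeeds.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # fewer failures than k samples: success is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=3, k=1))  # 0.3
print(pass_at_k(n=10, c=3, k=5))  # ~0.917
```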
Agent Architectures
Various agent architectures have been developed for SWE-bench:
SWE-agent
The official baseline agent, developed by Princeton researchers, features the following (a simplified interaction loop is sketched after this list):[8]
- Interactive bash environment
- Specialized commands for code navigation and editing
- Iterative refinement based on test feedback
- Support for multiple LLM backends
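Conceptually, such agents run an observe-act loop: the model sees the issue plus the latest command output and replies with its next shell command. The sketch below is a highly simplified rendering of that idea; `query_llm` and `run_in_container` are hypothetical placeholders, not SWE-agent's actual interface.

```python
# Simplified observe-act loop for an issue-resolving agent (illustrative only).
from typing import Callable

def agent_loop(
    issue: str,
    query_llm: Callable[[str], str],          # prompt -> next shell command, or "submit"
    run_in_container: Callable[[str], str],   # shell command -> captured output
    max_steps: int = 30,
) -> None:
    observation = f"ISSUE:\n{issue}\n"
    for _ in range(max_steps):
        command = query_llm(observation).strip()
        if command == "submit":               # agent believes its edits are complete
            break
        output = run_in_container(command)    # e.g. grep, open/edit a file, run tests
        observation = f"$ {command}\n{output}\n"
```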
Commercial Agents
Several commercial systems have been optimized for SWE-bench:
- Amazon Q Developer Agent: Achieves 37.1% on the full benchmark
- Atlassian Rovo Dev: Current leader on the full benchmark at 41.98%
- GitHub Copilot Workspace: Integrated development environment approach
- Cursor AI: IDE-based agent with human-in-the-loop capabilities
Research Innovations
Recent research has introduced novel approaches:
- Multi-agent systems: Coordinating specialized agents for different subtasks
- Retrieval-augmented generation: Enhancing context with relevant code examples
- Self-debugging: Iterative refinement based on test failures
- Tool-augmented agents: Integration with static analysis and debugging tools
Impact and Adoption
Academic Impact
SWE-bench has catalyzed significant research in AI-powered software engineering:
- ICLR 2024 Oral Presentation: Selected for oral presentation at a top ML conference[9]
- 2+ Million Downloads: Widespread adoption in the research community
- 50+ Research Papers: Citing and building upon SWE-bench
- Multiple Extensions: Multi-lingual, multi-modal, and domain-specific variants
Industry Adoption
Major technology companies use SWE-bench for:
- Model Development: Training and evaluating coding assistants
- Product Benchmarking: Comparing commercial offerings
- Research Direction: Identifying areas for improvement
- Hiring Assessment: Evaluating AI tool capabilities
Open Source Contributions
The SWE-bench ecosystem has fostered:
- Community Leaderboards: Public tracking of model performance
- Evaluation Tools: Open-source frameworks for running evaluations
- Dataset Extensions: Community-contributed task instances
- Agent Implementations: Diverse approaches to solving SWE-bench tasks
Limitations and Challenges
Despite its success, SWE-bench has known limitations:
Python-Centric Focus
Currently limited to Python repositories, not representing the full diversity of programming languages and paradigms used in industry.
Repository Selection Bias
The 12 selected repositories may not represent all software engineering domains and complexity levels.
Test Quality Dependency
Evaluation quality depends on the completeness and correctness of repository test suites.
Computational Requirements
Full evaluation requires significant computational resources, limiting accessibility for some researchers.
Future Directions
The SWE-bench team and community are working on several extensions:
Multi-Language Support
- Multi-SWE-bench: Extending to Java, JavaScript, and other languages[10]
- Cross-language tasks: Problems requiring polyglot programming skills
Enhanced Evaluation
- Human evaluation protocols: Supplementing automated metrics
- Code quality metrics: Beyond just functional correctness
- Security and performance: Evaluating non-functional requirements
Real-Time Evaluation
- Continuous benchmarking: Regular evaluation on fresh issues
- Live deployment testing: Evaluation in production-like environments
- User study integration: Incorporating developer feedback
Related Benchmarks
SWE-bench complements other code generation and software engineering benchmarks:
- HumanEval: Isolated Python programming problems
- MBPP: Mostly basic Python programming tasks
- CodeContests: Competitive programming challenges
- DS-1000: Data science coding problems
- RepoEval: Repository-level code completion
- CrossCodeEval: Cross-file code completion
See Also
- Software engineering
- Code generation
- Large language models
- AI agents
- GitHub
- Test-driven development
- Automated debugging
- Program synthesis
References
1. Jimenez, Carlos E., et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" arXiv preprint arXiv:2310.06770 (2023).
2. SWE-bench Official Website. https://www.swebench.com/. Accessed August 2025.
3. SWE-bench GitHub Repository. https://github.com/princeton-nlp/SWE-bench. Accessed August 2025.
4. OpenAI. "SWE-bench Verified: A Human-Validated Subset." 2024.
5. Princeton NLP. "SWE-bench Lite Documentation." 2024.
6. Yang, John, et al. "SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?" arXiv:2410.03859 (2024).
7. Jimenez, Carlos E., et al. "SWE-bench Goes Live!" arXiv:2505.23419 (2025).
8. Yang, John, et al. "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering." 2024.
9. ICLR 2024. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" Oral presentation.
10. Chen, et al. "Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving." arXiv:2504.02605 (2025).