SWE-bench
| SWE-bench | |
|---|---|
| Overview | |
| Full name | Software Engineering Benchmark |
| Abbreviation | SWE-bench |
| Description | A benchmark for evaluating large language models and AI agents on real-world software engineering tasks from GitHub |
Property "Description" (as page type) with input value "A benchmark for evaluating large language models and AI agents on real-world software engineering tasks from GitHub" contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process. |
| Release date | 2023-10-10 |
| Latest version | SWE-bench Live |
| Benchmark updated | 2025-08-19 |
| Authors | Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan |
| Organization | Princeton University, University of Chicago, Stanford University |
| Technical Details | |
| Type | Software Engineering, Code Generation, Bug Fixing |
| Modality | Text, Code |
| Task format | Issue resolution, Code editing |
| Number of tasks | 2294 |
| Total examples | 2294 (Full), 500 (Verified), 300 (Lite), 517 (Multimodal), 1319 (Live) |
| Evaluation metric | % Resolved, Test Pass Rate |
| Domains | Software Engineering, Python Programming, Open Source Development |
| Languages | Python, English |
| Performance | |
| Baseline | 1.96% |
| SOTA score | 74.5% |
| SOTA model | Claude 4.1 Opus |
| SOTA date | 2025-08-02 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | MIT License |
SWE-bench (Software Engineering Benchmark) is a comprehensive benchmark designed to evaluate large language models and AI agents on their ability to solve real-world software engineering tasks. Released on October 10, 2023, by researchers at Princeton University, SWE-bench tests whether AI systems can autonomously resolve genuine GitHub issues from popular open-source repositories. The benchmark has become the de facto standard for evaluating AI-powered software engineering capabilities, with over 2 million downloads and adoption by leading AI research organizations worldwide.[1][2]
Overview
SWE-bench represents a paradigm shift in evaluating code generation capabilities of AI systems. Unlike traditional benchmarks that focus on isolated coding problems, SWE-bench presents AI agents with complete codebases and actual bug reports or feature requests from real software projects. This approach tests not just code writing abilities, but also code comprehension, debugging, testing, and the ability to navigate complex software architectures.[1]
The benchmark addresses a critical gap in AI evaluation by measuring performance on tasks that professional software engineers encounter daily. Each task in SWE-bench requires understanding issue descriptions, identifying relevant files in large codebases, implementing appropriate fixes, and ensuring that all tests pass - mirroring the complete software development workflow.
Key Characteristics
SWE-bench distinguishes itself through several unique features:
- Real-world authenticity: All tasks are derived from actual GitHub issues and their corresponding pull requests
- Execution-based evaluation: Solutions are validated using the repository's own test suites, not just code similarity metrics
- Multi-file coordination: Tasks often require changes across multiple files, classes, and functions
- Large context handling: AI agents must process repositories with thousands of files and millions of lines of code
- Continuous updates: New instances can be added to prevent training data contamination
Methodology
Task Construction
SWE-bench tasks are constructed through a systematic process (a sketch of the resulting task record follows this list):[1]
- Issue Selection: Real issues from popular Python repositories are identified along with their corresponding pull requests that resolved them
- Test Identification: The benchmark identifies tests that transition from failing to passing when the fix is applied (FAIL_TO_PASS tests)
- Environment Setup: Each task includes the exact repository state before the fix was applied
- Validation: Solutions are verified using both the specific fix tests and regression tests (PASS_TO_PASS tests)
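A task instance therefore bundles the issue text, the pre-fix repository state, and the two test sets. The sketch below shows how such a record might be represented; the field names mirror those in the released dataset, but treat it as an illustration rather than the authoritative schema, and the example instance ID is hypothetical.

```python
# Illustrative record for a single SWE-bench task; field names follow the
# released dataset, but this is a sketch, not the official schema.
from dataclasses import dataclass, field

@dataclass
class SWEBenchInstance:
    instance_id: str                 # e.g. "django__django-12345" (hypothetical ID)
    repo: str                        # source repository, e.g. "django/django"
    base_commit: str                 # repository state before the fix was applied
    problem_statement: str           # GitHub issue text presented to the agent
    patch: str                       # gold patch from the resolving pull request
    test_patch: str                  # patch that adds or updates the reference tests
    FAIL_TO_PASS: list[str] = field(default_factory=list)  # tests that must flip from failing to passing
    PASS_TO_PASS: list[str] = field(default_factory=list)  # regression tests that must keep passing
```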
Evaluation Framework
The evaluation process employs a sophisticated infrastructure:[3]
- Containerized Environments: Each evaluation runs in an isolated Docker container with the appropriate dependencies
- Automated Testing: Solutions are automatically tested using the repository's test suite
- Time Limits: Agents typically have 45 minutes to complete each task
- Reset Mechanism: All files are reset to their original state after each agent run
The primary evaluation metric is the % Resolved rate: the percentage of tasks where the agent successfully implements a solution that passes all required tests.
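As a rough sketch, the metric can be computed from per-instance outcomes as below; the result dictionaries and their keys are hypothetical and do not reflect the harness's actual report format.

```python
# Minimal sketch of the % Resolved metric: an instance counts as resolved only
# if every FAIL_TO_PASS test now passes and no PASS_TO_PASS test regresses.
def percent_resolved(results: list[dict]) -> float:
    """results: hypothetical per-instance outcomes, e.g.
    {"fail_to_pass_ok": True, "pass_to_pass_ok": True}."""
    if not results:
        return 0.0
    resolved = sum(1 for r in results if r["fail_to_pass_ok"] and r["pass_to_pass_ok"])
    return 100.0 * resolved / len(results)


print(percent_resolved([
    {"fail_to_pass_ok": True, "pass_to_pass_ok": True},
    {"fail_to_pass_ok": True, "pass_to_pass_ok": False},  # regression: not resolved
]))  # -> 50.0
```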
Benchmark Variants
SWE-bench has evolved into multiple specialized variants to address different evaluation needs:
SWE-bench Full
The original benchmark containing 2,294 task instances across 12 popular Python repositories. This represents the most comprehensive and challenging evaluation set, requiring agents to handle the full complexity of real-world software development.[1]
SWE-bench Verified
Released in collaboration with OpenAI, this variant contains 500 human-validated instances verified by 93 experienced Python developers. Each task was carefully reviewed to ensure:[4]
- Clear problem descriptions
- Unambiguous solutions
- Proper test coverage
- Reasonable difficulty levels
SWE-bench Lite
A curated subset of 300 instances designed for more efficient evaluation. These tasks focus on self-contained functional bug fixes that can be resolved with targeted code changes. SWE-bench Lite has become popular for rapid prototyping and frequent evaluation cycles.[5]
SWE-bench Multimodal
Introduced in October 2024, this variant contains 517 instances that include visual elements such as:[6]
- Diagrams and charts
- UI screenshots
- Error visualizations
- Architecture diagrams
This variant tests whether AI systems can integrate visual information when solving software engineering tasks.
SWE-bench Live
The most recent variant containing 1,319 task instances created after 2024, covering 93 repositories. This ensures evaluation on problems that couldn't have been in any model's training data.[7]
Task Categories and Complexity
SWE-bench tasks span various categories of software engineering challenges:
Bug Fixes
The majority of tasks involve fixing bugs in existing code. These range from simple logic errors to complex multi-component issues requiring deep understanding of system architecture.
Feature Implementation
Some tasks require implementing new features based on user requests, testing the ability to extend existing codebases while maintaining compatibility.
Performance Optimization
Tasks may involve improving code efficiency, reducing memory usage, or optimizing algorithms while maintaining correctness.
Test Writing
Certain instances require writing or improving test cases, evaluating understanding of test-driven development practices.
Documentation Updates
Some tasks involve updating documentation to match code changes, testing comprehensive software maintenance abilities.
Performance Results
Current Leaderboard (August 2025)
Performance on SWE-bench has improved dramatically since its release:
SWE-bench Verified Top 10
| Rank | Model | Organization | % Resolved | Date |
|---|---|---|---|---|
| 1 | Claude 4.1 Opus | Anthropic | 74.5% | 2025-08-02 |
| 2 | GPT-5 (medium reasoning) | OpenAI | 65.00% | 2025-08-07 |
| 3 | Claude 4 Sonnet | Anthropic | 64.93% | 2025-05-21 |
| 4 | GPT-5 mini (medium reasoning) | OpenAI | 59.80% | 2025-08-07 |
| 5 | o3 | OpenAI | 58.40% | 2025-05-21 |
| 6 | Qwen3-Coder 480B | Alibaba | 55.40% | 2025-08-02 |
| 7 | Gemini 2.5 Pro | Google | 53.60% | 2025-05-21 |
| 8 | Claude 3.7 Sonnet | Anthropic | 52.80% | 2025-05-21 |
| 9 | o4-mini | OpenAI | 45.00% | 2025-05-21 |
| 10 | DeepSeek-Coder V2.5 | DeepSeek | 43.20% | 2025-03-15 |
Historical Progress
The improvement in SWE-bench performance demonstrates rapid advancement in AI capabilities:
| Time Period | Best Performance | Leading Model | Key Milestone |
|---|---|---|---|
| October 2023 | 1.96% | Claude 2 | Initial benchmark release |
| March 2024 | 12.47% | SWE-agent + GPT-4 | First system above 10% |
| June 2024 | 18.00% | Devin | Commercial agent breakthrough |
| December 2024 | 43.00% | Amazon Q Developer | Enterprise adoption |
| August 2025 | 74.5% | Claude 4.1 Opus | Current state-of-the-art |
Technical Implementation
Infrastructure Requirements
Running SWE-bench evaluations requires the following (a data-loading sketch appears after this list):[3]
- Python Environment: Python 3.8+ with conda package manager
- Docker: For containerized evaluation environments
- Compute Resources: Minimum 16GB RAM, recommended 32GB+ for parallel evaluation
- Storage: Approximately 50GB for full dataset and evaluation artifacts
- API Access: For testing commercial models (OpenAI, Anthropic, etc.)
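For reference, the task data itself is lightweight to obtain. The sketch below assumes the datasets are still published on the Hugging Face Hub under the princeton-nlp organization and that the `datasets` library is installed.

```python
# Sketch: pull the benchmark splits with the Hugging Face `datasets` library.
# Assumes the dataset ids princeton-nlp/SWE-bench and princeton-nlp/SWE-bench_Lite
# remain available under those names.
from datasets import load_dataset

full = load_dataset("princeton-nlp/SWE-bench", split="test")
lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

print(len(full), len(lite))                # expected: 2294 and 300 instances
print(full[0]["problem_statement"][:200])  # issue text shown to the agent
```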
Evaluation Pipeline
The standard evaluation pipeline consists of the following steps, sketched in code after this list:
- Environment Setup: Creating isolated Docker containers for each task
- Repository Initialization: Checking out the appropriate repository version
- Issue Presentation: Providing the issue description to the AI agent
- Code Generation: Agent produces proposed changes
- Application: Applying changes to the codebase
- Testing: Running FAIL_TO_PASS and PASS_TO_PASS tests
- Scoring: Computing resolution rate and other metrics
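The control flow can be summarized in a few lines. The sketch below reuses the task-record sketch from earlier and injects the container, agent, and test machinery as callables; none of these names come from the official harness, they are placeholders.

```python
# Conceptual sketch of the pipeline above; container setup and repository
# checkout (steps 1-2) are assumed to happen inside the injected callables.
from typing import Callable, Mapping

def evaluate_instance(
    instance: "SWEBenchInstance",
    generate_patch: Callable[[str], str],                  # steps 3-4: issue text -> unified diff
    apply_patch: Callable[[str], bool],                    # step 5: apply diff in the container
    run_tests: Callable[[list[str]], Mapping[str, str]],   # step 6: test ids -> "PASSED"/"FAILED"
) -> bool:
    """Return True if the instance counts as resolved."""
    diff = generate_patch(instance.problem_statement)
    if not apply_patch(diff):
        return False
    required = list(instance.FAIL_TO_PASS) + list(instance.PASS_TO_PASS)
    report = run_tests(required)
    return all(report.get(test) == "PASSED" for test in required)  # step 7: scoring
```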
Metrics and Scoring
SWE-bench employs several evaluation metrics (a pass@k sketch follows this list):
- % Resolved: Primary metric measuring the percentage of successfully resolved tasks
- Pass@k: Success rate when allowing k attempts
- Test Pass Rate: Percentage of individual tests passed
- Regression Rate: Frequency of breaking existing functionality
- Efficiency Metrics: Token usage, API calls, and runtime
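Pass@k is usually reported with the unbiased estimator from the code-generation literature (Chen et al., 2021); the sketch below shows that calculation, not necessarily the exact routine used by the SWE-bench harness.

```python
# Unbiased pass@k estimator: probability that at least one of k sampled
# attempts (out of n total, c of which resolved the task) succeeds.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # fewer failures than k samples: success is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=3, k=1))  # 0.3
print(pass_at_k(n=10, c=3, k=5))  # ~0.917
```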
Agent Architectures
Various agent architectures have been developed for SWE-bench:
SWE-agent
The official baseline agent, developed by Princeton researchers, features the following (a simplified interaction loop is sketched after this list):[8]
- Interactive bash environment
- Specialized commands for code navigation and editing
- Iterative refinement based on test feedback
- Support for multiple LLM backends
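Conceptually, such agents run an observe-act loop: the model sees the issue plus the latest command output and replies with its next shell command. The sketch below is a highly simplified rendering of that idea; `query_llm` and `run_in_container` are hypothetical placeholders, not SWE-agent's actual interface.

```python
# Simplified observe-act loop for an issue-resolving agent (illustrative only).
from typing import Callable

def agent_loop(
    issue: str,
    query_llm: Callable[[str], str],          # prompt -> next shell command, or "submit"
    run_in_container: Callable[[str], str],   # shell command -> captured output
    max_steps: int = 30,
) -> None:
    observation = f"ISSUE:\n{issue}\n"
    for _ in range(max_steps):
        command = query_llm(observation).strip()
        if command == "submit":               # agent believes its edits are complete
            break
        output = run_in_container(command)    # e.g. grep, open/edit a file, run tests
        observation = f"$ {command}\n{output}\n"
```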
Commercial Agents
Several commercial systems have been optimized for SWE-bench:
- Amazon Q Developer Agent: Achieves 37.1% on the full benchmark
- Atlassian Rovo Dev: Current leader on the full benchmark at 41.98%
- GitHub Copilot Workspace: Integrated development environment approach
- Cursor AI: IDE-based agent with human-in-the-loop capabilities
Research Innovations
Recent research has introduced novel approaches:
- Multi-agent systems: Coordinating specialized agents for different subtasks
- Retrieval-augmented generation: Enhancing context with relevant code examples
- Self-debugging: Iterative refinement based on test failures
- Tool-augmented agents: Integration with static analysis and debugging tools
Impact and Adoption
Academic Impact
SWE-bench has catalyzed significant research in AI-powered software engineering:
- ICLR 2024 Oral Presentation: Selected for oral presentation at a top ML conference[9]
- 2+ Million Downloads: Widespread adoption in the research community
- 50+ Research Papers: Citing and building upon SWE-bench
- Multiple Extensions: Multi-lingual, multi-modal, and domain-specific variants
Industry Adoption
Major technology companies use SWE-bench for:
- Model Development: Training and evaluating coding assistants
- Product Benchmarking: Comparing commercial offerings
- Research Direction: Identifying areas for improvement
- Hiring Assessment: Evaluating AI tool capabilities
Open Source Contributions
The SWE-bench ecosystem has fostered:
- Community Leaderboards: Public tracking of model performance
- Evaluation Tools: Open-source frameworks for running evaluations
- Dataset Extensions: Community-contributed task instances
- Agent Implementations: Diverse approaches to solving SWE-bench tasks
Limitations and Challenges
Despite its success, SWE-bench has known limitations:
Python-Centric Focus
Currently limited to Python repositories, not representing the full diversity of programming languages and paradigms used in industry.
Repository Selection Bias
The 12 selected repositories may not represent all software engineering domains and complexity levels.
Test Quality Dependency
Evaluation quality depends on the completeness and correctness of repository test suites.
Computational Requirements
Full evaluation requires significant computational resources, limiting accessibility for some researchers.
Future Directions
The SWE-bench team and community are working on several extensions:
Multi-Language Support
- Multi-SWE-bench: Extending to Java, JavaScript, and other languages[10]
- Cross-language tasks: Problems requiring polyglot programming skills
Enhanced Evaluation
- Human evaluation protocols: Supplementing automated metrics
- Code quality metrics: Beyond just functional correctness
- Security and performance: Evaluating non-functional requirements
Real-Time Evaluation
- Continuous benchmarking: Regular evaluation on fresh issues
- Live deployment testing: Evaluation in production-like environments
- User study integration: Incorporating developer feedback
Related Benchmarks
SWE-bench complements other code generation and software engineering benchmarks:
- HumanEval: Isolated Python programming problems
- MBPP: Mostly basic Python programming tasks
- CodeContests: Competitive programming challenges
- DS-1000: Data science coding problems
- RepoEval: Repository-level code completion
- CrossCodeEval: Cross-file code completion
See Also
- Software engineering
- Code generation
- Large language models
- AI agents
- GitHub
- Test-driven development
- Automated debugging
- Program synthesis
References
1. Jimenez, Carlos E., et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" arXiv preprint arXiv:2310.06770 (2023).
2. SWE-bench Official Website. https://www.swebench.com/. Accessed August 2025.
3. SWE-bench GitHub Repository. https://github.com/princeton-nlp/SWE-bench. Accessed August 2025.
4. OpenAI. "SWE-bench Verified: A Human-Validated Subset." 2024.
5. Princeton NLP. "SWE-bench Lite Documentation." 2024.
6. Yang, John, et al. "SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?" arXiv:2410.03859 (2024).
7. Jimenez, Carlos E., et al. "SWE-bench Goes Live!" arXiv:2505.23419 (2025).
8. Yang, John, et al. "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering." 2024.
9. ICLR 2024. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" Oral presentation.
10. Chen, et al. "Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving." arXiv:2504.02605 (2025).