A benchmark for evaluating large language models and AI agents on real-world software engineering tasks from GitHub
Release date
2023-10-10
Latest version
SWE-bench Live
Benchmark updated
2025-08-19
Authors
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan
Organization
Princeton University, University of Chicago, Stanford University
SWE-bench (Software Engineering Benchmark) is a comprehensive benchmark designed to evaluate large language models and AI agents on their ability to solve real-world software engineering tasks. Released on October 10, 2023, by researchers at Princeton University, SWE-bench tests whether AI systems can autonomously resolve genuine GitHub issues from popular open-source repositories. The benchmark has become the de facto standard for evaluating AI-powered software engineering capabilities, with over 2 million downloads and adoption by leading AI research organizations worldwide.[1][2]
Overview
SWE-bench represents a paradigm shift in evaluating code generation capabilities of AI systems. Unlike traditional benchmarks that focus on isolated coding problems, SWE-bench presents AI agents with complete codebases and actual bug reports or feature requests from real software projects. This approach tests not just code writing abilities, but also code comprehension, debugging, testing, and the ability to navigate complex software architectures.[1]
The benchmark addresses a critical gap in AI evaluation by measuring performance on tasks that professional software engineers encounter daily. Each task in SWE-bench requires understanding issue descriptions, identifying relevant files in large codebases, implementing appropriate fixes, and ensuring that all tests pass, mirroring the complete software development workflow.
Key Characteristics
SWE-bench distinguishes itself through several unique features:
Real-world authenticity: All tasks are derived from actual GitHub issues and their corresponding pull requests
Execution-based evaluation: Solutions are validated using the repository's own test suites, not just code similarity metrics
Multi-file coordination: Tasks often require changes across multiple files, classes, and functions
Large context handling: AI agents must process repositories with thousands of files and millions of lines of code
Continuous updates: New instances can be added to prevent training data contamination
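Concretely, each benchmark item is a structured record. The sketch below is illustrative only: the field names (`repo`, `base_commit`, `problem_statement`, `patch`, `FAIL_TO_PASS`, `PASS_TO_PASS`) follow the schema published with the dataset, but the values shown are invented for the example.

```python
# Illustrative shape of a single SWE-bench task instance.
# Field names follow the published dataset schema; the values are invented.
example_instance = {
    "instance_id": "example__repo-1234",            # conventionally "<org>__<repo>-<PR number>"
    "repo": "example/repo",                          # source GitHub repository
    "base_commit": "d16bfe05a744909de4b27f5875fe0d", # repository state before the fix
    "problem_statement": "Calling frobnicate() with an empty list raises TypeError ...",
    "patch": "diff --git a/src/frobnicate.py b/src/frobnicate.py ...",  # gold fix (hidden from agents)
    "FAIL_TO_PASS": ["test_frobnicate_empty_list"],  # tests the fix must make pass
    "PASS_TO_PASS": ["test_frobnicate_basic"],       # regression tests that must stay green
}

# An agent is given the repository at `base_commit` plus the problem
# statement, and must produce its own patch; the gold `patch` is used
# only when constructing and validating the benchmark itself.
print(sorted(example_instance))
```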
Methodology
Task Construction
SWE-bench tasks are constructed through a systematic process:[1]
Issue Selection: Real issues from popular Python repositories are identified along with their corresponding pull requests that resolved them
Test Identification: The benchmark identifies tests that transition from failing to passing when the fix is applied (FAIL_TO_PASS tests)
Environment Setup: Each task includes the exact repository state before the fix was applied
Validation: Solutions are verified using both the specific fix tests and regression tests (PASS_TO_PASS tests)
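The test-identification step above can be sketched as a comparison of test outcomes before and after the gold patch is applied. The function below is a simplified illustration of that partitioning logic, not the benchmark's actual construction code; the test names and statuses are invented.

```python
def classify_tests(status_before: dict, status_after: dict) -> dict:
    """Partition tests by how their outcome changes when the gold fix is applied.

    status_before / status_after map test names to "PASS" or "FAIL", as
    observed on the repository at base_commit and after applying the patch.
    """
    fail_to_pass = [t for t in status_after
                    if status_before.get(t) == "FAIL" and status_after[t] == "PASS"]
    pass_to_pass = [t for t in status_after
                    if status_before.get(t) == "PASS" and status_after[t] == "PASS"]
    return {"FAIL_TO_PASS": sorted(fail_to_pass), "PASS_TO_PASS": sorted(pass_to_pass)}

before = {"test_bug": "FAIL", "test_regression": "PASS", "test_unrelated": "FAIL"}
after  = {"test_bug": "PASS", "test_regression": "PASS", "test_unrelated": "FAIL"}
print(classify_tests(before, after))
```

Here `test_bug` becomes a FAIL_TO_PASS test and `test_regression` a PASS_TO_PASS test, while `test_unrelated` fails both before and after and is excluded from both sets.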
Evaluation Framework
The evaluation process employs a sophisticated infrastructure:[3]
Containerized Environments: Each evaluation runs in an isolated Docker container with the appropriate dependencies
Automated Testing: Solutions are automatically tested using the repository's test suite
Time Limits: Agents typically have 45 minutes to complete each task
Reset Mechanism: All files are reset to their original state after each agent run
The primary evaluation metric is the % Resolved rate: the percentage of tasks where the agent successfully implements a solution that passes all required tests.
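Under this metric, an instance counts as resolved only when every FAIL_TO_PASS and every PASS_TO_PASS test passes after the agent's patch is applied. A minimal sketch of the computation follows; the run data is invented for illustration.

```python
def is_resolved(test_results: dict, fail_to_pass: list, pass_to_pass: list) -> bool:
    """An instance is resolved iff all required tests pass after the agent's patch."""
    required = fail_to_pass + pass_to_pass
    return all(test_results.get(t) == "PASS" for t in required)

def percent_resolved(runs: list) -> float:
    """runs: one (test_results, FAIL_TO_PASS, PASS_TO_PASS) tuple per instance."""
    resolved = sum(is_resolved(*run) for run in runs)
    return 100.0 * resolved / len(runs)

runs = [
    ({"t_fix": "PASS", "t_reg": "PASS"}, ["t_fix"], ["t_reg"]),  # resolved
    ({"t_fix": "PASS", "t_reg": "FAIL"}, ["t_fix"], ["t_reg"]),  # broke a regression test
    ({"t_fix": "FAIL", "t_reg": "PASS"}, ["t_fix"], ["t_reg"]),  # fix did not work
    ({"t_fix": "PASS", "t_reg": "PASS"}, ["t_fix"], ["t_reg"]),  # resolved
]
print(f"{percent_resolved(runs):.1f}% resolved")  # → 50.0% resolved
```

Note that an agent whose patch fixes the issue but breaks an existing regression test (the second run above) still scores zero for that instance.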
Benchmark Variants
SWE-bench has evolved into multiple specialized variants to address different evaluation needs:
SWE-bench Full
The original benchmark containing 2,294 task instances across 12 popular Python repositories. This represents the most comprehensive and challenging evaluation set, requiring agents to handle the full complexity of real-world software development.[1]
SWE-bench Verified
Released in collaboration with OpenAI, this variant contains 500 human-validated instances verified by 93 experienced Python developers. Each task was carefully reviewed to ensure:[4]
Clear problem descriptions
Unambiguous solutions
Proper test coverage
Reasonable difficulty levels
SWE-bench Lite
A curated subset of 300 instances designed for more efficient evaluation. These tasks focus on self-contained functional bug fixes that can be resolved with targeted code changes. SWE-bench Lite has become popular for rapid prototyping and frequent evaluation cycles.[5]
SWE-bench Multimodal
Introduced in October 2024, this variant contains 517 instances that include visual elements such as:[6]
Diagrams and charts
UI screenshots
Error visualizations
Architecture diagrams
This variant tests whether AI systems can integrate visual information when solving software engineering tasks.
SWE-bench Live
The most recent variant containing 1,319 task instances created after 2024, covering 93 repositories. This ensures evaluation on problems that couldn't have been in any model's training data.[7]
Task Categories and Complexity
SWE-bench tasks span various categories of software engineering challenges:
Bug Fixes
The majority of tasks involve fixing bugs in existing code. These range from simple logic errors to complex multi-component issues requiring deep understanding of system architecture.
Feature Implementation
Some tasks require implementing new features based on user requests, testing the ability to extend existing codebases while maintaining compatibility.
Performance Optimization
Tasks may involve improving code efficiency, reducing memory usage, or optimizing algorithms while maintaining correctness.
Test Writing
Certain instances require writing or improving test cases, evaluating understanding of test-driven development practices.
Documentation Updates
Some tasks involve updating documentation to match code changes, testing comprehensive software maintenance abilities.
Performance Results
Current Leaderboard (August 2025)
Performance on SWE-bench has improved dramatically since its release.
References
Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2024). "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" International Conference on Learning Representations (ICLR 2024). arXiv:2310.06770.