SWE-bench (Software Engineering Benchmark) is an evaluation framework that tests whether AI systems can resolve real-world software engineering tasks drawn from actual GitHub issues and pull requests. Created by Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan at Princeton University's Language and Intelligence group, SWE-bench was introduced in October 2023 and published at ICLR 2024. The original dataset contains 2,294 task instances collected from 12 popular open-source Python repositories. Unlike code generation benchmarks such as HumanEval that test whether a model can write a single function, SWE-bench tests whether an AI agent can read a bug report or feature request, navigate a large codebase, identify the relevant files and functions, write a correct patch, and have that patch pass the project's test suite. This end-to-end evaluation makes SWE-bench one of the most realistic and demanding benchmarks for AI-assisted software engineering.
By 2023, AI code generation had made impressive progress on isolated programming problems. Models were scoring above 80% on HumanEval and tackling competitive programming challenges. But anyone who had used an AI coding assistant on a real software project knew there was a large gap between writing a single function from a docstring and doing actual software engineering.
Real software engineering involves understanding existing codebases with thousands of files, tracking down bugs that span multiple modules, writing fixes that do not break other functionality, and following project-specific conventions and patterns. None of the existing benchmarks captured this complexity.
Jimenez and colleagues at Princeton set out to create a benchmark that would bridge this gap. Their insight was to use the natural workflow of open-source development as a source of tasks: every resolved GitHub issue, paired with its fix and the tests that verify the fix, is essentially a software engineering problem with a known solution and an automatic grading rubric.
The construction of SWE-bench involved several stages.
The team selected 12 popular, actively maintained, well-tested Python repositories spanning different domains:
| Repository | Domain | Issues in dataset |
|---|---|---|
| django/django | Web framework | 850 |
| sympy/sympy | Symbolic mathematics | 386 |
| scikit-learn/scikit-learn | Machine learning | 229 |
| sphinx-doc/sphinx | Documentation generator | 187 |
| matplotlib/matplotlib | Data visualization | 184 |
| pytest-dev/pytest | Testing framework | 119 |
| pydata/xarray | Labeled arrays | 110 |
| astropy/astropy | Astronomy | 95 |
| pylint-dev/pylint | Code linting | 57 |
| psf/requests | HTTP library | 44 |
| mwaskom/seaborn | Statistical visualization | 22 |
| pallets/flask | Web microframework | 11 |
These repositories were chosen because they are large (tens of thousands to hundreds of thousands of lines of code), well-maintained, and have comprehensive test suites.
The dataset is heavily skewed toward a few repositories. Django alone accounts for 37% of all tasks, and the top three repositories (Django, SymPy, and scikit-learn) account for nearly 64% of the dataset. This concentration has implications for evaluation: a model's SWE-bench score is disproportionately influenced by its ability to work with Django's codebase, conventions, and testing patterns.
| Repository | % of total dataset | % of SWE-bench Verified |
|---|---|---|
| django/django | 37.1% | 46.2% |
| sympy/sympy | 16.8% | 15.0% |
| scikit-learn/scikit-learn | 10.0% | 6.4% |
| sphinx-doc/sphinx | 8.1% | 8.8% |
| matplotlib/matplotlib | 8.0% | 6.8% |
| All others | 20.0% | 16.8% |
The skew is even more pronounced in SWE-bench Verified, where Django accounts for 46.2% of tasks. This means that a model's SWE-bench Verified score is nearly half determined by its Django proficiency.
The team scraped roughly 90,000 pull requests from these 12 repositories. Each pull request was analyzed to identify whether it resolved one or more GitHub issues and whether it included new or modified unit tests that validated the fix. Task instances were retained only if they met both criteria: the PR had to reference a resolved issue, and it had to include test changes that could serve as the evaluation criterion.
After filtering, 2,294 task instances remained. Each task instance includes:

- the issue text (the "problem statement" shown to the agent),
- the repository and the base commit at which the issue was filed,
- the gold patch (the human-written fix from the merged pull request), and
- the test patch, containing FAIL_TO_PASS tests (which fail before the fix and pass after) and PASS_TO_PASS tests (which pass both before and after, guarding against regressions).
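The shape of a task instance can be sketched as a dataclass. The field names below follow the published Hugging Face release of the dataset (which uses upper-case FAIL_TO_PASS/PASS_TO_PASS columns holding JSON-encoded lists, simplified here to plain lists); the example values other than the instance ID and repository are illustrative, not the real task contents:

```python
from dataclasses import dataclass, field


@dataclass
class TaskInstance:
    """One SWE-bench task, loosely mirroring the released dataset schema."""
    instance_id: str        # e.g. "django__django-14725"
    repo: str               # e.g. "django/django"
    base_commit: str        # commit SHA the agent starts from (pre-fix)
    problem_statement: str  # the GitHub issue text shown to the agent
    patch: str              # gold patch: the human-written fix
    test_patch: str         # tests added/modified by the fixing PR
    fail_to_pass: list[str] = field(default_factory=list)  # must flip to passing
    pass_to_pass: list[str] = field(default_factory=list)  # must keep passing


# Illustrative instance; test names and placeholder values are invented.
task = TaskInstance(
    instance_id="django__django-14725",
    repo="django/django",
    base_commit="<sha>",
    problem_statement="Model formsets should provide a way to disallow "
                      "new object creation.",
    patch="<unified diff>",
    test_patch="<unified diff adding tests>",
    fail_to_pass=["test_edit_only_formset_disallows_add"],
    pass_to_pass=["test_basic_formset"],
)
```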
When an AI agent tackles a SWE-bench task, it receives the issue description and access to the repository at the pre-fix commit. The agent must:

- read the issue and locate the relevant code,
- devise and implement a fix across whatever files it touches, and
- verify that the fix resolves the issue without breaking existing behavior.
The agent's output is a unified diff patch. Evaluation is fully automated: the patch is applied to the repository, the test suite is run in a Docker container, and the task is marked as resolved only if all relevant tests pass.
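The pass/fail criterion can be captured as a pure function over test results. The function name and status strings below are an illustrative sketch of the grading logic, not the official harness (which runs the suite inside Docker and parses real test logs):

```python
def is_resolved(test_results: dict[str, str],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """Grade one task from a {test_name: status} map produced by a test run.

    A task counts as resolved only if every FAIL_TO_PASS test (broken before
    the patch) now passes AND every PASS_TO_PASS test (working before the
    patch) still passes. Any failure, error, or missing result counts
    against the patch.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_results.get(name) == "PASSED" for name in required)


# A patch that fixes the bug without regressions:
results = {"test_edit_only": "PASSED", "test_existing_behavior": "PASSED"}
print(is_resolved(results, ["test_edit_only"], ["test_existing_behavior"]))  # True

# A patch that fixes the bug but breaks something else:
results["test_existing_behavior"] = "FAILED"
print(is_resolved(results, ["test_edit_only"], ["test_existing_behavior"]))  # False
```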
SWE-bench tasks span a wide range of difficulty:
| Difficulty Level | Characteristics | Approximate proportion |
|---|---|---|
| Simple bug fixes | Single-line changes, obvious from issue description | ~15-20% |
| Moderate fixes | Multi-line changes in one file, requires code understanding | ~35-40% |
| Complex fixes | Changes across multiple files, requires deep codebase knowledge | ~25-30% |
| Feature additions | New functionality requiring architectural understanding | ~15-20% |
The simplest tasks might involve fixing a typo in an error message or correcting an off-by-one error. The hardest tasks might require understanding the interaction between multiple subsystems, adding new configuration options, or refactoring existing code to support a new use case.
SWE-bench tasks require more than a language model generating code. They require an agent that can interact with a codebase, explore files, run tests, and iterate on solutions. Understanding these agent architectures is essential for interpreting SWE-bench scores.
A SWE-bench agent follows a workflow that mirrors how a human developer approaches an unfamiliar bug report:
Step 1: Issue analysis. The agent reads the issue description and extracts key information: what is broken, what the expected behavior should be, any error messages or stack traces mentioned, and any hints about which part of the codebase is involved.
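A quoted Python traceback, for instance, already names files, lines, and functions, and a small regex pulls these localization hints out. The issue snippet below is invented for illustration:

```python
import re

# Matches frames like: File "pkg/module.py", line 42, in func_name
TRACEBACK_FRAME = re.compile(r'File "([^"]+)", line (\d+), in (\w+)')


def localization_hints(issue_text: str) -> list[tuple[str, int, str]]:
    """Pull (file, line, function) hints from any Python traceback quoted
    in an issue body -- often the fastest route to the buggy code."""
    return [(path, int(line), func)
            for path, line, func in TRACEBACK_FRAME.findall(issue_text)]


issue = '''Calling .resolve() crashes:
Traceback (most recent call last):
  File "django/urls/resolvers.py", line 694, in resolve
    raise Resolver404({'path': path})
'''
print(localization_hints(issue))
# [('django/urls/resolvers.py', 694, 'resolve')]
```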
Step 2: Codebase exploration. The agent navigates the repository to build context. This typically involves:

- searching the codebase for keywords, class names, and error messages taken from the issue,
- reading the files those searches surface, along with their imports and callers, and
- inspecting related tests to understand the expected behavior.
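A minimal sketch of the keyword-search heuristic, using only the standard library (the function name and scoring rule are ours, not any particular agent's implementation):

```python
from pathlib import Path


def rank_files_by_keywords(repo_root: str, keywords: list[str],
                           top_k: int = 5) -> list[str]:
    """Rank Python files by how many issue keywords they mention.

    A crude but often effective localization heuristic: count
    case-insensitive keyword hits per file and return the best matches.
    """
    scores = []
    for path in Path(repo_root).rglob("*.py"):
        try:
            text = path.read_text(encoding="utf-8", errors="ignore").lower()
        except OSError:
            continue  # unreadable file; skip it
        score = sum(text.count(kw.lower()) for kw in keywords)
        if score:
            scores.append((score, str(path)))
    # Highest-scoring files first.
    return [p for _, p in sorted(scores, reverse=True)[:top_k]]
```

Real agents layer richer tools on top of this idea (AST-aware search, call-graph traversal, embedding retrieval), but keyword grep over the repository remains a common first step.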
Step 3: Root cause identification. Based on the issue description and code exploration, the agent identifies the specific code that needs to change. For a bug fix, this means tracing the execution path that leads to the bug. For a feature request, it means understanding where in the codebase the new functionality should be added.
Step 4: Bug reproduction (advanced agents). More sophisticated agents attempt to reproduce the bug before fixing it. They write a small script that triggers the reported behavior, run it to confirm the failure, and use the script later to verify their fix. This step significantly improves patch quality.
Step 5: Patch generation. The agent writes the actual code changes as a unified diff patch. This often requires changes to multiple files and careful handling of edge cases.
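The unified-diff format itself is standard, and Python's difflib can produce one. The snippet below sketches a patch adding a hypothetical edit_only flag to a simplified, invented formset_factory (not Django's actual source):

```python
import difflib

# Simplified "before" and "after" versions of an imaginary source file.
before = '''def formset_factory(form, extra=1, can_delete=False):
    return type("FormSet", (BaseFormSet,), {"extra": extra})
'''
after = '''def formset_factory(form, extra=1, can_delete=False, edit_only=False):
    return type("FormSet", (BaseFormSet,), {"extra": extra, "edit_only": edit_only})
'''

# Produce a unified diff in the a/<path> b/<path> convention used by git.
patch = "".join(difflib.unified_diff(
    before.splitlines(keepends=True),
    after.splitlines(keepends=True),
    fromfile="a/django/forms/formsets.py",
    tofile="b/django/forms/formsets.py",
))
print(patch)
```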
Step 6: Self-testing and refinement. The most effective agents run the test suite (or relevant subsets) before submitting their patch, using test results to iteratively refine their solution. If tests fail, the agent analyzes the failure, modifies the patch, and retries. This iterative loop can run for multiple cycles.
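The loop in Step 6 can be sketched with stub callables standing in for the model call and the sandboxed test run (the interface names here are ours, for illustration):

```python
def refine_until_passing(generate_patch, run_tests, max_attempts=3):
    """Iterate: propose a patch, run the tests, feed failures back.

    `generate_patch(feedback)` stands in for the model call and
    `run_tests(patch)` for the sandboxed test run; `run_tests` returns a
    list of failing test names (empty list = success).
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        patch = generate_patch(feedback)
        failures = run_tests(patch)
        if not failures:
            return patch, attempt  # success on this attempt
        feedback = f"Attempt {attempt} failed: {failures}"
    return None, max_attempts  # gave up


# Stub model: the first patch breaks a regression test, the second passes.
attempts = iter(["patch-v1", "patch-v2"])
patch, n = refine_until_passing(
    generate_patch=lambda fb: next(attempts),
    run_tests=lambda p: [] if p == "patch-v2" else ["test_existing_behavior"],
)
print(patch, n)  # patch-v2 2
```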
Several agent frameworks have been developed specifically for SWE-bench:
| Framework | Developer | Key Features |
|---|---|---|
| SWE-agent | Princeton/Stanford | Custom shell interface, constrained editing with linting, context management |
| Devin | Cognition Labs | Full development environment, browser access, long-term planning |
| OpenHands (formerly OpenDevin) | Open source community | Modular architecture, multiple tool backends |
| Auggie CLI | Augment Code | Proprietary scaffolding, optimized for long-horizon tasks |
| Mini-SWE-agent | Princeton | Minimalist 100-line agent using only bash, scores >74% on Verified |
A notable recent development is the minimalist approach exemplified by Mini-SWE-agent, which achieves competitive scores (over 74% on SWE-bench Verified) using only bash commands and no custom tools. This suggests that the quality of the underlying language model matters more than the sophistication of the tool framework.
The same language model can achieve dramatically different SWE-bench scores depending on the agent scaffolding wrapped around it. Scaffolding differences include:

- which tools the agent can call (file search, editors, shell access, test runners),
- how repository context is retrieved and fit into the model's context window,
- how many attempts or retry loops the agent is allowed per task, and
- whether the agent can run the test suite before submitting its patch.
This scaffolding variance is one of SWE-bench's most significant confounds: it is difficult to isolate the contribution of the model from the contribution of the scaffolding. SWE-bench Pro's SEAL (Standardized Evaluation for Agentic LLMs) framework partially addresses this by providing a common scaffolding that all models are evaluated with.
SWE-bench Lite is a 300-instance subset of the full benchmark, created to make evaluation faster and more accessible. It focuses on functional bug fixes (excluding feature additions) and covers 11 of the 12 original repositories. SWE-bench Lite was popular in early evaluations but has since been superseded by SWE-bench Verified.
In August 2024, OpenAI collaborated with the SWE-bench team to create SWE-bench Verified, a human-validated subset of 500 instances. Professional software developers reviewed each task to ensure:

- the issue description is well-specified enough to be solved without extra context,
- the evaluation tests actually validate the fix without being overly narrow, and
- the task's environment can be set up and run reliably.
This human validation was important because the original SWE-bench dataset contained some problematic instances: issues with ambiguous descriptions, tests that did not properly validate the fix, or tasks that required information not present in the issue text. By filtering these out, SWE-bench Verified provided a cleaner evaluation signal.
SWE-bench Verified became the most widely reported version of the benchmark from late 2024 through 2025.
The following table shows the top-performing systems on SWE-bench Verified as of early 2026.
| Rank | System | Organization | Resolve Rate |
|---|---|---|---|
| 1 | Claude Opus 4.5 (with agent scaffold) | Anthropic | 80.9% |
| 2 | Claude Opus 4.6 | Anthropic | 80.8% |
| 3 | Gemini 3.1 Pro | Google DeepMind | 80.6% |
| 4 | MiniMax M2.5 | MiniMax | 80.2% |
| 5 | GPT-5.2 | OpenAI | 80.0% |
| 6 | Claude Sonnet 4.6 | Anthropic | 79.6% |
| 7 | Sonar Foundation Agent | SonarSource | 79.2% |
| 8 | Gemini 3 Flash | Google DeepMind | 78.0% |
| 9 | GLM-5 | Zhipu AI | 77.8% |
| 10 | Kimi K2.5 | Moonshot AI | 76.8% |
Scores on SWE-bench Verified have risen dramatically since the benchmark's introduction. Early 2024 saw top scores around 20-30%; by mid-2025, multiple systems had crossed 70%; and by early 2026, the top cluster of models was packed tightly between 76% and 81%.
The following table traces the evolution of top scores across SWE-bench variants from 2023 to 2026:
| Date | System | Variant | Score | Significance |
|---|---|---|---|---|
| Oct 2023 | Claude 2 (retrieval-based) | Full | 3.79% | Initial baseline |
| Mar 2024 | Devin | Lite | 13.86% | Generated massive media attention |
| Apr 2024 | SWE-agent + GPT-4 | Lite | 18.0% | Open-source agent framework |
| Aug 2024 | OpenAI internal | Verified | ~33% | Launched Verified subset |
| Oct 2024 | Claude 3.5 Sonnet (new) | Verified | 49.0% | First model to approach 50% on Verified |
| Jan 2025 | Multiple systems | Verified | ~60-65% | Rapid improvement wave |
| Mid 2025 | Multiple systems | Verified | ~70-75% | Approaching saturation |
| Early 2026 | Claude Opus 4.5 | Verified | 80.9% | Current state-of-the-art |
The progression from 3.79% to over 80% in approximately two years represents one of the fastest capability improvements in AI benchmark history.
A significant caveat emerged in late 2025 when researchers discovered evidence of data contamination in SWE-bench Verified. OpenAI reported that every frontier model tested (including GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash) could reproduce verbatim gold patches or problem-statement-specific details for certain SWE-bench Verified tasks. This means that some of the benchmark's tasks (or very similar code) appeared in the models' training data, artificially inflating scores.
OpenAI's investigation revealed several concrete examples of contamination:
The django__django-14725 case. In this task, the tests require a specific new parameter called edit_only, which the problem statement never explicitly asks for. While solving the task, GPT-5.2's chain of thought revealed knowledge of the Django release notes describing this change, correctly identifying that the edit_only parameter was introduced in Django 4.1. That knowledge could only have come from training data, not from the issue description.
Verbatim patch reproduction. GPT-5.2 was observed outputting exact gold patches from short snippets of task descriptions, including exact class and method names and specific implementation details (like new early return conditions) that were not derivable from the issue alone.
Systematic assessment. To assess the scale of contamination, OpenAI created an automated red-teaming setup, tasking GPT-5 with probing GPT-5.2-Chat, Claude Opus 4.5, and Gemini 3 Flash Preview for contamination across each SWE-bench Verified question. The results showed that GPT-5.2 could solve 31 tasks that OpenAI identified as "almost impossible to solve" without contamination.
Beyond contamination, analysis revealed other issues with SWE-bench Verified scores, notably the reliance on self-reported leaderboard results and the difficulty of separating model capability from agent scaffolding, both discussed further under criticisms below.
This contamination finding contributed to the development of SWE-bench Pro, which was designed from the ground up to be contamination-resistant.
In late 2025, Scale AI introduced SWE-bench Pro, a next-generation software engineering benchmark that addresses several fundamental limitations of the original SWE-bench and SWE-bench Verified.
SWE-bench Pro contains 1,865 tasks sourced from 41 actively maintained repositories spanning multiple programming languages. This is a major expansion from the original benchmark's 12 Python-only repositories. The benchmark is partitioned into three subsets:
| Subset | Tasks | Repositories | Description |
|---|---|---|---|
| Public | 731 | 11 open-source (GPL-licensed) | Freely accessible for research |
| Held-out | 858 | 12 repositories | Reserved for leaderboard evaluation |
| Private (Commercial) | 276 | 18 proprietary startup codebases | Partnership with private companies |
The use of GPL-licensed code for the public set is a deliberate contamination-resistance strategy: the strong copyleft license acts as a legal deterrent against including this code in model training data.
SWE-bench Pro addresses four limitations that plagued earlier versions:

- Contamination: tasks are drawn from copyleft-licensed and private codebases that are unlikely to appear in training data.
- Language coverage: tasks extend beyond the Python-only scope of the original benchmark.
- Saturation: tasks are substantially harder, leaving headroom above frontier-model performance.
- Scaffolding variance: the SEAL leaderboard evaluates all models with a common, standardized scaffold.
Performance on SWE-bench Pro is substantially lower than on Verified, reflecting both the harder tasks and the absence of contamination. The following table shows the top systems on the public dataset with standardized scaffolding (SEAL) as of early 2026.
| Rank | System | Organization | Resolve Rate |
|---|---|---|---|
| 1 | Claude Opus 4.5 | Anthropic | 45.9% |
| 2 | Claude 4.5 Sonnet | Anthropic | 43.6% |
| 3 | Gemini 3 Pro Preview | Google DeepMind | 43.3% |
| 4 | Claude 4 Sonnet | Anthropic | 42.7% |
| 5 | GPT-5 (High) | OpenAI | 41.8% |
| 6 | GPT-5.2 Codex | OpenAI | 41.0% |
| 7 | Claude 4.5 Haiku | Anthropic | 39.5% |
| 8 | Qwen3-Coder 480B | Alibaba | 38.7% |
| 9 | MiniMax 2.1 | MiniMax | 36.8% |
| 10 | Gemini 3 Flash | Google DeepMind | 34.6% |
The fact that the best models solve about 46% of SWE-bench Pro tasks versus over 80% on Verified illustrates both the harder nature of the tasks and the likely contamination effects on the older benchmark.
With optimized agent scaffolding beyond the standardized SEAL evaluation, some systems have achieved higher scores. Augment Code's Auggie CLI reported 51.8% on SWE-bench Pro, and OpenAI reported 57% for GPT-5.3 Codex with their internal scaffolding.
SWE-bench fundamentally changed how the AI community thinks about code generation evaluation. Before SWE-bench, the field was dominated by function-level benchmarks like HumanEval and MBPP, which tested a narrow and artificial skill. SWE-bench showed that real software engineering is qualitatively different from writing isolated functions, and it provided the first rigorous way to measure progress on this harder task.
Several specific impacts stand out:
Agentic evaluation. SWE-bench was one of the first benchmarks to require an agent (not just a model) to interact with an environment over multiple steps. This pushed the field toward building and evaluating AI agents rather than just language models.
Industry adoption. SWE-bench scores became a standard metric in press releases and marketing materials for AI coding products. When Cognition Labs launched Devin in early 2024 with a SWE-bench score of 13.86%, it generated enormous attention. When Claude 3.5 Sonnet achieved over 49% later that year, it was headline news. The benchmark gave companies a concrete, legible number to compete on.
Research direction. SWE-bench influenced research priorities by highlighting that code generation was only part of the problem. Issues like repository understanding, long-context reasoning, multi-file editing, and test-driven development became active research areas largely because SWE-bench made them measurable.
Benchmark evolution. The trajectory from SWE-bench to SWE-bench Lite to SWE-bench Verified to SWE-bench Pro illustrates a broader pattern in AI evaluation: benchmarks get saturated or contaminated, prompting the creation of harder and cleaner successors. Each iteration has pushed the frontier of what AI coding evaluation means.
SWE-bench has also had a direct influence on the commercial AI coding landscape, with vendors of AI coding assistants and agents competing publicly on its scores in product launches and marketing.
SWE-bench exists within a growing ecosystem of software engineering benchmarks.
| Benchmark | Focus | Key difference from SWE-bench |
|---|---|---|
| HumanEval | Single-function code generation | Much simpler, isolated problems |
| GPQA | Graduate-level science Q&A | Tests knowledge, not coding |
| SWE-bench Multimodal | GUI-based software tasks | Includes visual understanding |
| SWE-bench Live | Continuously updated tasks | Resists contamination via freshness |
| SWE-rebench | Independent reproduction | Verifies reproducibility of scores |
| BigCodeBench | Complex function calls with libraries | Tests API and tool usage |
| SWE-EVO | Evolving software benchmark | Tests adaptation to changing codebases |
Despite its influence, SWE-bench has faced criticism on several fronts.
Self-reported scores. The SWE-bench leaderboard relies heavily on self-reported results. As of early 2026, none of the 77 entries on the SWE-bench Verified leaderboard were independently verified. This creates opportunities for cherry-picking favorable configurations or evaluation conditions.
Scaffolding variance. Performance on SWE-bench depends heavily on the agent scaffolding, not just the underlying model. The same model can achieve very different scores depending on how it explores the repository, how many attempts it gets, and how it uses tool calls. This makes it hard to compare models fairly unless the scaffolding is standardized (which SWE-bench Pro's SEAL evaluation partially addresses).
Cost sensitivity. Some top scores require running each task through expensive extended reasoning or multiple retry loops, costing several dollars per task. Whether a system that spends $50 to fix a bug is practically useful is a separate question from whether it can do it at all.
Python bias (original). The original SWE-bench and SWE-bench Verified are Python-only, which means they do not test performance on the many languages used in real software development. SWE-bench Pro addresses this by spanning multiple languages.
Django dominance. With Django accounting for 37-46% of tasks depending on the variant, SWE-bench scores are disproportionately influenced by a model's ability to work with one specific codebase. A model that is excellent at Django but mediocre at everything else could achieve a misleadingly high score.
Issue description quality. Some issue descriptions in the original dataset are vague, reference external resources, or assume context that only a project maintainer would have. While SWE-bench Verified addressed the worst cases, the quality of issue descriptions remains variable.
As of early 2026, SWE-bench remains the most influential benchmark for AI software engineering. The landscape includes several active variants: SWE-bench Verified is still the most widely cited number, SWE-bench Pro serves as its harder, contamination-resistant successor, and SWE-bench Multimodal and SWE-bench Live extend the family to GUI-based tasks and continuously refreshed tasks, respectively.
The benchmark's trajectory mirrors the broader story of AI coding: from a research curiosity in 2023, when the best models solved less than 4% of tasks, to a competitive arena where frontier systems routinely solve 40-80% of tasks depending on the variant. The gap between SWE-bench Verified scores (80%+) and SWE-bench Pro scores (40-50%) serves as a useful reminder that headline benchmark numbers should be interpreted with care.