SWE-bench (Software Engineering Benchmark) is an evaluation framework that tests whether AI systems can resolve real-world software engineering tasks drawn from actual GitHub issues and pull requests. Created by Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan at Princeton University's Language and Intelligence group, SWE-bench was introduced in October 2023 and published at ICLR 2024. The original dataset contains 2,294 task instances collected from 12 popular open-source Python repositories. Unlike code generation benchmarks such as HumanEval that test whether a model can write a single function, SWE-bench tests whether an AI agent can read a bug report or feature request, navigate a large codebase, identify the relevant files and functions, write a correct patch, and have that patch pass the project's test suite. This end-to-end evaluation makes SWE-bench one of the most realistic and demanding benchmarks for AI-assisted software engineering.
By 2023, AI code generation had made impressive progress on isolated programming problems. Models were scoring above 80% on HumanEval and tackling competitive programming challenges. But anyone who had used an AI coding assistant on a real software project knew there was a large gap between writing a single function from a docstring and doing actual software engineering.
Real software engineering involves understanding existing codebases with thousands of files, tracking down bugs that span multiple modules, writing fixes that do not break other functionality, and following project-specific conventions and patterns. None of the existing benchmarks captured this complexity.
Jimenez and colleagues at Princeton set out to create a benchmark that would bridge this gap. Their insight was to use the natural workflow of open-source development as a source of tasks: every resolved GitHub issue, paired with its fix and the tests that verify the fix, is essentially a software engineering problem with a known solution and an automatic grading rubric.
The construction of SWE-bench involved several stages.
The team selected 12 popular, actively maintained, well-tested Python repositories spanning different domains:
| Repository | Domain | Issues in dataset |
|---|---|---|
| django/django | Web framework | 850 |
| sympy/sympy | Symbolic mathematics | 386 |
| scikit-learn/scikit-learn | Machine learning | 229 |
| sphinx-doc/sphinx | Documentation generator | 187 |
| matplotlib/matplotlib | Data visualization | 184 |
| pytest-dev/pytest | Testing framework | 119 |
| pydata/xarray | Labeled arrays | 110 |
| astropy/astropy | Astronomy | 95 |
| pylint-dev/pylint | Code linting | 57 |
| psf/requests | HTTP library | 44 |
| mwaskom/seaborn | Statistical visualization | 22 |
| pallets/flask | Web microframework | 11 |
These repositories were chosen because they are large (tens of thousands to hundreds of thousands of lines of code), well-maintained, and have comprehensive test suites.
The dataset is heavily skewed toward a few repositories. Django alone accounts for 37% of all tasks, and the top three repositories (Django, SymPy, and scikit-learn) account for nearly 64% of the dataset. This concentration has implications for evaluation: a model's SWE-bench score is disproportionately influenced by its ability to work with Django's codebase, conventions, and testing patterns.
| Repository | % of total dataset | % of SWE-bench Verified |
|---|---|---|
| django/django | 37.1% | 46.2% |
| sympy/sympy | 16.8% | 15.0% |
| scikit-learn/scikit-learn | 10.0% | 6.4% |
| sphinx-doc/sphinx | 8.1% | 8.8% |
| matplotlib/matplotlib | 8.0% | 6.8% |
| All others | 20.0% | 16.8% |
The skew is even more pronounced in SWE-bench Verified, where Django accounts for 46.2% of tasks. This means that a model's SWE-bench Verified score is nearly half determined by its Django proficiency.
The team scraped roughly 90,000 pull requests from these 12 repositories. Each pull request was analyzed to identify whether it resolved one or more GitHub issues and whether it included new or modified unit tests that validated the fix. Task instances were retained only if they met both criteria: the PR had to reference a resolved issue, and it had to include test changes that could serve as the evaluation criterion.
After filtering, 2,294 task instances remained. Each task instance includes:

- the issue text (the "problem statement" shown to the agent),
- the repository and the base commit at which the issue was filed,
- the gold patch (the human-written fix from the merged pull request), and
- the test patch, containing FAIL_TO_PASS tests (which fail before the fix and pass after) and PASS_TO_PASS tests (which pass both before and after, guarding against regressions).
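The shape of a task instance can be sketched as a dataclass. The field names below follow the published Hugging Face release of the dataset (which uses upper-case FAIL_TO_PASS/PASS_TO_PASS columns holding JSON-encoded lists, simplified here to plain lists); the example values other than the instance ID and repository are illustrative, not the real task contents:

```python
from dataclasses import dataclass, field


@dataclass
class TaskInstance:
    """One SWE-bench task, loosely mirroring the released dataset schema."""
    instance_id: str        # e.g. "django__django-14725"
    repo: str               # e.g. "django/django"
    base_commit: str        # commit SHA the agent starts from (pre-fix)
    problem_statement: str  # the GitHub issue text shown to the agent
    patch: str              # gold patch: the human-written fix
    test_patch: str         # tests added/modified by the fixing PR
    fail_to_pass: list[str] = field(default_factory=list)  # must flip to passing
    pass_to_pass: list[str] = field(default_factory=list)  # must keep passing


# Illustrative instance; test names and placeholder values are invented.
task = TaskInstance(
    instance_id="django__django-14725",
    repo="django/django",
    base_commit="<sha>",
    problem_statement="Model formsets should provide a way to disallow "
                      "new object creation.",
    patch="<unified diff>",
    test_patch="<unified diff adding tests>",
    fail_to_pass=["test_edit_only_formset_disallows_add"],
    pass_to_pass=["test_basic_formset"],
)
```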
When an AI agent tackles a SWE-bench task, it receives the issue description and access to the repository at the pre-fix commit. The agent must:

- read the issue and locate the relevant code,
- devise and implement a fix across whatever files it touches, and
- verify that the fix resolves the issue without breaking existing behavior.
The agent's output is a unified diff patch. Evaluation is fully automated: the patch is applied to the repository, the test suite is run in a Docker container, and the task is marked as resolved only if all relevant tests pass.
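The pass/fail criterion can be captured as a pure function over test results. The function name and status strings below are an illustrative sketch of the grading logic, not the official harness (which runs the suite inside Docker and parses real test logs):

```python
def is_resolved(test_results: dict[str, str],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """Grade one task from a {test_name: status} map produced by a test run.

    A task counts as resolved only if every FAIL_TO_PASS test (broken before
    the patch) now passes AND every PASS_TO_PASS test (working before the
    patch) still passes. Any failure, error, or missing result counts
    against the patch.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_results.get(name) == "PASSED" for name in required)


# A patch that fixes the bug without regressions:
results = {"test_edit_only": "PASSED", "test_existing_behavior": "PASSED"}
print(is_resolved(results, ["test_edit_only"], ["test_existing_behavior"]))  # True

# A patch that fixes the bug but breaks something else:
results["test_existing_behavior"] = "FAILED"
print(is_resolved(results, ["test_edit_only"], ["test_existing_behavior"]))  # False
```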
SWE-bench tasks span a wide range of difficulty:
| Difficulty Level | Characteristics | Approximate proportion |
|---|---|---|
| Simple bug fixes | Single-line changes, obvious from issue description | ~15-20% |
| Moderate fixes | Multi-line changes in one file, requires code understanding | ~35-40% |
| Complex fixes | Changes across multiple files, requires deep codebase knowledge | ~25-30% |
| Feature additions | New functionality requiring architectural understanding | ~15-20% |
The simplest tasks might involve fixing a typo in an error message or correcting an off-by-one error. The hardest tasks might require understanding the interaction between multiple subsystems, adding new configuration options, or refactoring existing code to support a new use case.
SWE-bench tasks require more than a language model generating code. They require an agent that can interact with a codebase, explore files, run tests, and iterate on solutions. Understanding these agent architectures is essential for interpreting SWE-bench scores.
A SWE-bench agent follows a workflow that mirrors how a human developer approaches an unfamiliar bug report:
Step 1: Issue analysis. The agent reads the issue description and extracts key information: what is broken, what the expected behavior should be, any error messages or stack traces mentioned, and any hints about which part of the codebase is involved.
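A quoted Python traceback, for instance, already names files, lines, and functions, and a small regex pulls these localization hints out. The issue snippet below is invented for illustration:

```python
import re

# Matches frames like: File "pkg/module.py", line 42, in func_name
TRACEBACK_FRAME = re.compile(r'File "([^"]+)", line (\d+), in (\w+)')


def localization_hints(issue_text: str) -> list[tuple[str, int, str]]:
    """Pull (file, line, function) hints from any Python traceback quoted
    in an issue body -- often the fastest route to the buggy code."""
    return [(path, int(line), func)
            for path, line, func in TRACEBACK_FRAME.findall(issue_text)]


issue = '''Calling .resolve() crashes:
Traceback (most recent call last):
  File "django/urls/resolvers.py", line 694, in resolve
    raise Resolver404({'path': path})
'''
print(localization_hints(issue))
# [('django/urls/resolvers.py', 694, 'resolve')]
```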
Step 2: Codebase exploration. The agent navigates the repository to build context. This typically involves:

- searching the codebase for keywords, class names, and error messages taken from the issue,
- reading the files those searches surface, along with their imports and callers, and
- inspecting related tests to understand the expected behavior.
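A minimal sketch of the keyword-search heuristic, using only the standard library (the function name and scoring rule are ours, not any particular agent's implementation):

```python
from pathlib import Path


def rank_files_by_keywords(repo_root: str, keywords: list[str],
                           top_k: int = 5) -> list[str]:
    """Rank Python files by how many issue keywords they mention.

    A crude but often effective localization heuristic: count
    case-insensitive keyword hits per file and return the best matches.
    """
    scores = []
    for path in Path(repo_root).rglob("*.py"):
        try:
            text = path.read_text(encoding="utf-8", errors="ignore").lower()
        except OSError:
            continue  # unreadable file; skip it
        score = sum(text.count(kw.lower()) for kw in keywords)
        if score:
            scores.append((score, str(path)))
    # Highest-scoring files first.
    return [p for _, p in sorted(scores, reverse=True)[:top_k]]
```

Real agents layer richer tools on top of this idea (AST-aware search, call-graph traversal, embedding retrieval), but keyword grep over the repository remains a common first step.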
Step 3: Root cause identification. Based on the issue description and code exploration, the agent identifies the specific code that needs to change. For a bug fix, this means tracing the execution path that leads to the bug. For a feature request, it means understanding where in the codebase the new functionality should be added.
Step 4: Bug reproduction (advanced agents). More sophisticated agents attempt to reproduce the bug before fixing it. They write a small script that triggers the reported behavior, run it to confirm the failure, and use the script later to verify their fix. This step significantly improves patch quality.
Step 5: Patch generation. The agent writes the actual code changes as a unified diff patch. This often requires changes to multiple files and careful handling of edge cases.
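The unified-diff format itself is standard, and Python's difflib can produce one. The snippet below sketches a patch adding a hypothetical edit_only flag to a simplified, invented formset_factory (not Django's actual source):

```python
import difflib

# Simplified "before" and "after" versions of an imaginary source file.
before = '''def formset_factory(form, extra=1, can_delete=False):
    return type("FormSet", (BaseFormSet,), {"extra": extra})
'''
after = '''def formset_factory(form, extra=1, can_delete=False, edit_only=False):
    return type("FormSet", (BaseFormSet,), {"extra": extra, "edit_only": edit_only})
'''

# Produce a unified diff in the a/<path> b/<path> convention used by git.
patch = "".join(difflib.unified_diff(
    before.splitlines(keepends=True),
    after.splitlines(keepends=True),
    fromfile="a/django/forms/formsets.py",
    tofile="b/django/forms/formsets.py",
))
print(patch)
```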
Step 6: Self-testing and refinement. The most effective agents run the test suite (or relevant subsets) before submitting their patch, using test results to iteratively refine their solution. If tests fail, the agent analyzes the failure, modifies the patch, and retries. This iterative loop can run for multiple cycles.
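The loop in Step 6 can be sketched with stub callables standing in for the model call and the sandboxed test run (the interface names here are ours, for illustration):

```python
def refine_until_passing(generate_patch, run_tests, max_attempts=3):
    """Iterate: propose a patch, run the tests, feed failures back.

    `generate_patch(feedback)` stands in for the model call and
    `run_tests(patch)` for the sandboxed test run; `run_tests` returns a
    list of failing test names (empty list = success).
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        patch = generate_patch(feedback)
        failures = run_tests(patch)
        if not failures:
            return patch, attempt  # success on this attempt
        feedback = f"Attempt {attempt} failed: {failures}"
    return None, max_attempts  # gave up


# Stub model: the first patch breaks a regression test, the second passes.
attempts = iter(["patch-v1", "patch-v2"])
patch, n = refine_until_passing(
    generate_patch=lambda fb: next(attempts),
    run_tests=lambda p: [] if p == "patch-v2" else ["test_existing_behavior"],
)
print(patch, n)  # patch-v2 2
```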
Several agent frameworks have been developed specifically for SWE-bench:
| Framework | Developer | Key Features |
|---|---|---|
| SWE-agent | Princeton/Stanford | Custom shell interface, constrained editing with linting, context management |
| Devin | Cognition Labs | Full development environment, browser access, long-term planning |
| OpenHands (formerly OpenDevin) | Open source community | Modular architecture, multiple tool backends |
| Auggie CLI | Augment Code | Proprietary scaffolding, optimized for long-horizon tasks |
| Mini-SWE-agent | Princeton | Minimalist 100-line agent using only bash, scores >74% on Verified |
A notable recent development is the minimalist approach exemplified by Mini-SWE-agent, which achieves competitive scores (over 74% on SWE-bench Verified) using only bash commands and no custom tools. This suggests that the quality of the underlying language model matters more than the sophistication of the tool framework.
The same language model can achieve dramatically different SWE-bench scores depending on the agent scaffolding wrapped around it. Scaffolding differences include:

- which tools the agent can call (file search, editors, shell access, test runners),
- how repository context is retrieved and fit into the model's context window,
- how many attempts or retry loops the agent is allowed per task, and
- whether the agent can run the test suite before submitting its patch.
This scaffolding variance is one of SWE-bench's most significant confounds: it is difficult to isolate the contribution of the model from the contribution of the scaffolding. SWE-bench Pro's SEAL (Standardized Evaluation for Agentic LLMs) framework partially addresses this by providing a common scaffolding that all models are evaluated with.
SWE-bench Lite is a 300-instance subset of the full benchmark, created to make evaluation faster and more accessible. It focuses on functional bug fixes (excluding feature additions) and covers 11 of the 12 original repositories. SWE-bench Lite was popular in early evaluations but has since been superseded by SWE-bench Verified.
In August 2024, OpenAI collaborated with the SWE-bench team to create SWE-bench Verified, a human-validated subset of 500 instances. Professional software developers reviewed each task to ensure:

- the issue description is well-specified enough to be solved without extra context,
- the evaluation tests actually validate the fix without being overly narrow, and
- the task's environment can be set up and run reliably.
This human validation was important because the original SWE-bench dataset contained some problematic instances: issues with ambiguous descriptions, tests that did not properly validate the fix, or tasks that required information not present in the issue text. By filtering these out, SWE-bench Verified provided a cleaner evaluation signal.
SWE-bench Verified became the most widely reported version of the benchmark from late 2024 through 2025.
The following table shows the top-performing systems on SWE-bench Verified as of early 2026.
| Rank | System | Organization | Resolve Rate |
|---|---|---|---|
| 1 | Claude Opus 4.5 (with agent scaffold) | Anthropic | 80.9% |
| 2 | Claude Opus 4.6 | Anthropic | 80.8% |
| 3 | Gemini 3.1 Pro | Google DeepMind | 80.6% |
| 4 | MiniMax M2.5 | MiniMax | 80.2% |
| 5 | GPT-5.2 | OpenAI | 80.0% |
| 6 | Claude Sonnet 4.6 | Anthropic | 79.6% |
| 7 | Sonar Foundation Agent | SonarSource | 79.2% |
| 8 | Gemini 3 Flash | Google DeepMind | 78.0% |
| 9 | GLM-5 | Zhipu AI | 77.8% |
| 10 | Kimi K2.5 | Moonshot AI | 76.8% |
Scores on SWE-bench Verified have risen dramatically since the benchmark's introduction. Early 2024 saw top scores around 20-30%; by mid-2025, multiple systems had crossed 70%; and by early 2026, the top cluster of models was packed tightly between 76% and 81%.
The following table traces the evolution of top scores across SWE-bench variants from 2023 to 2026:
| Date | System | Variant | Score | Significance |
|---|---|---|---|---|
| Oct 2023 | Claude 2 (retrieval-based) | Full | 3.79% | Initial baseline |
| Mar 2024 | Devin | Lite | 13.86% | Generated massive media attention |
| Apr 2024 | SWE-agent + GPT-4 | Lite | 18.0% | Open-source agent framework |
| Aug 2024 | OpenAI internal | Verified | ~33% | Launched Verified subset |
| Oct 2024 | Claude 3.5 Sonnet (new) | Verified | 49.0% | First model to approach 50% on Verified |
| Jan 2025 | Multiple systems | Verified | ~60-65% | Rapid improvement wave |
| Mid 2025 | Multiple systems | Verified | ~70-75% | Approaching saturation |
| Early 2026 | Claude Opus 4.5 | Verified | 80.9% | Current state-of-the-art |
The progression from 3.79% to over 80% in approximately two years represents one of the fastest capability improvements in AI benchmark history.
A significant caveat emerged in late 2025 when researchers discovered evidence of data contamination in SWE-bench Verified. OpenAI reported that every frontier model tested (including GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash) could reproduce verbatim gold patches or problem-statement-specific details for certain SWE-bench Verified tasks. This means that some of the benchmark's tasks (or very similar code) appeared in the models' training data, artificially inflating scores.
OpenAI's investigation revealed several concrete examples of contamination:
The django__django-14725 case. In this task, the tests require a specific new parameter called edit_only, which the problem statement never explicitly asks for. While solving the task, GPT-5.2's chain of thought revealed knowledge of the Django release notes describing this change, correctly identifying that the edit_only parameter was introduced in Django 4.1. That knowledge could only have come from training data, not from the issue description.
Verbatim patch reproduction. GPT-5.2 was observed outputting exact gold patches from short snippets of task descriptions, including exact class and method names and specific implementation details (like new early return conditions) that were not derivable from the issue alone.
Systematic assessment. To assess the scale of contamination, OpenAI created an automated red-teaming setup, tasking GPT-5 with probing GPT-5.2-Chat, Claude Opus 4.5, and Gemini 3 Flash Preview for contamination across each SWE-bench Verified question. The results showed that GPT-5.2 could solve 31 tasks that OpenAI identified as "almost impossible to solve" without contamination.
Beyond contamination, analysis revealed other issues with SWE-bench Verified scores, notably the reliance on self-reported leaderboard results and the difficulty of separating model capability from agent scaffolding, both discussed further under criticisms below.
This contamination finding contributed to the development of SWE-bench Pro, which was designed from the ground up to be contamination-resistant.
In late 2025, Scale AI introduced SWE-bench Pro, a next-generation software engineering benchmark that addresses several fundamental limitations of the original SWE-bench and SWE-bench Verified.
SWE-bench Pro contains 1,865 tasks sourced from 41 actively maintained repositories spanning multiple programming languages. This is a major expansion from the original benchmark's 12 Python-only repositories. The benchmark is partitioned into three subsets:
| Subset | Tasks | Repositories | Description |
|---|---|---|---|
| Public | 731 | 11 open-source (GPL-licensed) | Freely accessible for research |
| Held-out | 858 | 12 repositories | Reserved for leaderboard evaluation |
| Private (Commercial) | 276 | 18 proprietary startup codebases | Partnership with private companies |
The use of GPL-licensed code for the public set is a deliberate contamination-resistance strategy: the strong copyleft license acts as a legal deterrent against including this code in model training data.
SWE-bench Pro addresses four limitations that plagued earlier versions:

- Contamination: tasks are drawn from copyleft-licensed and private codebases that are unlikely to appear in training data.
- Language coverage: tasks extend beyond the Python-only scope of the original benchmark.
- Saturation: tasks are substantially harder, leaving headroom above frontier-model performance.
- Scaffolding variance: the SEAL leaderboard evaluates all models with a common, standardized scaffold.
Performance on SWE-bench Pro is substantially lower than on Verified, reflecting both the harder tasks and the absence of contamination. The following table shows the top systems on the public dataset with standardized scaffolding (SEAL) as of early 2026.
| Rank | System | Organization | Resolve Rate |
|---|---|---|---|
| 1 | Claude Opus 4.5 | Anthropic | 45.9% |
| 2 | Claude 4.5 Sonnet | Anthropic | 43.6% |
| 3 | Gemini 3 Pro Preview | Google DeepMind | 43.3% |
| 4 | Claude 4 Sonnet | Anthropic | 42.7% |
| 5 | GPT-5 (High) | OpenAI | 41.8% |
| 6 | GPT-5.2 Codex | OpenAI | 41.0% |
| 7 | Claude 4.5 Haiku | Anthropic | 39.5% |
| 8 | Qwen3-Coder 480B | Alibaba | 38.7% |
| 9 | MiniMax 2.1 | MiniMax | 36.8% |
| 10 | Gemini 3 Flash | Google DeepMind | 34.6% |
The fact that the best models solve about 46% of SWE-bench Pro tasks versus over 80% on Verified illustrates both the harder nature of the tasks and the likely contamination effects on the older benchmark.
With optimized agent scaffolding beyond the standardized SEAL evaluation, some systems have achieved higher scores. Augment Code's Auggie CLI reported 51.8% on SWE-bench Pro, and OpenAI reported 57% for GPT-5.3 Codex with their internal scaffolding.
SWE-bench fundamentally changed how the AI community thinks about code generation evaluation. Before SWE-bench, the field was dominated by function-level benchmarks like HumanEval and MBPP, which tested a narrow and artificial skill. SWE-bench showed that real software engineering is qualitatively different from writing isolated functions, and it provided the first rigorous way to measure progress on this harder task.
Several specific impacts stand out:
Agentic evaluation. SWE-bench was one of the first benchmarks to require an agent (not just a model) to interact with an environment over multiple steps. This pushed the field toward building and evaluating AI agents rather than just language models.
Industry adoption. SWE-bench scores became a standard metric in press releases and marketing materials for AI coding products. When Cognition Labs launched Devin in early 2024 with a SWE-bench score of 13.86%, it generated enormous attention. When Claude 3.5 Sonnet achieved over 49% later that year, it was headline news. The benchmark gave companies a concrete, legible number to compete on.
Research direction. SWE-bench influenced research priorities by highlighting that code generation was only part of the problem. Issues like repository understanding, long-context reasoning, multi-file editing, and test-driven development became active research areas largely because SWE-bench made them measurable.
Benchmark evolution. The trajectory from SWE-bench to SWE-bench Lite to SWE-bench Verified to SWE-bench Pro illustrates a broader pattern in AI evaluation: benchmarks get saturated or contaminated, prompting the creation of harder and cleaner successors. Each iteration has pushed the frontier of what AI coding evaluation means.
SWE-bench has also had a direct influence on the commercial AI coding landscape, with vendors of AI coding assistants and agents competing publicly on its scores in product launches and marketing.
SWE-bench exists within a growing ecosystem of software engineering benchmarks.
| Benchmark | Focus | Key difference from SWE-bench |
|---|---|---|
| HumanEval | Single-function code generation | Much simpler, isolated problems |
| GPQA | Graduate-level science Q&A | Tests knowledge, not coding |
| SWE-bench Multimodal | GUI-based software tasks | Includes visual understanding |
| SWE-bench Live | Continuously updated tasks | Resists contamination via freshness |
| SWE-rebench | Independent reproduction | Verifies reproducibility of scores |
| BigCodeBench | Complex function calls with libraries | Tests API and tool usage |
| SWE-EVO | Evolving software benchmark | Tests adaptation to changing codebases |
Despite its influence, SWE-bench has faced criticism on several fronts.
Self-reported scores. The SWE-bench leaderboard relies heavily on self-reported results. As of early 2026, none of the 77 entries on the SWE-bench Verified leaderboard were independently verified. This creates opportunities for cherry-picking favorable configurations or evaluation conditions.
Scaffolding variance. Performance on SWE-bench depends heavily on the agent scaffolding, not just the underlying model. The same model can achieve very different scores depending on how it explores the repository, how many attempts it gets, and how it uses tool calls. This makes it hard to compare models fairly unless the scaffolding is standardized (which SWE-bench Pro's SEAL evaluation partially addresses).
Cost sensitivity. Some top scores require running each task through expensive extended reasoning or multiple retry loops, costing several dollars per task. Whether a system that spends $50 to fix a bug is practically useful is a separate question from whether it can do it at all.
Python bias (original). The original SWE-bench and SWE-bench Verified are Python-only, which means they do not test performance on the many languages used in real software development. SWE-bench Pro addresses this by spanning multiple languages.
Django dominance. With Django accounting for 37-46% of tasks depending on the variant, SWE-bench scores are disproportionately influenced by a model's ability to work with one specific codebase. A model that is excellent at Django but mediocre at everything else could achieve a misleadingly high score.
Issue description quality. Some issue descriptions in the original dataset are vague, reference external resources, or assume context that only a project maintainer would have. While SWE-bench Verified addressed the worst cases, the quality of issue descriptions remains variable.
As of early 2026, SWE-bench remains the most influential benchmark for AI software engineering. The landscape includes several active variants: SWE-bench Verified is still the most widely cited number, SWE-bench Pro serves as its harder, contamination-resistant successor, and SWE-bench Multimodal and SWE-bench Live extend the family to GUI-based tasks and continuously refreshed tasks, respectively.
The benchmark's trajectory mirrors the broader story of AI coding: from a research curiosity in 2023, when the best models solved less than 4% of tasks, to a competitive arena where frontier systems routinely solve 40-80% of tasks depending on the variant. The gap between SWE-bench Verified scores (80%+) and SWE-bench Pro scores (40-50%) serves as a useful reminder that headline benchmark numbers should be interpreted with care.