SWE-Bench Pro
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,395 words
SWE-Bench Pro (stylized SWE-BENCH PRO) is a benchmark for evaluating the long-horizon software engineering capabilities of AI agents, developed and released by Scale AI in September 2025. Designed as a successor to the influential SWE-bench and SWE-bench Verified benchmarks, SWE-Bench Pro contains 1,865 problems sourced from 41 actively maintained repositories spanning business applications, B2B services, and developer tools, and is partitioned into a public set, a held-out set, and a commercial set of proprietary codebases.[^1][^2] The benchmark was constructed explicitly to address two problems that had eroded the signal of earlier coding benchmarks by late 2025: training-data contamination of public test suites, and saturation by frontier models. Tasks in SWE-Bench Pro are described as long-horizon problems that "may require hours to days for a professional software engineer to complete," with reference solutions touching an average of 4.1 files and modifying around 107 lines of code.[^1][^3]
Following an OpenAI audit of SWE-bench Verified in early 2026 that identified widespread test flaws and evidence of contamination, OpenAI publicly stopped reporting Verified scores and recommended that frontier-model developers report SWE-Bench Pro instead.[^4][^5] By mid-2026, SWE-Bench Pro had become a de facto industry standard for measuring frontier coding-agent capability, with the Scale Labs leaderboard tracking submissions from major laboratories including Anthropic, OpenAI, and Google DeepMind across both the public and commercial dataset splits.[^6][^7]
The original SWE-bench was released in October 2023 by Princeton University researchers and introduced 2,294 issue-resolution tasks drawn from twelve popular Python open-source repositories on GitHub. Each instance pairs a real GitHub issue and its associated codebase with one or more test cases that verify that a candidate patch resolves the issue without breaking existing functionality.[^8] SWE-bench rapidly became a flagship evaluation for measuring the practical coding capabilities of large language models, because unlike contemporary code benchmarks focused on isolated function synthesis, it required agents to navigate a multi-file repository, identify the relevant code, and produce a working patch.
In August 2024, OpenAI released SWE-bench Verified, a 500-instance subset of SWE-bench filtered through human annotation to remove problems with under-specified issues or overly restrictive tests. SWE-bench Verified became the most widely cited coding benchmark across vendor announcements through 2024 and 2025, with reported scores climbing from below 30 percent in early 2024 to above 80 percent for top frontier models by mid-2025.[^4]
By 2025, several issues with the SWE-bench family were becoming acute. The first was saturation: top scores on SWE-bench Verified rose from roughly 75 percent to over 80 percent over a six-month window, and the spread between frontier laboratories' systems compressed to within a few percentage points, limiting the benchmark's ability to differentiate truly capable systems from incrementally improved ones.[^4][^9] Vendor announcements increasingly relied on small differences in reported Verified scores, while independent reproductions varied by 10 percentage points or more depending on the agent scaffolding used to drive the underlying model.
The second issue was contamination. Because SWE-bench Verified is a fixed, fully public set of 500 problems drawn from open-source repositories that have been hosted on GitHub for years, the benchmark's problem statements, repository contents, and gold-standard patches are all present in the data used to pretrain modern language models. An audit performed by OpenAI's evaluation team and published in February 2026 reported that contemporary frontier models, including GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash Preview, could reproduce original gold-patch solutions from training memory when given only a SWE-bench Verified task identifier as a prompt.[^4][^5] The same audit found that a sample of the most challenging "hard" Verified tasks contained tests that were either too narrow, rejecting functionally correct submissions, or too broad, requiring behaviors not specified by the issue.[^4]
A third concern, sometimes called the scaffolding gap, was that reported SWE-bench Verified scores increasingly reflected the quality of an agent's surrounding scaffolding rather than the model's underlying capability. The same model could vary by 12 percentage points or more between a minimal harness and a heavily tuned agent loop on the same problems, undermining cross-vendor comparison.[^9]
SWE-Bench Pro was announced by Scale AI on September 19, 2025, accompanied by the technical paper "SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?", posted to arXiv as paper 2509.16941 on September 21, 2025, and revised on November 14, 2025. The paper lists Xiang Deng, Jeff Da, and Edwin Pan as lead authors, with 19 additional co-authors from Scale AI's research team.[^1][^2]
The announcement framed SWE-Bench Pro as "raising the bar for agentic coding" and explicitly positioned the benchmark as the successor to SWE-bench Verified. Scale AI released the dataset on Hugging Face under the identifier ScaleAI/SWE-bench_Pro and published the evaluation harness as an open-source repository at scaleapi/SWE-bench_Pro-os under the MIT License, including Docker-based execution environments and integrations with both the SWE-agent framework and the lightweight mini-swe-agent scaffold.[^2][^10]
A central design goal of SWE-Bench Pro was to construct a benchmark whose problems are unlikely to appear in any major pretraining corpus. Scale AI achieved this through two complementary strategies, each applied to a different portion of the dataset.[^1][^3]
For the public and held-out portions, Scale AI selected repositories distributed under strong copyleft licenses, principally GPL variants. The rationale, stated in the paper, is that copyleft-licensed code is more likely to have been deliberately excluded from large-scale pretraining corpora because of the obligations that copyleft imposes on derivative works. This is a legal protection: even when the source code is publicly available, model developers face increased downstream risk from training on it.
For the commercial portion, Scale AI entered into formal partnerships with eighteen early-stage startups to license access to proprietary codebases. These commercial repositories are not publicly available at any point, and the benchmark problems derived from them are released only as task identifiers and evaluation results on Scale's leaderboard, with the underlying problem statements, code, and tests held privately by Scale AI. This is a data-access protection: even a determined attempt to contaminate training data could not retrieve the commercial split through web scraping.
The full benchmark draws from 41 repositories, partitioned into 11 public repositories, 12 held-out repositories, and 18 commercial repositories. Each repository contributes between 50 and 100 problem instances, with a strict cap of 100 to prevent any single project from dominating the benchmark. The repositories span "consumer applications, B2B services, and developer tooling platforms," and all are described as "actively maintained professional projects with substantial user bases."[^3]
The task curation pipeline applied to each repository proceeds in three stages, as described in the paper.[^3]
In the sourcing stage, Scale AI identifies consecutive commit pairs that capture an issue's resolution, taking the pre-fix state as the base environment and the diff between the two commits as the reference solution.
In the task description stage, human experts augment the raw issue and patch with a clear problem statement, a list of requirements that specify the expected behavior without prescribing implementation, and interface definitions where relevant. The intent is to remove ambiguity without simplifying the underlying engineering challenge. Every problem must involve at least 10 lines of code change.
In the environments stage, professional engineers construct Docker-based test environments for each problem, define fail-to-pass tests that verify resolution of the original issue, and define pass-to-pass tests that verify that previously working functionality continues to work after the patch is applied. Each test is executed three times during curation to filter out flaky tests, and human reviewers exclude tests deemed "too broad or not relevant" to the issue.
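Two of the curation checks described above, the minimum patch size and the repeated-run filter for flaky tests, can be sketched in a few lines. This is an illustration under stated assumptions, not Scale AI's pipeline: the paper does not say how changed lines are counted, so the sketch counts added and removed lines inside the hunks of a git-style diff, and `run_test` is a hypothetical callable standing in for running one test inside the task's Docker environment. Treating a test as stable only when all three runs agree is likewise an assumed reading of the three-run rule.

```python
from typing import Callable, Iterable


def count_changed_lines(git_diff: str) -> int:
    """Count added and removed lines inside the hunks of a git-style diff."""
    changed = 0
    in_hunk = False
    for line in git_diff.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False  # a new file section starts; header lines follow
        elif line.startswith("@@"):
            in_hunk = True   # hunk header; the body lines come next
        elif in_hunk and line.startswith(("+", "-")):
            changed += 1
    return changed


def meets_minimum_change(git_diff: str, minimum: int = 10) -> bool:
    """Apply the 10-line minimum stated in the article (counting rule assumed)."""
    return count_changed_lines(git_diff) >= minimum


def filter_flaky(
    test_ids: Iterable[str],
    run_test: Callable[[str], bool],  # hypothetical: True if the test passes
    runs: int = 3,                    # the article reports three runs per test
) -> list[str]:
    """Keep only tests whose outcome is identical across every run."""
    stable = []
    for test_id in test_ids:
        outcomes = {run_test(test_id) for _ in range(runs)}
        if len(outcomes) == 1:  # same result each time, so treated as stable
            stable.append(test_id)
    return stable
```

A test that fails on every run counts as stable here. The filter only removes tests that give different results from one run to the next.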
A key contrast between SWE-Bench Pro and earlier benchmarks is the scale of the tasks. The paper reports that reference solutions span an average of 107.4 lines of code across 4.1 files, with more than 100 tasks requiring more than 100 lines of modification. By comparison, the original SWE-bench, while drawn from real repositories, contains many tasks that can be resolved by editing a single function in a single file. SWE-Bench Pro tasks are explicitly described as "long-horizon" problems that "may require hours to days for a professional software engineer to complete."[^1][^3]
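SWE-Bench Pro contains 1,865 problem instances in total, partitioned as follows:[^1][^2][^3]
| Split | Repositories | Instances |
|---|---|---|
| Public | 11 | 731 |
| Held-out | 12 | 858 |
| Commercial | 18 | 276 |
| Total | 41 | 1,865 |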
The benchmark spans four programming languages: Python, JavaScript, TypeScript, and Go. This represents a significant expansion from SWE-bench and SWE-bench Verified, both of which are Python-only. The paper does not publish an exact per-language instance count, but it reports that performance varies substantially by language: models tend to achieve higher resolve rates on Go and Python tasks (sometimes exceeding 30 percent on the public split) and lower, more variable rates on JavaScript and TypeScript tasks, which occasionally fall to near zero.[^1][^3]
Task types include bug fixes, feature requests, security patches, performance optimizations, and user-interface changes, reflecting the breadth of work performed against production codebases.[^11]
SWE-Bench Pro is evaluated using a test-based protocol inherited from the SWE-bench lineage. For each task instance, an agent is given the issue's problem statement and access to the pre-fix snapshot of the repository, and is expected to produce a unified-diff patch. The patch is applied to the repository, the test suite is executed inside the instance-specific Docker container, and the task is scored as resolved only if all fail-to-pass tests pass and no pass-to-pass tests regress.[^2][^10]
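The scoring rule can be expressed as a short sketch. The class and function names below are hypothetical and are not taken from the published harness. The sketch assumes per-test pass/fail results have already been collected after applying the candidate patch, and it adds one assumption of its own: a task with no fail-to-pass tests is never counted as resolved.

```python
from dataclasses import dataclass


@dataclass
class TaskOutcome:
    """Test results for one task after the candidate patch is applied."""
    fail_to_pass: dict[str, bool]  # test name -> passed after the patch?
    pass_to_pass: dict[str, bool]  # test name -> still passing after the patch?


def is_resolved(outcome: TaskOutcome) -> bool:
    """Resolved only if every fail-to-pass test passes and nothing regresses."""
    # Assumption: a task without fail-to-pass tests cannot count as resolved.
    if not outcome.fail_to_pass:
        return False
    return all(outcome.fail_to_pass.values()) and all(outcome.pass_to_pass.values())


def resolve_rate(outcomes: list[TaskOutcome]) -> float:
    """Fraction of tasks resolved; 0.0 for an empty list."""
    if not outcomes:
        return 0.0
    return sum(is_resolved(o) for o in outcomes) / len(outcomes)
```

Under this rule a patch that fixes the issue but breaks a single previously passing test scores the same as a patch that fixes nothing, which is why the pass-to-pass tests matter as much as the fail-to-pass ones.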
The Scale Labs leaderboard distinguishes between two evaluation regimes. The first uses the mini-swe-agent scaffold, a deliberately lightweight harness developed alongside the benchmark to minimize the influence of agent engineering on the score. The second is an uncapped regime, often labeled "Uncapped (turn limit 250)," in which submitters may use their own scaffolding subject only to an upper limit on the number of agent turns per task. Submissions on the public-set leaderboard are reported with 95 percent confidence intervals on the resolve rate, computed across the 731 problem instances.[^6][^7]
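The article does not state how the confidence intervals are computed. As a rough sanity check, the sketch below uses the normal-approximation (Wald) interval for a binomial proportion, which is an assumption and not a documented Scale Labs method. The function name is hypothetical.

```python
import math


def wald_half_width(resolve_rate: float, n_tasks: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval.

    resolve_rate is a fraction in [0, 1]; z = 1.96 corresponds to 95 percent.
    """
    return z * math.sqrt(resolve_rate * (1.0 - resolve_rate) / n_tasks)


# A 59.10 percent resolve rate over the 731 public-set instances.
half_width = wald_half_width(0.591, 731)
print(f"{100 * half_width:.2f} percentage points")
```

For a 59.10 percent resolve rate over 731 instances this gives a half-width of about 3.56 percentage points, in line with the interval listed for the top public-set entry. Other entries agree only to within about 0.01 points, so the published method may differ in detail.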
In published comparisons, OpenAI argued that the more standardized scaffolding offered by SWE-Bench Pro, in particular the mini-swe-agent regime, helps eliminate "scaffolding-driven" score inflation that had become endemic to SWE-bench Verified reporting.[^4][^9]
Scale AI publishes two distinct leaderboards: one for the public dataset (731 instances) and one for the private dataset, which combines the held-out and commercial splits. Submissions on the private leaderboard are evaluated by Scale AI on behalf of submitters, because the underlying problems are not distributed publicly.[^6][^7]
The original Scale AI launch reported the following top scores on the 731-instance public set, evaluated with Scale's standard agent scaffolding:[^1][^11]
| Model | Resolve rate |
|---|---|
| OpenAI GPT-5 | 23.3% |
| Claude Opus 4.1 | 22.7% |
| Claude Sonnet 4 | 17.6% |
| Gemini 2.5 Pro Preview | 13.5% |
| SWE-Smith-32B | 6.8% |
| OpenAI GPT-4o | 4.9% |
| Qwen-3 32B | 3.4% |
As of mid-2026, the Scale Labs leaderboard shows substantially higher scores reflecting subsequent frontier-model releases, with multiple systems clustered between 40 and 60 percent on the public split. Reported scores include the following, with confidence intervals as published by Scale Labs:[^6]
| Model | Resolve rate | Scaffolding |
|---|---|---|
| GPT-5.4 (xHigh) | 59.10 ± 3.56% | mini-swe-agent |
| Muse Spark | 55.00 ± 3.60% | mini-swe-agent |
| Claude Opus 4.6 (thinking) | 51.90 ± 3.61% | mini-swe-agent |
| Gemini 3 Pro (thinking) | 46.10 ± 3.60% | mini-swe-agent |
| Claude Opus 4.5 | 45.89 ± 3.60% | Uncapped (250 turns) |
| Claude Sonnet 4.5 | 43.60 ± 3.60% | Uncapped (250 turns) |
The Scale Labs private-dataset leaderboard, which evaluates models on the 858-instance held-out set combined with the 276-instance commercial set, generally shows lower scores than the public leaderboard, consistent with the dataset's design as a stronger generalization test. Selected published results include:[^7]
| Model | Resolve rate | Scaffolding |
|---|---|---|
| Claude Opus 4.6 (thinking) | 47.10 ± 6.07% | mini-swe-agent |
| Muse Spark | 44.70 ± 6.05% | mini-swe-agent |
| GPT-5.4 (xHigh) | 43.40 ± 6.03% | mini-swe-agent |
| Gemini 3 Pro (thinking) | 32.20 ± 5.69% | mini-swe-agent |
| GPT 5.2 Codex | 27.74 ± 5.09% | Not specified |
| GPT-5.2 | 23.81 ± 5.09% | Not specified |
| Claude Opus 4.5 | 23.44 ± 5.07% | Not specified |
| Gemini 3 Pro | 17.95 ± 4.78% | Not specified |
| Claude Opus 4.1 | 17.75 ± 4.51% | Not specified |
| OpenAI GPT-5 | 14.86 ± 4.20% | Not specified |
| Gemini 2.5 Pro Preview | 10.14 ± 3.56% | Not specified |
| Claude Sonnet 4 | 9.06 ± 3.39% | Not specified |
| OpenAI GPT-4o | 3.62 ± 2.20% | Not specified |
Cross-checking the same model between the public and private splits illustrates the generalization gap: Claude Opus 4.1, for example, declined from 22.7 percent on the public set to 17.75 percent on the private set, consistent with Scale AI's claim that performance on truly unseen, proprietary codebases provides a more conservative estimate of underlying capability.[^7]
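The same comparison can be tabulated for any model that appears on both leaderboards. The sketch below uses only point estimates quoted in this article and ignores the confidence intervals. The first four entries were run with mini-swe-agent on both splits. The Claude Opus 4.1 pair mixes the launch scaffolding with an unspecified private-set scaffold, so its gap is only indicative.

```python
# (public %, private %) point estimates as quoted in this article.
SCORES = {
    "GPT-5.4 (xHigh)": (59.10, 43.40),
    "Muse Spark": (55.00, 44.70),
    "Claude Opus 4.6 (thinking)": (51.90, 47.10),
    "Gemini 3 Pro (thinking)": (46.10, 32.20),
    "Claude Opus 4.1": (22.7, 17.75),  # scaffolds differ between the splits
}


def generalization_gaps(scores: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Public minus private resolve rate, in percentage points."""
    return {
        model: round(public - private, 2)
        for model, (public, private) in scores.items()
    }


# Print the models from the largest public-to-private drop to the smallest.
gaps = generalization_gaps(SCORES)
for model, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: {gap:.2f} points")
```

Because the private-set intervals are roughly six points wide, small differences between these gaps should not be over-interpreted.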
Beyond raw model evaluations, multiple AI-coding tool vendors have published SWE-Bench Pro results using their proprietary scaffolds running on top of frontier models. In one such submission, the Augment Code team reported that its Auggie agent, running on Claude Opus 4.5, reached 51.80 percent on the public set, compared to 50.21 percent for Cursor using the same model and 49.75 percent for Claude Code. Augment attributed the gap to its proprietary code-retrieval index rather than to differences in the base model.[^11]
Several structural differences distinguish SWE-Bench Pro from earlier benchmarks in the SWE-bench family.[^1][^3][^9]
| Property | SWE-bench | SWE-bench Verified | SWE-Bench Pro |
|---|---|---|---|
| Release | October 2023 | August 2024 | September 2025 |
| Instances | 2,294 | 500 | 1,865 |
| Languages | Python | Python | Python, JavaScript, TypeScript, Go |
| Repositories | 12 (open source) | Subset of SWE-bench | 41 (11 public, 12 held-out, 18 commercial) |
| Avg files changed | 1 to 2 | 1 to 2 | 4.1 |
| Avg lines changed | Tens | Tens | 107 |
| Contamination defense | None explicit | None explicit | Copyleft licensing, private commercial split |
| Curator | Princeton | OpenAI | Scale AI |
The performance gap between SWE-Bench Pro and its predecessors is also substantial. The same frontier models that achieved over 70 percent on SWE-bench Verified at the time of SWE-Bench Pro's launch scored around 23 percent on SWE-Bench Pro's public set, a drop of more than 45 percentage points.[^1][^9] This drop is consistent with the benchmark's stated design intent of restoring measurement headroom, and with the broader observation that SWE-bench Verified had become both saturated and contaminated.
SWE-Bench Pro is distinct from SWE-bench Multimodal, a separate variant introduced by the original SWE-bench authors that incorporates image-based issues, and from SWE-rebench, an independently maintained benchmark of continuously refreshed GitHub issues designed to mitigate contamination through temporal cycling. SWE-Bench Pro's contamination strategy relies on copyleft licensing and proprietary data, rather than on continuous refresh.[^4][^12]
SWE-Bench Pro received substantial attention from both the research community and frontier-model developers in the months following its release. In February 2026, OpenAI announced that it would no longer report SWE-bench Verified scores in connection with new model releases, citing both the contamination and test-quality findings of its audit, and recommended SWE-Bench Pro as the new standard for frontier coding evaluation.[^4][^5] OpenAI's developer-relations account framed the shift as a response to "model maturity," stating that "the standard for frontier coding evals is changing" and that the company would now "recommend reporting SWE-bench Pro" while collaborating with the broader industry on stronger evaluation standards.[^5]
Anthropic and Google DeepMind also began including SWE-Bench Pro numbers in technical reports for models released in late 2025 and 2026, although several vendors continued to report SWE-bench Verified scores in parallel during a transition period.[^6][^7]
Critics of SWE-Bench Pro have raised several concerns. Some observers note that the benchmark, despite its protections, is not immune to contamination. Audits have found cases of contamination in SWE-Bench Pro, though these were described as "significantly rarer and less egregious than SWE-bench Verified," and the long-term effectiveness of the copyleft-license heuristic depends on model developers continuing to exclude such code from their training data.[^4] A second concern is methodological: SWE-Bench Pro, like its predecessors, evaluates models in isolation from the code-review workflows, security constraints, and organizational standards that define real-world software engineering. The benchmark measures whether a single agent can produce a patch that passes a hand-curated test suite, not whether the patch would pass code review at the originating organization.[^9][^13]
A third concern relates to the scaffolding gap. While SWE-Bench Pro's mini-swe-agent regime is designed to neutralize scaffolding differences, the leaderboard also accepts submissions under the uncapped regime, where vendors can deploy proprietary agent loops. The gap between the two regimes on the same base model can exceed 5 percentage points, suggesting that scaffolding still has measurable influence on reported scores.[^6][^11]
Finally, the commercial portion of the benchmark, while a powerful contamination defense, introduces a tradeoff. Because the commercial-split problems are not publicly auditable, external researchers cannot inspect them for bias, ambiguity, or test quality, and must trust Scale AI's internal verification process. This is in contrast to SWE-bench Verified, whose problems were filtered by humans but remained publicly inspectable.[^9]
Despite these concerns, by mid-2026 SWE-Bench Pro had emerged as the most widely cited coding-agent benchmark in vendor announcements and in independent reporting, taking on a role analogous to that previously held by SWE-bench Verified and earlier by SWE-bench itself.[^4][^6] Its existence has also influenced the design of subsequent benchmarks, including SWE-Bench++ and SWE-rebench, both of which adopt some combination of contamination resistance and broader language coverage.[^12]
SWE-Bench Pro sits alongside several other contemporary code and reasoning benchmarks, including LiveCodeBench, Aider Polyglot, BigCodeBench, and broader-purpose evaluations such as MMLU and GPQA, as part of a portfolio of measurements used to characterize frontier model capabilities. Within that portfolio, SWE-Bench Pro is generally cited as the strongest available test of long-horizon, multi-file software engineering against codebases that frontier models are unlikely to have seen during training.