SWE-bench Verified is a human-validated subset of 500 software engineering tasks drawn from the original SWE-bench dataset, released by OpenAI on August 13, 2024, in collaboration with the Princeton NLP researchers who created SWE-bench. The benchmark was designed to address a fundamental problem with its predecessor: a substantial fraction of the original 2,294 test tasks were ambiguous, impossible to complete with the information provided, or invalidated by flaky unit tests. By screening the full task set through a rigorous human annotation campaign and excluding the flawed items, OpenAI produced a cleaner 500-task set that more reliably measured whether AI systems could resolve real GitHub issues in popular open-source Python repositories.
From its August 2024 debut through 2025, SWE-bench Verified became the dominant standard for evaluating agentic coding systems. Every major AI laboratory reported scores on it when announcing frontier models, and a dense commercial ecosystem of agents, scaffolds, and evaluation services grew up around it. By early 2026, however, OpenAI's own audits found that frontier models were reproducing benchmark solutions verbatim from training data and that the majority of the remaining unsolved tasks contained structural flaws the original human review had not caught. On February 23, 2026, OpenAI formally recommended against further use of the benchmark and endorsed SWE-bench Pro as the preferred successor for evaluating frontier coding capability.
The original SWE-bench, published at ICLR 2024 by Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan at Princeton University, assembled 2,294 real GitHub issue-and-pull-request pairs from twelve popular open-source Python repositories: django/django (850 tasks), sympy/sympy (386), scikit-learn/scikit-learn (229), sphinx-doc/sphinx (187), matplotlib/matplotlib (184), pytest-dev/pytest (119), pydata/xarray (110), astropy/astropy (95), pylint-dev/pylint (57), psf/requests (44), mwaskom/seaborn (22), and pallets/flask (11). Each task presented a model with a frozen repository state, an issue description, and a requirement to produce a patch that caused previously failing unit tests to pass while leaving already-passing tests unaffected.
Early evaluations revealed that the original dataset carried substantial noise. Many issue descriptions were written informally, omitting key details that any human contributor would have looked up in comments or linked pull requests. Some test patches checked for specific implementation choices rather than behavioral outcomes, meaning that an alternative correct solution would fail evaluation even if it solved the underlying bug. Others depended on non-deterministic behavior, network calls that could not be reproduced in sandboxed environments, or configuration states that differed between the annotation environment and the evaluation harness. The consequence was that a significant share of the 2,294 tasks functioned as a measurement of whether a model happened to use the same solution strategy as the original author rather than whether it could solve the problem in principle.
Benchmark noise matters most when the underlying performance is low: a 10% solve rate on a benchmark with 20% unusable tasks is qualitatively different from a 10% rate on a clean benchmark. In 2023, SWE-bench served its purpose as a difficult, aspirational target. By mid-2024, agents and scaffolding systems were pushing toward 15-20% resolution rates on the full test set, and the proportion of spurious failures was becoming a genuine confound. OpenAI concluded that a curated subset would produce more interpretable measurements of progress.
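To make the arithmetic concrete, the short illustration below uses hypothetical numbers chosen to match the figures above (it is not a calculation from any OpenAI report): an unusable-task fraction both caps the achievable score and changes how a low observed rate should be read.

```python
def adjusted_solve_rate(observed_rate: float, unusable_fraction: float) -> float:
    """Solve rate restricted to the tasks that are actually solvable.

    observed_rate     -- fraction of ALL tasks resolved (e.g. 0.10)
    unusable_fraction -- fraction of tasks that are broken or unsolvable (e.g. 0.20)
    """
    solvable_fraction = 1.0 - unusable_fraction
    return observed_rate / solvable_fraction


# A 10% observed rate on a benchmark where 20% of tasks are unusable
# corresponds to solving 12.5% of the genuinely solvable tasks...
print(round(adjusted_solve_rate(0.10, 0.20), 3))  # 0.125

# ...and no agent can ever score above the 80% ceiling set by the noise.
print(1.0 - 0.20)  # 0.8
```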
OpenAI announced SWE-bench Verified on August 13, 2024, through its preparedness blog. The project was built in close collaboration with the Princeton NLP team that authored the original benchmark; the final verified dataset was hosted at swebench.com alongside SWE-bench Lite and the full SWE-bench test set and became the primary leaderboard on the official project site.
The release coincided with GPT-4o's evaluation on the new subset: using the best-performing scaffold available at the time (Agentless), GPT-4o resolved 33.2% of the 500 verified tasks. That figure was roughly double GPT-4o's resolution rate on the original unfiltered test set (approximately 16%), illustrating the effect of removing unsolvable tasks. OpenAI presented this doubling not as evidence that GPT-4o had become dramatically better but as evidence that the original benchmark had been systematically underreporting model capability due to task noise.
The announcement was accompanied by the publication of the SWE-bench Verified dataset on Hugging Face (SWE-bench/SWE-bench_Verified) and the addition of a dedicated verified leaderboard at swebench.com. The dataset contained 500 rows in Parquet format covering instances from the same twelve repositories as the original benchmark, with each instance including the repository identifier, the base commit hash, the gold patch, the test patch, the problem statement, optional hint text from issue comments, the FAIL_TO_PASS test list, the PASS_TO_PASS test list, and a difficulty label. The dataset's issue dates ranged from January 2013 to August 2023, reflecting the full temporal span of the repositories from which it was drawn.
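For readers who want to inspect the data directly, the sketch below loads the published dataset with the Hugging Face `datasets` library. The dataset identifier and field names are taken from the description above; check the dataset card for the exact schema before depending on specific column names.

```python
from datasets import load_dataset

# Load the 500-instance verified set from the Hugging Face Hub.
ds = load_dataset("SWE-bench/SWE-bench_Verified", split="test")
print(len(ds))  # expected: 500

# Peek at the fields of a single instance (names per the description above;
# confirm against the dataset card, as exact column names may differ).
example = ds[0]
for field in ("repo", "base_commit", "problem_statement", "patch",
              "test_patch", "FAIL_TO_PASS", "PASS_TO_PASS"):
    if field in example:
        print(f"{field}: {str(example[field])[:80]}")
```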
In its first weeks, SWE-bench Verified attracted rapid uptake from the research and commercial communities. Within days of the release, scaffolding frameworks that had been targeting the full SWE-bench began reorienting their primary leaderboard submissions toward the Verified subset. The benchmark's cleaner signal and the credibility of OpenAI's methodological investment made it the natural target for teams seeking to demonstrate state-of-the-art performance.
OpenAI recruited 93 software developers with professional Python experience to serve as human annotators. Annotators did not redesign or fix tasks; they evaluated each task against a structured rubric and assigned severity labels indicating how severely any identified problems impaired solvability.
The annotation addressed three dimensions for each task:
Issue description clarity. Annotators assessed whether the problem statement provided enough information to identify what needed to be changed. Underspecified descriptions that required consulting external links, patch history, or implicit community knowledge were flagged.
Test patch validity. Annotators checked whether the FAIL_TO_PASS tests reliably measured the stated behavior. Tests that required a specific implementation pattern (checking function names, checking internal data structure layout) rather than observable behavior were identified as narrow. Tests that covered additional functionality not mentioned in the issue were identified as wide.
Overall solvability. Annotators gave a holistic judgment about whether a skilled developer, given only the repository and the issue description as presented, could produce a correct patch.
Each of the 1,699 tasks randomly sampled from the original 2,294-task test set was independently reviewed by three annotators. The severity labels were ensembled conservatively: the highest-severity label among the three annotators was taken as the final label, so a single annotator flagging a serious problem was enough to disqualify a task. Tasks whose ensembled label exceeded a threshold on either the problem statement or the FAIL_TO_PASS tests were excluded. This process yielded the final 500-task verified set.
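The conservative ensembling rule can be sketched in a few lines. The numeric severity scale, field names, and threshold below are placeholders for illustration only; the actual rubric and cutoff values are those used in OpenAI's annotation campaign.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    # Placeholder ordinal scales: 0 = no problem ... 3 = task unusable.
    statement_severity: int   # how underspecified the problem statement is
    test_severity: int        # how likely the FAIL_TO_PASS tests reject valid fixes


def keep_task(reviews: list[Annotation], threshold: int = 1) -> bool:
    """Conservative ensembling: the worst (max) label across the annotators
    becomes the final label; the task is excluded if that label exceeds the
    threshold on either dimension."""
    worst_statement = max(r.statement_severity for r in reviews)
    worst_tests = max(r.test_severity for r in reviews)
    return worst_statement <= threshold and worst_tests <= threshold


# One of three annotators flags the tests as unusable -> task is dropped.
reviews = [Annotation(0, 0), Annotation(1, 0), Annotation(0, 3)]
print(keep_task(reviews))  # False
```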
The resulting dataset still carried an estimated 5-10% residual error rate, a figure the SWE-bench team acknowledged at release. Human annotation at scale is imperfect, and the annotation rubric could not anticipate every form of task degradation.
At release, the evaluation harness used a containerized Linux environment per task, with network access disabled and the git history trimmed to the commit just before the issue was resolved. Models received the repository, the issue description, and optionally hint text drawn from issue comments. Solutions were applied as patches and scored by running the FAIL_TO_PASS tests (which should now pass) and the PASS_TO_PASS tests (which should remain passing). The primary metric was the fraction of tasks resolved, reported as a single percentage.
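The resolution criterion reduces to a simple rule, sketched below in simplified form. This is not the actual harness code (the real harness builds a per-task container, applies the patch, and parses test-runner logs); it only illustrates how the FAIL_TO_PASS and PASS_TO_PASS lists combine into a single resolved/not-resolved outcome.

```python
import subprocess


def tests_pass(test_ids: list[str]) -> bool:
    """Run the named tests in the patched checkout; True iff all of them pass.

    Simplified stand-in for the containerized execution used by the harness.
    """
    result = subprocess.run(["python", "-m", "pytest", "-q", *test_ids])
    return result.returncode == 0


def task_resolved(fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    # Resolved only if the previously failing tests now pass AND the
    # previously passing tests were not broken by the patch.
    return tests_pass(fail_to_pass) and tests_pass(pass_to_pass)


def resolution_rate(outcomes: list[bool]) -> float:
    # Primary metric: fraction of tasks resolved.
    return sum(outcomes) / len(outcomes)
```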
Evaluation tooling evolved substantially over the following 18 months. A major infrastructure revision released on February 12, 2026 (version 2.0.0) upgraded the scaffolding, containerized environments, and token limits. The new harness provided models with bash execution, a text editor, and patch application tools, and raised token budgets to 2 million uncached read/write tokens and 20 million cached token reads. These changes produced meaningfully higher scores across most models, and the SWE-bench team explicitly noted that results from version 1.x and version 2.x are not directly comparable.
The 500 verified tasks skew considerably toward simpler fixes. Approximately 39% of tasks are classified as trivial (resolvable in under 15 minutes by an experienced developer), 52% as small (15 minutes to 1 hour), and roughly 8-9% as hard (more than 1 hour). The median patch modifies around 14 lines of code, and trivial tasks average 5 lines. Approximately 85.8% of the verified tasks require changes to a single file, compared to 75.1% of the full test set and 50.3% of the training split. This concentration on single-file, smaller fixes reflects both the annotation team's filtering decisions and the fact that the original dataset's simpler tasks were more likely to have clear, unambiguous issue descriptions and therefore to survive the filter.
The table below summarizes the principal structural differences between SWE-bench Verified and the full SWE-bench test set.
| Dimension | SWE-bench (full test) | SWE-bench Verified |
|---|---|---|
| Task count | 2,294 | 500 |
| Human validation | None | 93 annotators, 3 reviews per task |
| Single-file issues | 75.1% | 85.8% |
| Multi-file issues | 24.9% | 14.2% |
| Estimated task error rate | Not assessed | 5-10% |
| Primary language | Python | Python |
| Repositories | 12 | Same 12 (subset of instances) |
| Official leaderboard | swebench.com/original | swebench.com/verified |
| GPT-4o baseline (Aug 2024) | ~16% | 33.2% |
The reduction from 2,294 to 500 tasks was not a random sample. Harder, more ambiguous, and multi-file tasks were disproportionately filtered out, which is why the verified set's single-file proportion is higher and why baseline scores approximately doubled. This selection effect has implications for interpreting high scores: a model that resolves 80% of verified tasks is resolving tasks that, on average, are simpler and better-specified than those in the full benchmark.
SWE-bench Lite, a 300-task random sample released by the Princeton team for fast and cheap evaluation, sits between the full benchmark and the Verified set in terms of accessibility. It was the primary fast-evaluation target prior to the Verified release, but it was sampled for size rather than filtered for correctness.
The following table documents the historical progression of the state-of-the-art score on SWE-bench Verified from its August 2024 release through May 2026. Scores from before and after the February 2026 v2.0.0 harness upgrade are not directly comparable, but both are included for historical context.
| Period | Top System / Model | Score | Notes |
|---|---|---|---|
| August 2024 | GPT-4o + Agentless | 33.2% | Launch baseline, announced in OpenAI blog |
| October 2024 | Claude 3.5 Sonnet (new) | 49.0% | Anthropic-reported; scaffold-assisted |
| Late 2024 | OpenAI o1 | 45-48% | o1-preview range; new reasoning approach |
| December 2024 | Various agent systems | ~55-62% | Competitive agent approaches; best ~62% |
| January 2025 | o1 + scaffold | 64.6% | Held SotA before multi-model ensembles |
| January-April 2025 | OpenAI o3 | ~72% | Announced with o3 model release |
| April-May 2025 | Multi-model ensembles | ~70-72% | Claude 3.7 Sonnet + o3 + Gemini 2.5 Pro combos |
| June 2025 | Various systems | ~74-75% | Incremental agent improvements |
| July-September 2025 | Claude Opus 4.5 | 80.9% | First to exceed 80%; new state-of-the-art |
| October-November 2025 | Claude Opus 4.6 | 80.8% | Near-identical to 4.5 |
| December 2025 - Jan 2026 | Various | 79-81% | Plateau; multiple models clustered near 80% |
| February 2026 | Claude Opus 4.7 | 87.6% | Post-v2.0.0 harness (higher token budget) |
| February 2026 | GPT-5.3 Codex | 85.0% | OpenAI submission |
| May 2026 | Claude Mythos Preview | 93.9% | Restricted preview model (Anthropic) |
The plateau around 80% in late 2025 was a significant inflection point. Progress had slowed from roughly 3-5 percentage points per month during the competitive 2024-2025 period to less than 1 percentage point over the final six months of 2025. OpenAI's audit found that the remaining failures were dominated by tasks with structural flaws rather than tasks genuinely challenging for frontier models, suggesting the benchmark had effectively saturated for practical purposes before the February 2026 harness upgrade temporarily produced higher absolute scores.
The leaderboard snapshot below lists the top reported scores on SWE-bench Verified through May 2026, combining pre- and post-v2.0.0 results.

| Rank | Model | Organization | Score |
|---|---|---|---|
| 1 | Claude Mythos Preview | Anthropic | 93.9% |
| 2 | Claude Opus 4.7 | Anthropic | 87.6% |
| 3 | GPT-5.3 Codex | OpenAI | 85.0% |
| 4 | Claude Opus 4.5 | Anthropic | 80.9% |
| 5 | Claude Opus 4.6 | Anthropic | 80.8% |
| 6 | Gemini 3.1 Pro | Google | 80.6% |
| 6 | DeepSeek-V4-Pro-Max | DeepSeek | 80.6% |
| 8 | MiniMax M2.5 | MiniMax | 80.2% |
| 8 | Kimi K2.6 | Moonshot AI | 80.2% |
| 10 | GPT-5.2 | OpenAI | 80.0% |
A wide range of organizations have appeared on the leaderboard throughout the benchmark's active period. Most top submissions use some form of agentic scaffolding rather than bare model inference, and many use tool-assisted workflows, multi-turn conversation loops, or model ensembles. The swebench.com leaderboard distinguishes between agent-based and model-only submissions.
SWE-bench Pro, released by Scale AI in September 2025, was designed to address the limitations that had accumulated in SWE-bench Verified by the time it was approaching saturation. The table below contrasts the two benchmarks across the dimensions most relevant for practitioners.
| Dimension | SWE-bench Verified | SWE-bench Pro |
|---|---|---|
| Release | August 13, 2024 | September 19, 2025 |
| Task count | 500 | 1,865 (731 public + 858 held-out + 276 commercial) |
| Languages | Python only | Python, Go, JavaScript, TypeScript |
| Average change size | ~14 lines, mostly single-file | 107.4 lines across 4.1 files on average |
| Minimum change size | Some tasks require 1-2 lines | At least 10 lines per task |
| Task source | 12 popular open-source Python repos | 41 repositories including copyleft and private commercial code |
| Contamination mitigation | Human filtering (not contamination-focused) | GPL licensing and proprietary code to resist data leakage |
| Typical top score | 80-94% (frontier models) | 23-46% (frontier models) |
| OpenAI recommendation (2026) | Retired | Endorsed |
| Best frontier score (pub.) | ~87-94% | ~46% |
The difficulty gap is stark. Models that resolve 80%+ of SWE-bench Verified tasks typically resolve 20-25% of SWE-bench Pro tasks. Claude Opus 4.1 dropped from high SWE-bench Verified performance to 23.1% on the SWE-bench Pro public split and 22.7% on the commercial subset; GPT-5 showed similar drops, scoring 23.3% on the Pro public split and 14.9% on the commercial subset. These gaps reflect both the genuine difficulty increase from multi-file, multi-hundred-line tasks and the contamination effect: models trained on or near SWE-bench Verified data gain no comparable advantage on a benchmark constructed to resist leakage.
The average SWE-bench Verified task changes roughly 14 lines of code; the average SWE-bench Pro task changes 107 lines across more than four files. In this respect, the Verified benchmark was most representative of small, targeted bug fixes -- the kind a developer might resolve in an afternoon -- while Pro tasks more closely resemble substantial feature work or deep refactoring requiring extended context management and multi-step reasoning about side effects.
During the period from August 2024 through early 2026, SWE-bench Verified functioned as the closest thing the AI coding field had to a universally accepted primary benchmark. Several factors drove this adoption:
Credibility from the source. OpenAI's involvement in creating the benchmark, and its use in OpenAI's own model announcements, gave SWE-bench Verified a prestige that research-community benchmarks without commercial backing rarely attain. When OpenAI used a number to characterize o1's coding capability, the industry had reason to take that methodology seriously.
Leaderboard infrastructure. The official swebench.com leaderboard was maintained by the original academic authors, accepted submissions from any organization, and distinguished between different submission types (agentless, agent, commercial). This neutral hosting reduced concerns about benchmark-owner advantage that affect proprietary leaderboards.
Real-world task origin. Unlike benchmarks constructed from synthetic problems, SWE-bench tasks derived from actual merged pull requests in widely deployed software. The benchmark therefore had face validity: resolving a real Django bug or a real scikit-learn issue meant something that synthetic code completion tasks did not.
Third-party tracking. Services including Epoch AI, llm-stats.com, benchlm.ai, and vals.ai tracked SWE-bench Verified scores alongside other benchmarks, enabling cross-model comparisons without requiring organizations to make primary submissions to the official leaderboard. This secondary ecosystem reinforced the benchmark's status as a standard.
Model release announcements. Anthropic, OpenAI, Google, DeepSeek, and a range of other organizations routinely cited SWE-bench Verified as a primary evidence point in model release documentation. The benchmark appeared in the technical reports for the Claude 3.5 Sonnet family, the o1 family, GPT-5 series models, and dozens of open-source model releases.
Commercial coding agent vendors including Devlo, TRAE, and others built marketing claims around SWE-bench Verified scores, sometimes competing to achieve state-of-the-art rankings as a form of product differentiation. This commercial competition drove rapid progress: the benchmark's top score moved from 33.2% in August 2024 to more than 80% within approximately one year, a pace that few benchmarks in machine learning history had matched.
Epoch AI incorporated SWE-bench Verified into its frontier capability tracking infrastructure alongside LiveCodeBench and Aider Polyglot, treating scores as one of several indicators of coding progress over time. The Epoch tracking contributed to visibility among researchers who monitored capability trajectories.
The 500-task verified set over-represents single-file, small-change bug fixes relative to the actual distribution of professional software work. A dataset where 85.8% of tasks involve a single file and 39% are solvable in under 15 minutes is not a representative sample of how engineers spend their time. Critics noted from early on that SWE-bench Verified measured a valuable but narrow slice of software engineering -- targeted, well-specified defect repair -- and should not be interpreted as a general assessment of coding agent capability.
Research comparing the single-file versus multi-file performance of top systems consistently showed a large gap: systems achieving 55-65% on overall SWE-bench Verified tasks resolved only around 20% of multi-file tasks. This gap suggested that the benchmark's headline numbers were substantially driven by the simplest tasks, and that headline progress masked limited improvement on the harder subset.
Five repositories -- Django, SymPy, Sphinx, Matplotlib, and scikit-learn -- accounted for more than 80% of the 500 tasks. Models extensively trained on these popular, heavily documented open-source projects were likely to have encountered the issues, discussions, and patches during pretraining, creating a persistent confound between benchmark performance and genuine generalization. The concentration also meant that organizations fine-tuning models on coding tasks could target SWE-bench Verified efficiently by focusing on these repositories.
Despite the 93-annotator review process, the SWE-bench team estimated a 5-10% residual error rate in the verified dataset from the start. Subsequent research found additional problems. A study using automated test coverage analysis (UTBoost) found that unit tests in 26 of the 500 tasks were insufficient to distinguish correct from incorrect solutions, meaning agents could pass these tests without actually fixing the issue. When leaderboard rankings were recalculated using fixed tests, 24% of agent rankings changed order, suggesting that the benchmark's rankings were partly measuring test coverage gaps rather than agent capability.
The most serious concern, and ultimately the one that led to the benchmark's formal retirement, was training data contamination. Because SWE-bench Verified drew from popular public repositories whose issue histories and pull request comments were extensively crawled for language model training, frontier models trained after the benchmark's construction had varying degrees of exposure to benchmark solutions.
OpenAI conducted a targeted audit of 138 SWE-bench Verified tasks that its o3 model did not consistently resolve, reviewing each with at least six experienced software engineers and re-verifying flagged cases with an additional team. Among the 138 audited tasks, 59.4% were found to contain material issues in their test design or problem description, rendering them extremely difficult or impossible to solve correctly even for expert human developers.
Separately, OpenAI demonstrated that all three frontier models tested -- GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash Preview -- could reproduce the original gold-patch solutions verbatim from memory when prompted with only the task identifier. This verbatim reproduction confirmed that benchmark solutions had entered model training data, making high scores uninterpretable as evidence of novel problem-solving.
The combination of these findings -- unsolvable residual tasks and widespread contamination -- led OpenAI to announce on February 23, 2026 that it would no longer report SWE-bench Verified scores and recommended that other developers do the same. The recommendation acknowledged that SWE-bench Verified had served an important role in the field but concluded that it could no longer provide reliable signal about frontier model capability.
A consistent finding across evaluation studies was that scaffolding choices contributed up to 20 percentage points to observed scores, independent of which base model was used. This meant that a benchmark result could reflect scaffold engineering quality as much as underlying model capability. Different submission categories (agentless vs. agent-based) showed substantially different score distributions even when using the same base model, complicating direct comparisons across submitters who made different scaffolding choices.
The v2.0.0 upgrade released on February 12, 2026 substantially increased observed scores across all models by expanding token budgets from relatively constrained limits to 2 million uncached read/write tokens and 20 million cached token reads per task. The upgrade also added support for third-party scaffolds including Claude Code and Codex as first-class evaluation options, and made improvements to containerized environments and tool management.
The effect on scores was significant enough that v2.0.0 results are treated as a separate series. Models that had reached approximately 80% under v1.x evaluation conditions showed scores in the 80-90%+ range under v2.0.0. This version discontinuity complicates longitudinal comparisons and explains why the timeline table above separates pre- and post-v2.0.0 results.
The upgrade also came in close proximity to OpenAI's February 23 retirement announcement, meaning that the v2.0.0 era of SWE-bench Verified was brief. Most of the scores recorded under the new harness were gathered in the roughly two weeks between the harness upgrade and the formal recommendation to move to SWE-bench Pro, although some submissions continued to appear afterward.
SWE-bench (the original full test set) remained available and continued to be used for evaluations requiring the full task distribution. For any given system, its scores are systematically lower than SWE-bench Verified scores because it retains the harder and noisier tasks.
SWE-bench Lite (300 tasks) was the principal fast-evaluation target before SWE-bench Verified replaced it for full evaluations. Lite was a random sample rather than a quality-filtered one, and top scores on Lite tracked roughly 5-10 percentage points below Verified for comparable systems.
LiveCodeBench evaluates models on competitive programming problems continuously sourced from new contest problems, providing a contamination-resistant alternative. It measures a different capability slice (algorithmic problem solving on well-specified contest tasks rather than pragmatic bug repair in real codebases) and is typically used alongside SWE-bench variants rather than as a substitute.
Aider Polyglot evaluates multi-language code editing capability across a curated set of editing tasks, offering cross-language coverage that SWE-bench Verified's Python-only scope does not provide.
SWE-bench Pro, as described above, is the endorsed successor for frontier evaluations as of 2026, offering harder tasks, multi-language coverage, commercial codebase inclusion, and substantially reduced contamination exposure.
SWE-rebench is an independent re-evaluation platform that runs submitted solutions against stricter testing conditions than the official leaderboard to audit claimed scores. It identified cases where official scores could not be replicated and contributed to broader awareness of evaluation methodology concerns.