SWE-bench Verified is a human-validated subset of 500 software engineering tasks drawn from the original SWE-bench dataset, released by OpenAI on August 13, 2024, in collaboration with the Princeton NLP researchers who created SWE-bench. The benchmark was designed to address a fundamental problem with its predecessor: a substantial fraction of the original 2,294 test tasks were ambiguous, impossible to complete with the information provided, or invalidated by flaky unit tests. By screening the full task set through a rigorous human annotation campaign and excluding the flawed items, OpenAI produced a cleaner 500-task set that more reliably measured whether AI systems could resolve real GitHub issues in popular open-source Python repositories.
From its August 2024 debut through 2025, SWE-bench Verified became the dominant standard for evaluating agentic coding systems. Every major AI laboratory reported scores on it when announcing frontier models, and a dense commercial ecosystem of agents, scaffolds, and evaluation services grew up around it. By early 2026, however, OpenAI's own audits found that frontier models were reproducing benchmark solutions verbatim from training data and that the majority of the remaining unsolved tasks contained structural flaws the original human review had not caught. On February 23, 2026, OpenAI formally recommended against further use of the benchmark and endorsed SWE-bench Pro as the preferred successor for evaluating frontier coding capability.
The original SWE-bench, published at ICLR 2024 by Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan at Princeton University, assembled 2,294 real GitHub issue-and-pull-request pairs from twelve popular open-source Python repositories: django/django (850 tasks), sympy/sympy (386), scikit-learn/scikit-learn (229), sphinx-doc/sphinx (187), matplotlib/matplotlib (184), pytest-dev/pytest (119), pydata/xarray (110), astropy/astropy (95), pylint-dev/pylint (57), psf/requests (44), mwaskom/seaborn (22), and pallets/flask (11). Each task presented a model with a frozen repository state, an issue description, and a requirement to produce a patch that caused previously failing unit tests to pass while leaving already-passing tests unaffected.
Early evaluations revealed that the original dataset carried substantial noise. Many issue descriptions were written informally, omitting key details that any human contributor would have looked up in comments or linked pull requests. Some test patches checked for specific implementation choices rather than behavioral outcomes, meaning that an alternative correct solution would fail evaluation even if it solved the underlying bug. Others depended on non-deterministic behavior, network calls that could not be reproduced in sandboxed environments, or configuration states that differed between the annotation environment and the evaluation harness. The consequence was that a significant share of the 2,294 tasks functioned as a measurement of whether a model happened to use the same solution strategy as the original author rather than whether it could solve the problem in principle.
Benchmark noise matters most when the underlying performance is low: a 10% solve rate on a benchmark with 20% unusable tasks is qualitatively different from a 10% rate on a clean benchmark. In 2023, SWE-bench served its purpose as a difficult, aspirational target. By mid-2024, agents and scaffolding systems were pushing toward 15-20% resolution rates on the full test set, and the proportion of spurious failures was becoming a genuine confound. OpenAI concluded that a curated subset would produce more interpretable measurements of progress.
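To make the arithmetic concrete, the short illustration below uses hypothetical numbers chosen to match the figures above (it is not a calculation from any OpenAI report): an unusable-task fraction both caps the achievable score and changes how a low observed rate should be read.

```python
def adjusted_solve_rate(observed_rate: float, unusable_fraction: float) -> float:
    """Solve rate restricted to the tasks that are actually solvable.

    observed_rate     -- fraction of ALL tasks resolved (e.g. 0.10)
    unusable_fraction -- fraction of tasks that are broken or unsolvable (e.g. 0.20)
    """
    solvable_fraction = 1.0 - unusable_fraction
    return observed_rate / solvable_fraction


# A 10% observed rate on a benchmark where 20% of tasks are unusable
# corresponds to solving 12.5% of the genuinely solvable tasks...
print(round(adjusted_solve_rate(0.10, 0.20), 3))  # 0.125

# ...and no agent can ever score above the 80% ceiling set by the noise.
print(1.0 - 0.20)  # 0.8
```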
OpenAI announced SWE-bench Verified on August 13, 2024, through its preparedness blog. The project was built in close collaboration with the Princeton NLP team that authored the original benchmark; the final verified dataset was hosted at swebench.com alongside SWE-bench Lite and the full SWE-bench test set and became the primary leaderboard on the official project site.
The release coincided with GPT-4o's evaluation on the new subset: using the best-performing scaffold available at the time (Agentless), GPT-4o resolved 33.2% of the 500 verified tasks. That figure was roughly double GPT-4o's resolution rate on the original unfiltered test set (approximately 16%), illustrating the effect of removing unsolvable tasks. OpenAI presented this doubling not as evidence that GPT-4o had become dramatically better but as evidence that the original benchmark had been systematically underreporting model capability due to task noise.
The announcement was accompanied by the publication of the SWE-bench Verified dataset on Hugging Face (SWE-bench/SWE-bench_Verified) and the addition of a dedicated verified leaderboard at swebench.com. The dataset contained 500 rows in Parquet format covering instances from the same twelve repositories as the original benchmark, with each instance including the repository identifier, the base commit hash, the gold patch, the test patch, the problem statement, optional hint text from issue comments, the FAIL_TO_PASS test list, the PASS_TO_PASS test list, and a difficulty label. The dataset's issue dates ranged from January 2013 to August 2023, reflecting the full temporal span of the repositories from which it was drawn.
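For readers who want to inspect the data directly, the sketch below loads the published dataset with the Hugging Face `datasets` library. The dataset identifier and field names are taken from the description above; check the dataset card for the exact schema before depending on specific column names.

```python
from datasets import load_dataset

# Load the 500-instance verified set from the Hugging Face Hub.
ds = load_dataset("SWE-bench/SWE-bench_Verified", split="test")
print(len(ds))  # expected: 500

# Peek at the fields of a single instance (names per the description above;
# confirm against the dataset card, as exact column names may differ).
example = ds[0]
for field in ("repo", "base_commit", "problem_statement", "patch",
              "test_patch", "FAIL_TO_PASS", "PASS_TO_PASS"):
    if field in example:
        print(f"{field}: {str(example[field])[:80]}")
```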
In its first weeks, SWE-bench Verified attracted rapid uptake from the research and commercial communities. Within days of the release, scaffolding frameworks that had been targeting the full SWE-bench began reorienting their primary leaderboard submissions toward the Verified subset. The benchmark's cleaner signal and the credibility of OpenAI's methodological investment made it the natural target for teams seeking to demonstrate state-of-the-art performance.
OpenAI recruited 93 software developers with professional Python experience to serve as human annotators. Annotators did not redesign or fix tasks; they evaluated each task against a structured rubric and assigned severity labels indicating how severely any identified problems impaired solvability.
The annotation addressed three dimensions for each task:
Issue description clarity. Annotators assessed whether the problem statement provided enough information to identify what needed to be changed. Underspecified descriptions that required consulting external links, patch history, or implicit community knowledge were flagged.
Test patch validity. Annotators checked whether the FAIL_TO_PASS tests reliably measured the stated behavior. Tests that required a specific implementation pattern (checking function names, checking internal data structure layout) rather than observable behavior were identified as narrow. Tests that covered additional functionality not mentioned in the issue were identified as wide.
Overall solvability. Annotators gave a holistic judgment about whether a skilled developer, given only the repository and the issue description as presented, could produce a correct patch.
Each of the 1,699 tasks randomly sampled from the original 2,294-task test set was independently reviewed by three annotators. The severity labels were ensembled conservatively: the highest-severity label among the three annotators was taken as the final label, so a single annotator flagging a serious problem was enough to disqualify a task. Tasks whose ensembled label exceeded a threshold on either the problem statement or the FAIL_TO_PASS tests were excluded. This process yielded the final 500-task verified set.
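The conservative ensembling rule can be sketched in a few lines. The numeric severity scale, field names, and threshold below are placeholders for illustration only; the actual rubric and cutoff values are those used in OpenAI's annotation campaign.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    # Placeholder ordinal scales: 0 = no problem ... 3 = task unusable.
    statement_severity: int   # how underspecified the problem statement is
    test_severity: int        # how likely the FAIL_TO_PASS tests reject valid fixes


def keep_task(reviews: list[Annotation], threshold: int = 1) -> bool:
    """Conservative ensembling: the worst (max) label across the annotators
    becomes the final label; the task is excluded if that label exceeds the
    threshold on either dimension."""
    worst_statement = max(r.statement_severity for r in reviews)
    worst_tests = max(r.test_severity for r in reviews)
    return worst_statement <= threshold and worst_tests <= threshold


# One of three annotators flags the tests as unusable -> task is dropped.
reviews = [Annotation(0, 0), Annotation(1, 0), Annotation(0, 3)]
print(keep_task(reviews))  # False
```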
The resulting dataset still carried an estimated 5-10% residual error rate, a figure the SWE-bench team acknowledged at release. Human annotation at scale is imperfect, and the annotation rubric could not anticipate every form of task degradation.
At release, the evaluation harness used a containerized Linux environment per task, with network access disabled and the git history trimmed to the commit just before the issue was resolved. Models received the repository, the issue description, and optionally hint text drawn from issue comments. Solutions were applied as patches and scored by running the FAIL_TO_PASS tests (which should now pass) and the PASS_TO_PASS tests (which should remain passing). The primary metric was the fraction of tasks resolved, reported as a single percentage.
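The resolution criterion reduces to a simple rule, sketched below in simplified form. This is not the actual harness code (the real harness builds a per-task container, applies the patch, and parses test-runner logs); it only illustrates how the FAIL_TO_PASS and PASS_TO_PASS lists combine into a single resolved/not-resolved outcome.

```python
import subprocess


def tests_pass(test_ids: list[str]) -> bool:
    """Run the named tests in the patched checkout; True iff all of them pass.

    Simplified stand-in for the containerized execution used by the harness.
    """
    result = subprocess.run(["python", "-m", "pytest", "-q", *test_ids])
    return result.returncode == 0


def task_resolved(fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    # Resolved only if the previously failing tests now pass AND the
    # previously passing tests were not broken by the patch.
    return tests_pass(fail_to_pass) and tests_pass(pass_to_pass)


def resolution_rate(outcomes: list[bool]) -> float:
    # Primary metric: fraction of tasks resolved.
    return sum(outcomes) / len(outcomes)
```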
Evaluation tooling evolved substantially over the following 18 months. A major infrastructure revision released on February 12, 2026 (version 2.0.0) upgraded the scaffolding, containerized environments, and token limits. The new harness provided models with bash execution, a text editor, and patch application tools, and raised token budgets to 2 million uncached read/write tokens and 20 million cached token reads. These changes produced meaningfully higher scores across most models, and the SWE-bench team explicitly noted that results from version 1.x and version 2.x are not directly comparable.
The 500 verified tasks skew considerably toward simpler fixes. Approximately 39% of tasks are classified as trivial (resolvable in under 15 minutes by an experienced developer), 52% as small (15 minutes to 1 hour), and roughly 8-9% as hard (more than 1 hour). The median patch modifies around 14 lines of code, and trivial tasks average 5 lines. Approximately 85.8% of the verified tasks require changes to a single file, compared to 75.1% of the full test set and 50.3% of the training split. This concentration on single-file, smaller fixes reflects both the annotation team's filtering decisions and the fact that the original dataset's simpler tasks were more likely to have clear, unambiguous issue descriptions and therefore to survive the filter.
The table below summarizes the principal structural differences between SWE-bench Verified and the full SWE-bench test set.
| Dimension | SWE-bench (full test) | SWE-bench Verified |
|---|---|---|
| Task count | 2,294 | 500 |
| Human validation | None | 93 annotators, 3 reviews per task |
| Single-file issues | 75.1% | 85.8% |
| Multi-file issues | 24.9% | 14.2% |
| Estimated task error rate | Not assessed | 5-10% |
| Primary language | Python | Python |
| Repositories | 12 | Same 12 (subset of instances) |
| Official leaderboard | swebench.com/original | swebench.com/verified |
| GPT-4o baseline (Aug 2024) | ~16% | 33.2% |
The reduction from 2,294 to 500 tasks was not a random sample. Harder, more ambiguous, and multi-file tasks were disproportionately filtered out, which is why the verified set's single-file proportion is higher and why baseline scores approximately doubled. This selection effect has implications for interpreting high scores: a model that resolves 80% of verified tasks is resolving tasks that, on average, are simpler and better-specified than those in the full benchmark.
SWE-bench Lite, a 300-task random sample released by the Princeton team for fast and cheap evaluation, sits between the full benchmark and the Verified set in terms of accessibility. It was the primary fast-evaluation target prior to the Verified release, but it was sampled for size rather than filtered for correctness.
The following table documents the historical progression of the state-of-the-art score on SWE-bench Verified from its August 2024 release through May 2026. Scores from before and after the February 2026 v2.0.0 harness upgrade are not directly comparable, but both are included for historical context.
| Period | Top System / Model | Score | Notes |
|---|---|---|---|
| August 2024 | GPT-4o + Agentless | 33.2% | Launch baseline, announced in OpenAI blog |
| October 2024 | Claude 3.5 Sonnet (new) | 49.0% | Anthropic-reported; scaffold-assisted |
| Late 2024 | OpenAI o1 | 45-48% | o1-preview range; new reasoning approach |
| December 2024 | Various agent systems | ~55-62% | Competitive agent approaches; best ~62% |
| January 2025 | o1 + scaffold | 64.6% | Held SotA before multi-model ensembles |
| January-April 2025 | OpenAI o3 | ~72% | Announced with o3 model release |
| April-May 2025 | Multi-model ensembles | ~70-72% | Claude 3.7 Sonnet + o3 + Gemini 2.5 Pro combos |
| June 2025 | Various systems | ~74-75% | Incremental agent improvements |
| July-September 2025 | Claude Opus 4.5 | 80.9% | First to exceed 80%; new state-of-the-art |
| October-November 2025 | Claude Opus 4.6 | 80.8% | Near-identical to 4.5 |
| December 2025 - Jan 2026 | Various | 79-81% | Plateau; multiple models clustered near 80% |
| February 2026 | Claude Opus 4.7 | 87.6% | Post-v2.0.0 harness (higher token budget) |
| February 2026 | GPT-5.3 Codex | 85.0% | OpenAI submission |
| May 2026 | Claude Mythos Preview | 93.9% | Restricted preview model (Anthropic) |
The plateau around 80% in late 2025 was a significant inflection point. Progress had slowed from roughly 3-5 percentage points per month during the competitive 2024-2025 period to less than 1 percentage point over the final six months of 2025. OpenAI's audit found that the remaining failures were dominated by tasks with structural flaws rather than tasks genuinely challenging for frontier models, suggesting the benchmark had effectively saturated for practical purposes before the February 2026 harness upgrade temporarily produced higher absolute scores.
The leaderboard snapshot below lists the top reported scores on SWE-bench Verified through May 2026, combining pre- and post-v2.0.0 results.

| Rank | Model | Organization | Score |
|---|---|---|---|
| 1 | Claude Mythos Preview | Anthropic | 93.9% |
| 2 | Claude Opus 4.7 | Anthropic | 87.6% |
| 3 | GPT-5.3 Codex | OpenAI | 85.0% |
| 4 | Claude Opus 4.5 | Anthropic | 80.9% |
| 5 | Claude Opus 4.6 | Anthropic | 80.8% |
| 6 | Gemini 3.1 Pro | Google | 80.6% |
| 6 | DeepSeek-V4-Pro-Max | DeepSeek | 80.6% |
| 8 | MiniMax M2.5 | MiniMax | 80.2% |
| 8 | Kimi K2.6 | Moonshot AI | 80.2% |
| 10 | GPT-5.2 | OpenAI | 80.0% |
A wide range of organizations have appeared on the leaderboard throughout the benchmark's active period. Most top submissions use some form of agentic scaffolding rather than bare model inference, and many use tool-assisted workflows, multi-turn conversation loops, or model ensembles. The swebench.com leaderboard distinguishes between agent-based and model-only submissions.
SWE-bench Pro, released by Scale AI in September 2025, was designed to address the limitations that had accumulated in SWE-bench Verified by the time it was approaching saturation. The table below contrasts the two benchmarks across the dimensions most relevant for practitioners.
| Dimension | SWE-bench Verified | SWE-bench Pro |
|---|---|---|
| Release | August 13, 2024 | September 19, 2025 |
| Task count | 500 | 1,865 (731 public + 858 held-out + 276 commercial) |
| Languages | Python only | Python, Go, JavaScript, TypeScript |
| Average change size | ~14 lines, mostly single-file | 107.4 lines across 4.1 files on average |
| Minimum change size | Some tasks require 1-2 lines | At least 10 lines per task |
| Task source | 12 popular open-source Python repos | 41 repositories including copyleft and private commercial code |
| Contamination mitigation | Human filtering (not contamination-focused) | GPL licensing and proprietary code to resist data leakage |
| Typical top score | 80-94% (frontier models) | 23-46% (frontier models) |
| OpenAI recommendation (2026) | Retired | Endorsed |
| Best frontier score (pub.) | ~87-94% | ~46% |
The difficulty gap is stark. Models that resolve 80%+ of SWE-bench Verified tasks typically resolve 20-25% of SWE-bench Pro tasks. Claude Opus 4.1 dropped from high SWE-bench Verified performance to 23.1% on the SWE-bench Pro public split and 22.7% on the commercial subset; GPT-5 showed similar drops, scoring 23.3% on the Pro public split and 14.9% on the commercial subset. These gaps reflect both the genuine difficulty increase from multi-file, multi-hundred-line tasks and the contamination effect: models trained on or near SWE-bench Verified data gain no comparable advantage on a benchmark constructed to resist leakage.
The average SWE-bench Verified task changes roughly 14 lines of code; the average SWE-bench Pro task changes 107 lines across more than four files. In this respect, the Verified benchmark was most representative of small, targeted bug fixes -- the kind a developer might resolve in an afternoon -- while Pro tasks more closely resemble substantial feature work or deep refactoring requiring extended context management and multi-step reasoning about side effects.
During the period from August 2024 through early 2026, SWE-bench Verified functioned as the closest thing the AI coding field had to a universally accepted primary benchmark. Several factors drove this adoption:
Credibility from the source. OpenAI's involvement in creating the benchmark, and its use in OpenAI's own model announcements, gave SWE-bench Verified a prestige that research-community benchmarks without commercial backing rarely attain. When OpenAI used a number to characterize o1's coding capability, the industry had reason to take that methodology seriously.
Leaderboard infrastructure. The official swebench.com leaderboard was maintained by the original academic authors, accepted submissions from any organization, and distinguished between different submission types (agentless, agent, commercial). This neutral hosting reduced concerns about benchmark-owner advantage that affect proprietary leaderboards.
Real-world task origin. Unlike benchmarks constructed from synthetic problems, SWE-bench tasks derived from actual merged pull requests in widely deployed software. The benchmark therefore had face validity: resolving a real Django bug or a real scikit-learn issue meant something that synthetic code completion tasks did not.
Third-party tracking. Services including Epoch AI, llm-stats.com, benchlm.ai, and vals.ai tracked SWE-bench Verified scores alongside other benchmarks, enabling cross-model comparisons without requiring organizations to make primary submissions to the official leaderboard. This secondary ecosystem reinforced the benchmark's status as a standard.
Model release announcements. Anthropic, OpenAI, Google, DeepSeek, and a range of other organizations routinely cited SWE-bench Verified as a primary evidence point in model release documentation. The benchmark appeared in the technical reports for the Claude 3.5 Sonnet family, the o1 family, GPT-5 series models, and dozens of open-source model releases.
Commercial coding agent vendors including Devlo, TRAE, and others built marketing claims around SWE-bench Verified scores, sometimes competing to achieve state-of-the-art rankings as a form of product differentiation. This commercial competition drove rapid progress: the benchmark's top score moved from 33.2% in August 2024 to more than 80% within approximately one year, a pace that few benchmarks in machine learning history had matched.
Epoch AI incorporated SWE-bench Verified into its frontier capability tracking infrastructure alongside LiveCodeBench and Aider Polyglot, treating scores as one of several indicators of coding progress over time. The Epoch tracking contributed to visibility among researchers who monitored capability trajectories.
The 500-task verified set over-represents single-file, small-change bug fixes relative to the actual distribution of professional software work. A dataset where 85.8% of tasks involve a single file and 39% are solvable in under 15 minutes is not a representative sample of how engineers spend their time. Critics noted from early on that SWE-bench Verified measured a valuable but narrow slice of software engineering -- targeted, well-specified defect repair -- and should not be interpreted as a general assessment of coding agent capability.
Research comparing the single-file versus multi-file performance of top systems consistently showed a large gap: systems achieving 55-65% on overall SWE-bench Verified tasks resolved only around 20% of multi-file tasks. This gap suggested that the benchmark's headline numbers were substantially driven by the simplest tasks, and that headline progress masked limited improvement on the harder subset.
Five repositories -- Django, SymPy, Sphinx, Matplotlib, and scikit-learn -- accounted for more than 80% of the 500 tasks. Models extensively trained on these popular, heavily documented open-source projects were likely to have encountered the issues, discussions, and patches during pretraining, creating a persistent confound between benchmark performance and genuine generalization. The concentration also meant that organizations fine-tuning models on coding tasks could target SWE-bench Verified efficiently by focusing on these repositories.
Despite the 93-annotator review process, the SWE-bench team estimated a 5-10% residual error rate in the verified dataset from the start. Subsequent research found additional problems. A study using automated test coverage analysis (UTBoost) found that unit tests in 26 of the 500 tasks were insufficient to distinguish correct from incorrect solutions, meaning agents could pass these tests without actually fixing the issue. When leaderboard rankings were recalculated using fixed tests, 24% of agent rankings changed order, suggesting that the benchmark's rankings were partly measuring test coverage gaps rather than agent capability.
The most serious concern, and ultimately the one that led to the benchmark's formal retirement, was training data contamination. Because SWE-bench Verified drew from popular public repositories whose issue histories and pull request comments were extensively crawled for language model training, frontier models trained after the benchmark's construction had varying degrees of exposure to benchmark solutions.
OpenAI conducted a targeted audit of 138 SWE-bench Verified tasks that its o3 model did not consistently resolve, reviewing each with at least six experienced software engineers and re-verifying flagged cases with an additional team. Among the 138 audited tasks, 59.4% were found to contain material issues in their test design or problem description, rendering them extremely difficult or impossible to solve correctly even for expert human developers.
Separately, OpenAI demonstrated that all three frontier models tested -- GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash Preview -- could reproduce the original gold-patch solutions verbatim from memory when prompted with only the task identifier. This verbatim reproduction confirmed that benchmark solutions had entered model training data, making high scores uninterpretable as evidence of novel problem-solving.
The combination of these findings -- unsolvable residual tasks and widespread contamination -- led OpenAI to announce on February 23, 2026 that it would no longer report SWE-bench Verified scores and recommended that other developers do the same. The recommendation acknowledged that SWE-bench Verified had served an important role in the field but concluded that it could no longer provide reliable signal about frontier model capability.
A consistent finding across evaluation studies was that scaffolding choices contributed up to 20 percentage points to observed scores, independent of which base model was used. This meant that a benchmark result could reflect scaffold engineering quality as much as underlying model capability. Different submission categories (agentless vs. agent-based) showed substantially different score distributions even when using the same base model, complicating direct comparisons across submitters who made different scaffolding choices.
The v2.0.0 upgrade released on February 12, 2026 substantially increased observed scores across all models by expanding token budgets from relatively constrained limits to 2 million uncached read/write tokens and 20 million cached token reads per task. The upgrade also added support for third-party scaffolds including Claude Code and Codex as first-class evaluation options, and made improvements to containerized environments and tool management.
The effect on scores was significant enough that v2.0.0 results are treated as a separate series. Models that had reached approximately 80% under v1.x evaluation conditions showed scores in the 80-90%+ range under v2.0.0. This version discontinuity complicates longitudinal comparisons and explains why the timeline table above separates pre- and post-v2.0.0 results.
The upgrade also came in close proximity to OpenAI's February 23 retirement announcement, meaning that the v2.0.0 era of SWE-bench Verified was brief. Most of the scores recorded under the new harness were gathered in the roughly two weeks between the harness upgrade and the formal recommendation to move to SWE-bench Pro, although some submissions continued to appear afterward.
SWE-bench (the original full test set) remained available and continued to be used for evaluations requiring the full task distribution. For any given system, its scores are systematically lower than SWE-bench Verified scores because it retains the harder and noisier tasks.
SWE-bench Lite (300 tasks) was the principal fast-evaluation target before SWE-bench Verified replaced it for full evaluations. Lite was a random sample rather than a quality-filtered one, and top scores on Lite tracked roughly 5-10 percentage points below Verified for comparable systems.
LiveCodeBench evaluates models on competitive programming problems continuously sourced from new contest problems, providing a contamination-resistant alternative. It measures a different capability slice (algorithmic problem solving on well-specified contest tasks rather than pragmatic bug repair in real codebases) and is typically used alongside SWE-bench variants rather than as a substitute.
Aider Polyglot evaluates multi-language code editing capability across a curated set of editing tasks, offering cross-language coverage that SWE-bench Verified's Python-only scope does not provide.
SWE-bench Pro, as described above, is the endorsed successor for frontier evaluations as of 2026, offering harder tasks, multi-language coverage, commercial codebase inclusion, and substantially reduced contamination exposure.
SWE-rebench is an independent re-evaluation platform that runs submitted solutions against stricter testing conditions than the official leaderboard to audit claimed scores. It identified cases where official scores could not be replicated and contributed to broader awareness of evaluation methodology concerns.