SWE-bench Verified

Updated Jun 21, 2026

SWE-bench Verified
Overview
Full name	Software Engineering Benchmark, Verified subset
Abbreviation	SWE-bench Verified
Description	A 500-task, human-validated subset of SWE-bench for evaluating AI agents on real-world GitHub issue resolution
Release date	2024-08-13
Latest version	1.0
Authors	OpenAI Preparedness team in collaboration with the Princeton SWE-bench team
Organization	OpenAI, Princeton University
Technical Details
Type	Code generation, bug fixing, software engineering
Modality	Text, code
Task format	GitHub issue resolution with unit-test grading
Number of tasks	500 (filtered from 1,699 candidates drawn from the original 2,294-instance SWE-bench)
Repositories	12 popular Python projects (Django, SymPy, scikit-learn, Sphinx, Matplotlib, pytest, xarray, astropy, pylint, requests, seaborn, Flask)
Evaluation metric	Resolve rate (% Resolved); FAIL_TO_PASS and PASS_TO_PASS test grading
Languages	Python
Performance
Initial baseline (Aug 2024)	33.2% (GPT-4o + Agentless)
Public SOTA	88.6% (Claude Opus 4.8, May 2026)
Internal SOTA reported	93.9% (Claude Mythos, Anthropic, 2026)
Status	Deprecated by OpenAI for frontier evaluation on February 23, 2026
Resources
Website	Official page
Announcement	OpenAI blog
Deprecation post	OpenAI blog (Feb 2026)
Dataset	Hugging Face
Annotation guide	SWE-bench Annotation Instructions (PDF)
License	MIT
Predecessor	SWE-bench

SWE-bench Verified is a 500-problem, human-validated subset of the SWE-bench software engineering benchmark, released on August 13, 2024 by OpenAI's Preparedness team together with the original Princeton SWE-bench authors. Each task asks an AI agent to read a real GitHub issue from one of 12 popular open-source Python repositories, edit the codebase, and produce a patch that is graded automatically against hidden unit tests. What sets Verified apart from the parent benchmark is that 93 contracted software developers individually reviewed every candidate task to confirm that the problem statement is unambiguous, the unit tests fairly grade a correct solution, and the fix is achievable in the harness time budget.^[1] From late 2024 through early 2026 it was the single most-cited coding benchmark in frontier model launches, until OpenAI itself deprecated it for frontier evaluation on February 23, 2026 over test flaws and training-data contamination.^[2]^[3]

Where the original SWE-bench's 2,294 instances were noisy enough that small score differences were hard to interpret, Verified produced a clean leaderboard signal that top labs raced to climb. The first published number was 33.2% for GPT-4o paired with the Agentless scaffold in August 2024, which roughly doubled Agentless's 16% on the full benchmark once the noisiest tasks were removed.^[1] The public state-of-the-art then rose to 49% by October 2024, past 80% in November 2025, and to 88.6% for Claude Opus 4.8 in May 2026, with Anthropic's internal Claude Mythos posting 93.9%, prompting widespread agreement that Verified had saturated.^[3]^[4]^[29] OpenAI's deprecation post argued that gains in this range were no longer meaningful: more than 59% of the hardest unsolved tasks in their audit had broken or unfair tests, and every frontier model tested could reproduce verbatim portions of the dataset.^[3]

Despite the deprecation, SWE-bench Verified remains widely used as a sanity check, an instructional benchmark, and a reference point against which newer evaluations like SWE-bench Pro, SWE-Lancer, and SWE-bench Multimodal are calibrated. Its design choices, especially the 93-annotator review process and the FAIL_TO_PASS / PASS_TO_PASS grading scheme, set the template that successor benchmarks have refined rather than replaced.^[1]^[3]

What is SWE-bench Verified used for?

SWE-bench Verified shares its core mechanics with the original SWE-bench: each task pairs a real GitHub issue with the codebase state immediately before the human-written fix was merged, plus a set of unit tests that distinguish the buggy state from the fixed state. An AI agent is given the issue text and the repository, must produce a code patch, and is graded on whether the patch flips the failing tests to passing without breaking the previously passing tests. What distinguishes Verified is the curation layer on top.

The core differences from SWE-bench are quality, scale, and intended use. The original benchmark's test set comprises 2,294 issue-PR pairs mined from 12 repositories. Verified shrinks the set to 500 instances that experienced human annotators confirmed are well-specified, fairly tested, and solvable. The result is a benchmark where a passing patch can be confidently interpreted as a real fix rather than a coincidence with a thin or buggy test suite.^[1]

Verified was released alongside the OpenAI Preparedness Framework's broader effort to measure dangerous autonomous capability in frontier models. The Preparedness team viewed software engineering ability as a leading indicator of model autonomy, and Verified was created so that the autonomy score would not be polluted by ambiguous tasks. The benchmark therefore plays a dual role in the AI safety and capabilities literature: it is both a standard product comparison and an input to OpenAI's risk assessments.^[1]^[2]

Origin and motivation

What problem did the original SWE-bench have?

When SWE-bench launched in October 2023, the headline result was that even the strongest available model, Claude 2 with BM25 retrieval, resolved only 1.96% of tasks. Through 2024, agent scaffolds like SWE-agent and commercial products like Devin drove that number above 13%, but the leaderboard was already showing strange behavior. Some tasks were trivially passable through patterns that did not actually fix the underlying bug. Others were essentially impossible because the test suite checked for behavior the issue had never specified.

A later analysis by the SWE-bench+ team at OpenLM.ai estimated that roughly 32.67% of "successful" patches on the original benchmark involved solution leakage in the issue text or comments, and that approximately 60% of resolved instances showed some form of leakage when broader cues were considered. After filtering, the SWE-agent + GPT-4 score dropped from 12.47% to 3.97%, a roughly threefold reduction. Independent reviewers had also flagged tests that rejected functionally correct solutions because they checked for the exact wording or formatting used in the gold patch.^[1]^[7]

OpenAI's Preparedness team needed a more reliable measurement to feed into its Model Autonomy risk evaluations. The team also wanted a benchmark whose score could be cited in product launches without each lab having to caveat that the underlying tasks were noisy. Both goals pointed in the same direction: human-validate a subset of SWE-bench and publish a clean signal.

Who created SWE-bench Verified and when?

In early 2024, OpenAI Preparedness coordinated with the original SWE-bench authors at Princeton (Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan) on a curation effort that shipped on August 13, 2024. The goals were stated plainly in OpenAI's announcement, which described the three things human annotators were asked to ensure: "sample descriptions are well-specified and not too underspecified or otherwise unfair; ... unit tests correctly cover the intended solution; ... development environments can be reliably set up."^[1] The same post summarized the curation result, that "we worked with 93 software developers experienced in Python to manually screen SWE-bench samples for quality, and created SWE-bench Verified, a subset of the original test set ... consisting of 500 samples verified to be non-problematic."^[1]

Verified inherits the upstream benchmark's MIT license and uses the same Docker harness and grading scheme, so existing SWE-bench tooling continues to work. The two teams also coordinated on the official princeton-nlp/SWE-bench_Verified Hugging Face dataset and on hosting a separate Verified leaderboard at swebench.com.^[1]^[2]

Methodology

Annotator pool

OpenAI contracted 93 software developers experienced in Python to perform the annotation. Each annotator had to demonstrate working familiarity with the relevant repository ecosystems (web frameworks, scientific computing, data visualization). The pool was drawn from a vendor that supplied technical contractors to OpenAI for related Preparedness evaluations, and annotators were paid for their time rather than incentivized by quality bonuses, which the team argued kept ratings calibrated.^[1]^[2]

Sampling and triple review

From the 2,294-instance SWE-bench test set, OpenAI selected 1,699 random samples to label. Every sample was reviewed by three independent annotators using a structured rubric. The triple-review design was meant to surface disagreement on borderline cases without leaning on any single reviewer's judgment.^[1]^[2]

Annotation rubric

Annotators rated each task along four axes. The first three were scored on a 0 to 3 severity scale, where 0 or 1 marks a minor concern and 2 or 3 means the sample should be discarded; the fourth was a flag for any other blocker:^[1]^[8]

Axis	Question being answered	Discard condition
Problem clarity	Is the GitHub issue well specified, with enough information for a developer to know what behavior is expected?	Severity 2 or 3 from any reviewer
Test fairness	Do the FAIL_TO_PASS and PASS_TO_PASS tests check the right thing without rejecting valid alternative solutions?	Severity 2 or 3 from any reviewer
Difficulty plausibility	Could an experienced developer realistically resolve the task within the time budget the harness allows?	Severity 2 or 3 from any reviewer
Other major issues	Any further blocker (broken environment, dependency drift, ambiguous spec) that would invalidate evaluation	Any reviewer flag
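
The discard conditions in the table can be read as a single predicate over the three reviews of a task. The sketch below is a minimal illustration under that reading; the `Review` class, its field names, and `should_discard` are hypothetical names introduced here, not part of OpenAI's annotation tooling.

```python
from dataclasses import dataclass

# Severity at or above this value means "discard" under the rubric as described.
DISCARD_THRESHOLD = 2


@dataclass(frozen=True)
class Review:
    """One annotator's ratings for one task (field names are hypothetical)."""
    problem_clarity: int     # 0-3 severity
    test_fairness: int       # 0-3 severity
    difficulty: int          # 0-3 severity
    other_major_issue: bool  # flag for any further blocker


def should_discard(reviews: list[Review]) -> bool:
    """Return True if any reviewer rates any axis 2 or 3, or raises the blocker flag."""
    for review in reviews:
        severities = (review.problem_clarity, review.test_fairness, review.difficulty)
        if review.other_major_issue or max(severities) >= DISCARD_THRESHOLD:
            return True
    return False


# Three independent reviews of one hypothetical task.
reviews = [
    Review(0, 1, 0, False),
    Review(1, 2, 0, False),  # this reviewer rates test fairness at severity 2
    Review(0, 0, 1, False),
]
print(should_discard(reviews))
```

Because a single severe rating is enough to discard, the rule is deliberately conservative: it trades dataset size for confidence in the tasks that remain.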

Alongside the discard rubric, annotators estimated how long an experienced developer would need to decide on and implement the fix, given the cleaned issue text. Those estimates feed the difficulty distribution discussed below.^[1]^[2]

Filtering outcomes

The annotation pass produced striking quality numbers. About 38.3% of the 1,699 reviewed candidates were flagged for underspecified problem statements, and 61.1% were flagged for unit tests that may unfairly mark valid solutions as incorrect. The two categories overlap, and in total roughly 68.3% of the reviewed samples were judged inadequate and removed. That leaves roughly 539 candidates by arithmetic; the released SWE-bench Verified set contains 500 of them.^[1]^[3]^[7]

Filtering stage	Count	Notes
Original SWE-bench test set	2,294	Source for the candidate pool
Random sample for human review	1,699	Each reviewed by three annotators
Tasks failing problem-clarity rubric	~651 (38.3%)	At least one reviewer flagged severity 2 or 3
Tasks failing test-fairness rubric	~1,038 (61.1%)	Categories overlap; many tasks failed multiple criteria
Tasks discarded for any reason	~1,160 (68.3%)	Combined effect of all rubric axes
Final SWE-bench Verified	500	Released August 13, 2024
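
The approximate counts in the table follow from applying each reported percentage to the 1,699 reviewed candidates. The short calculation below reproduces them, within rounding, as a sanity check; the variable names are illustrative, and the only inputs are the figures quoted above.

```python
REVIEWED = 1_699  # candidates sampled for human review

# Percentages reported for the annotation pass, expressed as fractions.
flag_rates = {
    "problem clarity": 0.383,
    "test fairness": 0.611,
    "any reason": 0.683,
}

# Convert each rate into an approximate task count.
counts = {name: round(REVIEWED * rate) for name, rate in flag_rates.items()}
for name, count in counts.items():
    print(f"flagged for {name}: ~{count:,} of {REVIEWED:,}")

# Candidates that survive every rubric axis.
survivors = REVIEWED - counts["any reason"]
print(f"survivors: ~{survivors}")
```

The survivor count comes out somewhat above 500, so the released set is a subset of the tasks that passed review rather than all of them.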

Difficulty distribution

Using the annotators' time estimates, the SWE-bench Verified team published a difficulty breakdown. Most tasks are short, which has been both a feature (cheap to evaluate) and a criticism (does not capture multi-day projects).^[1]^[9]

Difficulty	Estimated time to solve	Number of tasks	Share of dataset
Easy	< 15 minutes	196	39.2%
Medium	15 minutes to 1 hour	259	51.8%
Hard	1 to 4 hours	~42	~8.4%
Very hard	> 4 hours	3	0.6%

A companion analysis by Epoch AI confirmed the skew, finding that 91% of Verified tasks could be completed in under an hour and that the median gold patch changes only a handful of lines of code. Easy fixes average around 5 changed lines, while the longer tasks average closer to 50 changed lines. Only 14.2% of instances require edits to more than one file.^[9]^[10]
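
The 91% figure is consistent with the bucket counts in the table: the two shortest buckets together cover 455 of the 500 tasks. A small check, using only the numbers above:

```python
# Task counts per annotator time estimate, taken from the table above.
buckets = {
    "< 15 minutes": 196,
    "15 minutes to 1 hour": 259,
    "1 to 4 hours": 42,
    "> 4 hours": 3,
}

total = sum(buckets.values())
shares = {label: 100 * count / total for label, count in buckets.items()}

# Share of tasks estimated to take under one hour.
under_one_hour = shares["< 15 minutes"] + shares["15 minutes to 1 hour"]
print(total, round(under_one_hour, 1))
```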

How does SWE-bench Verified differ from the original SWE-bench?

Verified inherits the construction pipeline of SWE-bench but differs in size, quality, and reporting role. The table below summarizes the key differences.

Property	Original SWE-bench	SWE-bench Verified
Release date	October 10, 2023	August 13, 2024
Lead organization	Princeton University	OpenAI Preparedness with Princeton
Number of tasks	2,294	500
Source repositories	12 Python projects	Same 12 Python projects
Curation	None beyond automated test validation	93 contracted annotators, triple review on 1,699 candidates
Discard rate	0% (full set)	~68.3% of reviewed candidates discarded
First published baseline	1.96% (Claude 2, BM25 retrieval)	33.2% (GPT-4o + Agentless scaffold)
Current public SOTA	Reported only on subsets	88.6% (Claude Opus 4.8, May 2026)
Status	Maintained as historical reference	Deprecated by OpenAI for frontier evals (Feb 2026)
Recommended successor	SWE-bench Pro, SWE-bench Live	SWE-bench Pro

The core mechanical difference is curation. Same repositories, same harness, same FAIL_TO_PASS / PASS_TO_PASS grading. What changed was the confidence one could place in any given score. On Verified, an 80% resolve rate plausibly means the agent solved 400 problems whose fixes a panel of reviewers agreed on. On the full benchmark, an 80% resolve rate would have included tasks where the test suite let multiple non-equivalent patches pass, or where the issue text omitted information the agent had to infer from the gold patch.

Repository and task structure

Like the parent benchmark, Verified draws from 12 Python open-source repositories. Each task includes the issue title and body, the codebase at the pre-fix commit, the dependency manifest, and the FAIL_TO_PASS and PASS_TO_PASS test sets needed to grade a candidate patch.
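
As a rough picture of what one task record carries, the sketch below models the fields named above as a small Python dataclass. The key names (`instance_id`, `base_commit`, `FAIL_TO_PASS`, and so on) and the assumption that the test lists are stored as JSON-encoded strings are modeled on the public Hugging Face release; treat them as assumptions and check the dataset card before relying on them. The example row is a placeholder, not a real Verified task.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """Minimal view of one benchmark task (field selection is illustrative)."""
    instance_id: str
    repo: str
    base_commit: str          # codebase state immediately before the fix
    problem_statement: str    # issue title and body
    fail_to_pass: tuple[str, ...]
    pass_to_pass: tuple[str, ...]


def parse_task(row: dict) -> TaskInstance:
    """Build a TaskInstance from a raw row; test lists are assumed to be JSON strings."""
    return TaskInstance(
        instance_id=row["instance_id"],
        repo=row["repo"],
        base_commit=row["base_commit"],
        problem_statement=row["problem_statement"],
        fail_to_pass=tuple(json.loads(row["FAIL_TO_PASS"])),
        pass_to_pass=tuple(json.loads(row["PASS_TO_PASS"])),
    )


# Hypothetical row with placeholder values.
row = {
    "instance_id": "example__project-123",
    "repo": "example/project",
    "base_commit": "0000000000000000000000000000000000000000",
    "problem_statement": "Calling parse() on an empty string raises IndexError.",
    "FAIL_TO_PASS": '["tests/test_parse.py::test_empty_string"]',
    "PASS_TO_PASS": '["tests/test_parse.py::test_basic", "tests/test_parse.py::test_unicode"]',
}

task = parse_task(row)
print(task.repo, len(task.fail_to_pass), len(task.pass_to_pass))
```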

Repository	Domain	Approximate share of Verified
Django	Web framework	~45%
SymPy	Symbolic mathematics	~15%
Sphinx	Documentation generator	~10%
Matplotlib	Data visualization	~8%
scikit-learn	Machine learning library	~7%
Flask	Web microframework	~5%
Requests	HTTP library	~3%
Pytest	Testing framework	~2%
Astropy	Astronomy tools	~2%
Xarray	N-D labeled arrays	~1%
Seaborn	Statistical visualization	~1%
Pylint	Code analysis	~1%

Django dominates because it has both a long history of high-quality issue tracking and a large, well-tested codebase that supplies many candidate tasks. The five largest repositories account for roughly 85% of the dataset, which mirrors the distribution in the parent SWE-bench. The skew has been criticized for over-indexing the benchmark on Django-flavored web framework idioms, though most of the strongest agent submissions report consistent performance across the repository mix.^[1]^[9]

Evaluation framework

Containerized execution

Verified uses the same Docker harness as the broader SWE-bench family. Each task is associated with a specific environment image that fixes the Python version and the exact dependency versions present at the issue's commit. This containerization is what lets different labs compare scores: without pinned environments, NumPy or pytest version drift alone could swing scores by several percentage points.^[2]^[11]

How is a task graded?

A task is marked resolved if and only if every FAIL_TO_PASS test passes after applying the agent's patch and every PASS_TO_PASS test continues to pass. A patch that fixes the bug but breaks an unrelated regression test counts as a failure. A patch that does not apply cleanly because of whitespace drift also counts as a failure. The strictness of the grading is part of why Verified produces interpretable signals: an agent that achieves 80% has met both the bug-fix and the no-regression bars on 400 separate tasks.^[1]^[2]^[11]
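
The grading rule reduces to a conjunction over two test sets. The sketch below is a minimal illustration of that rule, not the official harness: `is_resolved`, `resolve_rate`, and the `passed` mapping of test identifiers to outcomes are names assumed here for the example.

```python
from typing import Iterable, Mapping


def is_resolved(
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
    passed: Mapping[str, bool],
    patch_applied: bool = True,
) -> bool:
    """Return True only if the patch applied and every required test passed.

    A test missing from `passed` is treated as a failure.
    """
    if not patch_applied:
        return False  # a patch that does not apply cleanly counts as a failure
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(passed.get(test, False) for test in required)


def resolve_rate(outcomes: Iterable[bool]) -> float:
    """Percentage of tasks resolved."""
    outcomes = list(outcomes)
    return 100.0 * sum(outcomes) / len(outcomes)


f2p = ["test_bug"]
p2p = ["test_a", "test_b"]

# Fixes the bug and keeps regressions green.
print(is_resolved(f2p, p2p, {"test_bug": True, "test_a": True, "test_b": True}))
# Fixes the bug but breaks an unrelated regression test.
print(is_resolved(f2p, p2p, {"test_bug": True, "test_a": True, "test_b": False}))
# 400 resolved tasks out of 500.
print(resolve_rate([True] * 400 + [False] * 100))
```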

Reporting conventions

The primary metric is % Resolved, the share of the 500 tasks for which the agent's patch passes the grading rule. Secondary metrics that became common in the late-2024 to early-2026 reporting cycle include:

Pass@k: success rate when the agent gets k independent attempts.
Cost per resolved task: dollar cost of the model and tools required to resolve a task on average.
Wall clock per task: median seconds the agent runs before producing a patch.
Variance across reruns: standard deviation of resolve rate across 3 to 5 evaluation seeds, since stochastic agents can swing several points run-to-run.
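
The secondary metrics can be computed from per-task outcomes with a few lines of Python. The sketch below is illustrative: the article does not say which pass@k estimator leaderboards use, so the combinatorial estimator shown (the share of size-k subsets of n attempts that contain at least one success) is an assumption, and the rerun scores and dollar figures are hypothetical.

```python
from math import comb
from statistics import mean, stdev


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, of which c resolved the task."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)


def cost_per_resolved(total_cost_usd: float, resolved: int) -> float:
    """Average dollar cost per resolved task."""
    if resolved <= 0:
        raise ValueError("at least one task must be resolved")
    return total_cost_usd / resolved


# Hypothetical % Resolved from three evaluation seeds of the same agent.
rerun_scores = [79.8, 80.6, 81.2]

print(pass_at_k(n=5, c=2, k=1))
print(cost_per_resolved(total_cost_usd=600.0, resolved=400))
print(round(mean(rerun_scores), 2), round(stdev(rerun_scores), 2))
```

Reporting the standard deviation alongside the mean makes it easier to tell whether a one-point gap between two agents exceeds run-to-run noise.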

From mid-2025 onward, several leaderboards (notably swebench.com, llm-stats.com, and Scale's labs.scale.com) added cost columns next to the headline accuracy figure, partly in response to criticism that high-cost reasoning agents were inflating reported scores without delivering proportional value.^[4]^[5]^[12]

Historical leaderboard

Verified scores climbed faster than almost any other coding benchmark. The progression below tracks the public state-of-the-art at major checkpoints from launch through 2026.

Date	Public best % Resolved	Model / agent	Organization	Notes
Aug 2024	33.2%	GPT-4o + Agentless	OpenAI	First published baseline at launch
Aug 2024	~30%	Devin v0	Cognition Labs	Cognition's entry; topped the early public Verified leaderboard
Sep 2024	45.0%	Previous SOTA agent (composite scaffolds)	Various	Pre-Claude 3.5 Sonnet plateau
Oct 2024	49.0%	Claude 3.5 Sonnet (new)	Anthropic	First single-model score to reach 49%^[13]
Dec 2024	53.0%	OpenAI o1 + scaffold	OpenAI	Reasoning models begin to contribute
Feb 2025	62.3%	Claude 3.7 Sonnet	Anthropic	Extended thinking helps coding
Feb 2025	70.3%	Claude 3.7 Sonnet (extended thinking + custom scaffold)	Anthropic	First reported 70%+ Verified score
May 2025	64.93%	Claude Sonnet 4	Anthropic	New Sonnet baseline
Aug 2025	74.5%	Claude Opus 4.1	Anthropic	Headline number for many product launches
Aug 2025	~75%	GPT-5 (Codex scaffold)	OpenAI	OpenAI's first GPT-5 reporting on Verified
Nov 2025	80.9%	Claude Opus 4.5	Anthropic	First confirmed 80%+ score
Feb 2026	85.0%	GPT-5.3 Codex	OpenAI	OpenAI's last headline Verified release
Feb 2026	80.6%	Gemini 3.1 Pro	Google	Google's tied best
Feb 2026	80.2%	MiniMax M2.5	MiniMax	Open-weight tied score
Apr 2026	87.6%	Claude Opus 4.7 (1M context)	Anthropic	Public SOTA until Claude Opus 4.8; released after OpenAI's February 2026 deprecation
May 2026	88.6%	Claude Opus 4.8	Anthropic	Highest generally-available score (released May 28, 2026)
2026	93.9%	Claude Mythos	Anthropic	Internal cybersecurity model; not generally available

The trajectory looks like a textbook capability ramp. From August 2024 to May 2026, the public state-of-the-art among generally-available models rose from roughly 33% to 88.6%, an increase of more than 55 percentage points in under two years. Three forces drove the gain in roughly equal proportion: stronger base models (Claude 3.5 to 4.8, GPT-4o to GPT-5.3), better reasoning (extended thinking, chain-of-thought, self-verification), and more capable agent scaffolds (Agentless, SWE-agent, OpenHands, Claude Code).

What is the highest SWE-bench Verified score?

As of June 2026, the highest score by a generally-available model is 88.6%, posted by Anthropic's Claude Opus 4.8 (released May 28, 2026), which edged out the April 2026 mark of 87.6% set by Claude Opus 4.7.^[29]^[30] Anthropic's internal Claude Mythos model is reported at 93.9%, but it is not generally available and is excluded from the public leaderboard.^[3]^[29] The public top of the leaderboard is dominated by Anthropic: the steel.dev aggregation snapshot (updated June 12, 2026) lists Claude Opus 4.8 (88.6%), Claude Opus 4.7 (87.6%), Claude Opus 4.5 (80.9%), and Claude Opus 4.6 (80.8%) above the strongest non-Anthropic entries, which cluster near 80% (Gemini 3.1 Pro 80.6%, MiniMax M2.5 80.2%, GPT-5.2 80.0%).^[30] Because OpenAI deprecated Verified for frontier evaluation in February 2026, these figures should be read as the closing state of the benchmark's competitive era rather than a live race.

Early Verified leaderboard and the role of Cognition Devin

In the days immediately after the August 13, 2024 release, the publicly visible Verified leaderboard was topped by Cognition Labs' Devin agent. Cognition had spent the first half of 2024 building Devin around the original SWE-bench full set and was in a strong position to evaluate against the curated subset on day one. Devin's early Verified results were in the high 20s to low 30s, comparable to the GPT-4o + Agentless baseline OpenAI published in the same week. This brief period mattered for the benchmark's perception: it showed that the leaderboard would be contested by both research labs and commercial agent vendors, and it set the template for marketing-quality SWE-bench reporting that subsequent product launches followed.^[14]^[15]

Top-15 snapshot (April 2026)

The table below is a snapshot of the public SWE-bench Verified leaderboard from April 2026, sourced from the public swebench.com leaderboard, llm-stats.com aggregation, and lab-reported scores. Internal-only models like Claude Mythos are excluded.^[4]^[5]

Rank	Model	Organization	% Resolved	Notes
1	Claude Opus 4.7	Anthropic	87.6%	1M context, released April 16, 2026
2	GPT-5.3 Codex	OpenAI	85.0%	OpenAI's final Verified release before deprecation
3	Claude Opus 4.5	Anthropic	80.9%	First confirmed public 80%+ result
4	Claude Opus 4.6	Anthropic	80.8%	Mid-Q1 2026 release
5	Gemini 3.1 Pro	Google	80.6%	Google DeepMind
6	MiniMax M2.5	MiniMax	80.2%	Open-weight
7	GPT-5.2	OpenAI	80.0%	Pre-Codex variant
8	Claude Sonnet 4.6	Anthropic	79.6%	Smaller, cheaper Anthropic model
9	Qwen3.6 Plus	Alibaba	78.8%	Released April 2026
10	Gemini 3 Flash	Google	78.0%	Cheaper Google variant
11	MiMo-V2-Pro	Xiaomi	78.0%	Open-source 1T-param model
12	GLM-5	Zhipu AI	77.8%	Open-source, 744B parameters
13	Muse Spark	Meta	77.4%	Meta Superintelligence Labs flagship
14	Claude Sonnet 4.5	Anthropic	77.2%	Late-2025 release
15	Kimi K2.5	Moonshot AI	76.8%	Open-source

In May 2026, after this snapshot, Anthropic released Claude Opus 4.8, which posted 88.6% on Verified and 69.2% on the recommended successor SWE-bench Pro, taking the top generally-available position.^[29]^[30]

Top scoring approaches and frameworks

The agent scaffolds that dominated the Verified leaderboard tended to combine an interactive shell or editor with a search tool, a test runner, and some form of self-verification. Several reference designs are worth singling out.

Agentless

Agentless, developed by researchers at the University of Illinois Urbana-Champaign, eschews an autonomous agent loop in favor of a fixed three-stage pipeline: localize the bug, repair it, and validate the fix. OpenAI used Agentless paired with GPT-4o for the original 33.2% Verified baseline because the simple pipeline made it easy to attribute scores to model capability rather than scaffold engineering. The cleaned task set roughly doubled Agentless's 16% score on the full SWE-bench, and the scaffold remained the most popular reference for cheap, reproducible Verified runs through 2025.^[1]^[16]

SWE-agent

SWE-agent, the official Princeton agent paper published at NeurIPS 2024, introduced the Agent-Computer Interface (ACI): a curated set of shell commands that make repository navigation and editing easier for language models. SWE-agent's scaffolds powered many of the open-source Verified entries, including the Live-SWE-agent variant that reached 79.2% with Claude Opus 4.5 in late 2025.^[17]

OpenHands (formerly OpenDevin)

OpenHands consolidated several research scaffolds into a community-driven open-source platform supporting browser, shell, and editor tools. It became the de facto open-source counterweight to closed agent products like Devin and was widely used in academic Verified submissions. The framework supports multiple LLM backends and ships with reusable plugins for retrieval, multi-attempt sampling, and verifier-based selection.^[18]

Aider

Aider, Paul Gauthier's open-source command-line coding assistant, runs benchmarks on Verified to compare different models on whole-file edits. It is mostly used as an interactive tool, but its public benchmark page played an outsized role in popularizing cost-adjusted reporting and the practice of comparing the same scaffold across many models.^[19]

Claude Code

Claude Code, Anthropic's terminal-based agent introduced in early 2025, became the reference scaffold for Anthropic's own Verified numbers. Its bash tool, file editor, and computer-use tool let the underlying Claude model do most of the heavy lifting with a relatively shallow surrounding harness. Claude Code's design influenced how other labs approached agent engineering, especially the move toward giving the model more direct shell access rather than wrapping it in heavy intermediation.^[20]

RA-Aid, Moatless, and AutoCodeRover

A second tier of agents has held steady positions in the mid-leaderboard. RA-Aid is a research agent built on top of OpenHands with a focus on retrieval-augmented planning. Moatless Tools is a lightweight scaffold popular with budget-conscious researchers. AutoCodeRover, developed at the National University of Singapore, uses spectrum-based fault localization to narrow the search space before patch generation. Each contributed important ideas (decoupled localization, deterministic patches, coverage-guided edits) that later commercial agents absorbed.^[21]^[22]

Other agents commonly seen on the leaderboard

Agent / framework	Developer	Approach
Amazon Q Developer Agent	Amazon	Enterprise agent with AWS tooling
Atlassian Rovo Dev	Atlassian	Agentic coding inside Jira and Bitbucket
Cursor Composer	Anysphere	IDE-based agent with human-in-the-loop edits
Codex	OpenAI	Cloud-based agent in sandboxed environments
Augment Code	Augment	Context-aware agent for large codebases
iSWE-Agent	IBM	Java-focused agent used on Multi-SWE-bench
CodeR	Independent	Multi-agent design with role specialization

Patterns common to the strongest submissions

Independent of any specific framework, the highest-scoring Verified entries through 2025 and 2026 tended to share a few patterns. They run multiple candidate patches and use a verifier (often an LLM-as-judge with regression testing) to pick the best one. They allow generous tool use (file search, AST queries, language servers, test runners) instead of restricting the agent to a fixed action set. They feed the test output back into the model in a tight loop so the agent can self-correct. And they explicitly handle whitespace and formatting drift in patch generation, which avoided the silent grading failures that plagued earlier submissions.
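
The first of these patterns, sampling several candidate patches and letting a verifier choose among them, can be sketched as a ranking step. Everything in the example is hypothetical: the `Candidate` fields, the judge score, and the ranking rule (regression pass ratio first, judge score as tie-breaker) are one plausible arrangement, not the method of any particular submission. The regression tests here are the repository's existing visible tests, since the grading tests stay hidden from the agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One sampled patch plus verifier signals (all fields are hypothetical)."""
    patch: str
    regressions_passed: int  # existing repository tests still passing
    regressions_total: int
    judge_score: float       # 0.0 to 1.0, e.g. from an LLM-as-judge


def select_patch(candidates: list[Candidate]) -> Candidate:
    """Pick the candidate with the best regression pass ratio, then the best judge score."""
    if not candidates:
        raise ValueError("need at least one candidate patch")

    def rank(candidate: Candidate) -> tuple[float, float]:
        if candidate.regressions_total == 0:
            pass_ratio = 0.0  # no evidence from tests; rank below tested patches
        else:
            pass_ratio = candidate.regressions_passed / candidate.regressions_total
        return (pass_ratio, candidate.judge_score)

    return max(candidates, key=rank)


candidates = [
    Candidate("patch-a", 10, 10, 0.4),
    Candidate("patch-b", 9, 10, 0.9),   # highest judge score, but breaks one test
    Candidate("patch-c", 10, 10, 0.7),
]
print(select_patch(candidates).patch)
```

Ranking test evidence above the judge score reflects the grading rule: a patch that breaks an existing test fails regardless of how plausible it looks.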

Why did OpenAI deprecate SWE-bench Verified?

On February 23, 2026, OpenAI's Mia Glaese and Olivia Watkins published "Why SWE-bench Verified no longer measures frontier coding capabilities." The post argued that the benchmark had reached the end of its useful life as a frontier evaluation and recommended SWE-bench Pro as the new community standard. OpenAI's core conclusion was blunt: "improvements on SWE-bench Verified no longer reflect meaningful improvements in models' real-world software development abilities," and "instead, they increasingly reflect how much the model was exposed to the benchmark at training time."^[3]

Findings cited in the deprecation post

OpenAI's case rested on three pillars:

Broken specifications dominate the unsolved tail. OpenAI audited 138 problems that the lab's models repeatedly failed to solve, representing roughly 27.6% of the dataset. Forty-nine of the audited problems had tests that were too narrow, rejecting functionally correct submissions; twenty-six had tests that were too wide, requiring features that were never mentioned in the issue. In total, about 59.4% of these hardest unsolved problems had flawed test cases.^[3]^[23]
Training-data contamination is now pervasive. Every frontier model OpenAI tested could reproduce verbatim portions of the gold patch or problem statement when prompted with only a Verified Task ID. Models implicated in the audit included GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash. In one example, GPT-5.2's chain-of-thought traces revealed knowledge of unspecified test requirements that could only have come from training data exposure.^[3]
Saturation has compressed the meaningful signal. With public scores already above 80% when the post appeared, and later in the high 80s alongside an internal Anthropic score of 93.9%, the headroom remaining on Verified is within the noise of the harness. Marginal improvements no longer reflect generalizable capability; they reflect how much of the dataset the model was exposed to during training.^[3]
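
The contamination finding rests on detecting verbatim reproduction. One simple way to probe for it, sketched below, is to measure the longest exact substring a model's output shares with the gold patch. This is an illustrative check, not the procedure OpenAI used; the function names, the 40-character threshold, and the sample strings are assumptions made for the example.

```python
from difflib import SequenceMatcher


def longest_verbatim_overlap(model_output: str, reference: str) -> str:
    """Return the longest exact substring shared by the output and the reference."""
    matcher = SequenceMatcher(None, model_output, reference, autojunk=False)
    match = matcher.find_longest_match(0, len(model_output), 0, len(reference))
    return model_output[match.a : match.a + match.size]


def looks_memorized(model_output: str, reference: str, min_chars: int = 40) -> bool:
    """Flag outputs that share a long verbatim span with the reference (threshold is arbitrary)."""
    return len(longest_verbatim_overlap(model_output, reference)) >= min_chars


# Hypothetical gold-patch fragment and model response.
gold_fragment = "    if value is None:\n        return default_value\n"
response = "Sure, the fix is:\n" + gold_fragment + "Done."

overlap = longest_verbatim_overlap(response, gold_fragment)
print(len(overlap), looks_memorized(response, gold_fragment))
print(looks_memorized("I would add a null check.", gold_fragment))
```

A substring check like this only catches exact copying; paraphrased or reformatted recall would need a fuzzier comparison.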

Recommendation

OpenAI recommended migrating to SWE-bench Pro, which uses 1,865 longer tasks across more diverse public, held-out, and commercial codebases. Pro tasks are explicitly chosen to require 1 to 4 hours of human effort or more, in contrast with Verified's bias toward sub-hour fixes. Initial Pro scores from frontier models ran 20 to 40 percentage points below their Verified scores, which OpenAI argued was the kind of measurement headroom that frontier evaluation requires. The gap held into mid-2026: Claude Opus 4.8 scored 88.6% on Verified but only 69.2% on SWE-bench Pro.^[3]^[24]^[29]

Industry response

The deprecation post was discussed extensively on the Latent Space podcast in the same week, with Glaese and Watkins arguing that the move was about measurement validity rather than a critique of Verified's design. Anthropic, Google DeepMind, and Meta did not immediately stop reporting Verified numbers, but most subsequent product launches paired the Verified score with a Pro score. Several outlets (CodeSOTA, blockchain.news, Latent Space, marc0.dev) wrote retrospectives within weeks framing the deprecation as a generational shift in how the field measures coding ability.^[3]^[6]^[23]^[24]^[25]

Successors and complementary benchmarks

OpenAI's deprecation accelerated work on a family of successor benchmarks that aim to restore measurement validity. The most influential are summarized below.

Successor	Released by	Released	Size	Why it matters
SWE-bench Pro	Scale AI	2025	1,865 long-horizon tasks across public, held-out, and commercial codebases	OpenAI's recommended successor; longer tasks, less contamination
SWE-bench Multimodal	Princeton SWE-bench team	October 2024	619 JavaScript tasks with embedded images	Tests visual grounding (UI screenshots, error images, diagrams)
SWE-bench Multilingual	Princeton SWE-bench team	2025	300 tasks across 9 languages (C, C++, Go, Java, JS, TS, PHP, Ruby, Rust)	Breaks the Python monoculture
Multi-SWE-bench	ByteDance Seed	2025 (NeurIPS 2025 D&B)	1,632 instances across 7 languages	Independent multilingual effort
SWE-bench Live	Microsoft Research	May 2025	1,565+ instances; updated monthly	Anti-contamination via post-cutoff issues
SWE-Lancer	OpenAI	February 2025	1,400+ Upwork tasks ($1M payouts)	Dollars-earned metric; freelance simulation
SWE-rebench	Independent	2025	Continuously updated Python set	Decontamination focus
SWE-bench+	OpenLM.ai	October 2024	Filtered subset of original SWE-bench	Removed leaked instances

SWE-Lancer

OpenAI launched SWE-Lancer in February 2025 to evaluate models on real freelance software engineering tasks scraped from Upwork and verified by Expensify. The benchmark covers more than 1,400 tasks with payouts totaling roughly $1 million in real dollars, ranging from $50 bug fixes to $32,000 feature implementations. Tasks split between individual contributor (IC) work and managerial decisions over technical proposals. Initial reported scores were modest: Claude 3.5 Sonnet completed 26.2% of IC tasks and 44.9% of managerial tasks, while OpenAI's o1 reached 20.3% and 46.3% respectively. The dollars-earned metric resonated outside the academic community and made SWE-Lancer popular with industry analysts.^[26]

SWE-bench Multimodal

Introduced in October 2024 (arXiv:2410.03859), SWE-bench Multimodal extends the harness to JavaScript repositories with image-bearing issues. The dataset contains 619 task instances drawn from 17 user-facing JavaScript repositories, with 862 images embedded across the problem statements. Image categories include code screenshots (194 instances), diagrams (107), error messages (54), digital art (38), maps (35), and data visualizations (28). Multimodal scores have lagged Verified scores by a wide margin because the task requires both visual grounding and code reasoning, and JavaScript test frameworks (Jest, Mocha, Playwright) make patch validation more involved than the Python set.^[27]

Limitations

Test suite weaknesses

OpenAI's deprecation audit confirmed earlier independent findings that Verified's test suites are not always trustworthy. An empirical study published as arXiv:2503.15223 reported that more than 15% of Verified instances have incomplete test patches that allow incorrect or partial solutions to pass. Specifically, 12.50% of passing patches were judged functionally or semantically incorrect, and 9.82% were incomplete. Frameworks like UTBoost and PatchDiff suggested leaderboard scores may be inflated by 6 to 7 percentage points due to test inadequacies. OpenAI's own audit reported even higher rates of broken specifications among the hardest unsolved tasks.^[3]^[28]
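
The failure mode described above can be illustrated with a toy case. The sketch below is not taken from any Verified task: `gold_clamp`, `partial_clamp`, `weak_test`, and `stronger_test` are hypothetical names invented here to show how a test that exercises only part of the intended behavior lets an incomplete patch count as resolved.

```python
def gold_clamp(value, low, high):
    """Hypothetical reference fix: clamp value into [low, high]."""
    return max(low, min(value, high))


def partial_clamp(value, low, high):
    """Hypothetical incomplete fix: enforces the upper bound only."""
    return min(value, high)


def weak_test(clamp):
    """Stand-in for an incomplete test patch: checks the upper bound only."""
    return clamp(15, 0, 10) == 10


def stronger_test(clamp):
    """Adds the lower-bound case that the weak test omits."""
    return clamp(15, 0, 10) == 10 and clamp(-5, 0, 10) == 0


# Both patches pass the weak test, so both would be graded as resolved.
print(weak_test(gold_clamp), weak_test(partial_clamp))
# Only the reference fix survives the stronger test.
print(stronger_test(gold_clamp), stronger_test(partial_clamp))
```

Under a grader that only runs the weak test, the two patches are indistinguishable, which is the kind of gap that test-augmentation approaches aim to close.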

Data contamination

Verified inherits the parent benchmark's contamination problem. More than 94% of SWE-bench issues were filed before the knowledge cutoff dates of major pre-trained language models. Subsequent audits, including the OpenLM.ai SWE-bench+ analysis and OpenAI's February 2026 study, demonstrated that frontier models could regenerate parts of the gold patch or problem statement when prompted with only a task ID. Even SWE-bench Live's monthly refresh did not entirely solve the problem, because new tasks enter model training data on a similar timescale.^[3]^[7]
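
One simple way to quantify "regenerating parts of the gold patch" is a text-similarity score between a model's output and the reference patch. The sketch below uses Python's standard `difflib.SequenceMatcher` for this; it is an illustrative assumption, not the method used in the audits cited above, and `patch_overlap` and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def patch_overlap(generated: str, gold_patch: str) -> float:
    """Return a similarity ratio in [0, 1] between two patch texts.

    1.0 means the texts are identical; values near 0 mean little shared content.
    """
    return SequenceMatcher(None, generated, gold_patch).ratio()


# Hypothetical strings standing in for a gold patch and a model's output
# produced from a task ID alone.
gold = "-    return value\n+    return max(low, min(value, high))\n"
recalled = "-    return value\n+    return max(low, min(value, high))\n"
unrelated = "zzzz"

print(patch_overlap(recalled, gold))   # identical text gives the maximum ratio
print(patch_overlap(unrelated, gold))  # unrelated text gives a low ratio
```

A high ratio on a prompt that contains no issue text would be a warning sign of memorization, although a real audit would also need to rule out patches that are short or formulaic enough to be guessed.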

Repository selection bias

The 12 source repositories are all open-source Python projects with strong test cultures. Many real-world codebases have sparse tests, proprietary dependencies, or architectural patterns not represented in this set. As a result, high Verified scores do not necessarily predict performance on arbitrary production codebases. Django's dominance in particular gives the dataset a web-framework flavor that over-represents that domain relative to the broader software ecosystem.^[9]^[10]

Task-length skew

Epoch AI's analysis showed that the majority of Verified tasks are relatively simple. About 91% can be completed by a human in under one hour and 39.2% in under 15 minutes. The benchmark therefore primarily measures an agent's ability to fix straightforward bugs rather than tackle architectural changes or large feature implementations. This skew is the single biggest reason successor benchmarks like SWE-bench Pro and SWE-Lancer reset the difficulty floor.^[9]^[10]

Python-only coverage

Verified is Python-only. Performance on Verified does not reliably generalize to JavaScript, Java, C++, Go, or Rust. SWE-bench Multilingual, Multi-SWE-bench, and SWE-bench Live partially address this gap, but Verified's outsized leaderboard role meant the field's headline metric was Python-bound through early 2026.^[3]^[27]

Cost and reproducibility

A full Verified evaluation is resource-intensive: it needs at minimum 120 GB of free disk space, 16 GB of RAM, and 8 CPU cores, with 32 GB of RAM recommended for parallel execution. Cloud evaluation through Modal removes the local hardware burden but introduces direct dollar cost. A reasoning-heavy agent run with Claude Opus 4.7 and 200K thinking tokens per task can cost $10 to $20 per attempt, which makes top-of-leaderboard reproductions expensive even for well-funded labs.^[2]^[11]
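
To put the per-attempt figure in context, the arithmetic for a full 500-task run can be sketched as follows. The sketch assumes one attempt per task unless stated otherwise and ignores infrastructure cost; `full_run_cost` and its parameters are hypothetical names used only for this illustration.

```python
TASKS = 500  # size of SWE-bench Verified


def full_run_cost(cost_per_attempt, attempts_per_task=1, tasks=TASKS):
    """Estimate model spend in dollars for one pass over the benchmark."""
    return cost_per_attempt * attempts_per_task * tasks


low = full_run_cost(10)   # $10 per attempt
high = full_run_cost(20)  # $20 per attempt
print(f"${low:,} to ${high:,}")  # prints "$5,000 to $10,000"
```

Under these assumptions a single pass costs roughly $5,000 to $10,000 in model usage, and the total scales linearly with the number of attempts or repeated runs used to reduce variance.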

Saturation

The most consequential limitation is the one OpenAI cited in February 2026: with public scores in the high 80s and Anthropic's internal Claude Mythos at 93.9%, the gap between frontier models and the score ceiling is now within harness noise. Verified can still discriminate between weaker models, but the lab competition that defined the late-2024 to late-2025 era has effectively ended.^[3]
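
One component of that noise is plain sampling error from scoring only 500 tasks. The sketch below treats each task as an independent pass/fail trial and computes the binomial standard error of a resolve rate. This is a simplifying assumption made for illustration: it ignores run-to-run agent variance and flawed tasks, and `resolve_rate_std_error` and `ci_half_width` are hypothetical helper names.

```python
import math


def resolve_rate_std_error(rate, n_tasks=500):
    """Binomial standard error of a resolve rate measured on n_tasks tasks."""
    return math.sqrt(rate * (1 - rate) / n_tasks)


def ci_half_width(rate, n_tasks=500, z=1.96):
    """Half-width of an approximate 95% normal confidence interval."""
    return z * resolve_rate_std_error(rate, n_tasks)


for rate in (0.886, 0.939):
    se = resolve_rate_std_error(rate)
    hw = ci_half_width(rate)
    print(f"rate={rate:.1%}  std error={se:.2%}  95% half-width={hw:.2%}")
```

At an 88.6% resolve rate the standard error is about 1.4 percentage points, so differences of a point or two between top entries are hard to separate from sampling noise alone.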

References

1. OpenAI. (August 13, 2024). "Introducing SWE-bench Verified." https://openai.com/index/introducing-swe-bench-verified/
2. SWE-bench Verified official page. https://www.swebench.com/verified.html
3. OpenAI. (February 23, 2026). "Why SWE-bench Verified no longer measures frontier coding capabilities." Mia Glaese and Olivia Watkins. https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/
4. llm-stats.com. SWE-bench Verified leaderboard (April 2026). https://llm-stats.com/benchmarks/swe-bench-verified
5. Marco Patzelt. (April 2026). "SWE-Bench Verified Leaderboard, Claude Opus 4.7 Leads." https://www.marc0.dev/en/leaderboard
6. TokenMix. (April 2026). "SWE-Bench 2026: Claude Opus 4.7 Wins 87.6% vs GPT-5.3 85.0%." https://tokenmix.ai/blog/swe-bench-2026-claude-opus-4-7-wins
7. OpenLM.ai. (October 2024). "SWE-Bench+: Enhanced Coding Benchmark for LLMs." arXiv:2410.06992. https://openlm.ai/swe-bench/
8. OpenAI. (2024). "SWE-b Annotation Instructions." Annotator-facing PDF. https://cdn.openai.com/introducing-swe-bench-verified/swe-b-annotation-instructions.pdf
9. Epoch AI. "SWE-bench Verified." https://epoch.ai/benchmarks/swe-bench-verified
10. Epoch AI. (2024). "What skills does SWE-bench Verified evaluate?" https://epoch.ai/blog/what-skills-does-swe-bench-verified-evaluate
11. SWE-bench Docker setup and cloud evaluation guide. https://www.swebench.com/SWE-bench/guides/docker_setup/
12. Scale Labs. SWE-Bench Pro Leaderboard (Public). https://labs.scale.com/leaderboard/swe_bench_pro_public
13. Anthropic. (October 2024). "Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet." https://www.anthropic.com/research/swe-bench-sonnet
14. Cognition Labs. (2024). "SWE-bench Technical Report." https://cognition.ai/blog/swe-bench-technical-report
15. SWE-bench leaderboard archive. https://www.swebench.com/
16. "Agentless: Demystifying LLM-based Software Engineering Agents." UIUC and Princeton, 2024.
17. Yang, J. et al. (2024). "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering." NeurIPS 2024. https://swe-agent.com/
18. OpenHands (formerly OpenDevin) project page. https://github.com/All-Hands-AI/OpenHands
19. Aider benchmarks. https://aider.chat/docs/leaderboards/
20. Anthropic. (2025). Claude Code documentation. https://docs.anthropic.com/en/docs/claude-code
21. AutoCodeRover. "Autonomous Program Improvement." arXiv:2404.05427.
22. Moatless Tools repository. https://github.com/aorwall/moatless-tools
23. blockchain.news. (2026). "OpenAI Abandons SWE-bench Verified After Finding 59% of Failed Tests Were Flawed." https://blockchain.news/news/openai-abandons-swe-bench-verified-contamination-flawed-tests
24. Latent Space. (2026). "The End of SWE-Bench Verified, with Mia Glaese & Olivia Watkins." https://www.latent.space/p/swe-bench-dead
25. CodeSOTA. (2026). "Is SWE-bench Verified Contaminated? OpenAI Shifts to SWE-bench Pro." https://www.codesota.com/news/swe-bench-contamination-debate
26. OpenAI. (February 2025). "Introducing the SWE-Lancer Benchmark." https://openai.com/index/swe-lancer/
27. Yang, J. et al. (2024). "SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?" arXiv:2410.03859.
28. "Are 'Solved Issues' in SWE-bench Really Solved Correctly? An Empirical Study." arXiv:2503.15223.
29. MacRumors. (May 28, 2026). "Anthropic Launches Claude Opus 4.8 With Gains in Coding and Honesty." https://www.macrumors.com/2026/05/28/anthropic-claude-opus-4-8/
30. Steel.dev. "SWE-bench Verified Leaderboard 2026: Latest Coding Agent Scores" (snapshot updated June 12, 2026). https://leaderboard.steel.dev/leaderboards/swe-bench-verified/
