# SWE-bench Verified

> Source: https://aiwiki.ai/wiki/swe-bench_verified
> Updated: 2026-06-21
> Categories: AI Benchmarks, AI Code Generation, Model Evaluation
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

| SWE-bench Verified | |
| --- | --- |
| Overview |
| Full name | Software Engineering Benchmark, Verified subset |
| Abbreviation | SWE-bench Verified |
| Description | A 500-task, human-validated subset of [SWE-bench](/wiki/swe-bench) for evaluating AI agents on real-world GitHub issue resolution |
| Release date | 2024-08-13 |
| Latest version | 1.0 |
| Authors | OpenAI Preparedness team in collaboration with the Princeton SWE-bench team |
| Organization | [OpenAI](/wiki/openai), Princeton University |
| Technical Details |
| Type | Code generation, bug fixing, software engineering |
| Modality | Text, code |
| Task format | GitHub issue resolution with unit-test grading |
| Number of tasks | 500 (filtered from 1,699 candidates drawn from the original 2,294-instance SWE-bench) |
| Repositories | 12 popular Python projects (Django, SymPy, [scikit-learn](/wiki/scikit-learn), Sphinx, [Matplotlib](/wiki/matplotlib), pytest, xarray, astropy, pylint, requests, seaborn, Flask) |
| Evaluation metric | Resolve rate (% Resolved); FAIL_TO_PASS and PASS_TO_PASS test grading |
| Languages | Python |
| Performance |
| Initial baseline (Aug 2024) | 33.2% (GPT-4o + Agentless) |
| Public SOTA | 88.6% ([Claude](/wiki/claude) Opus 4.8, May 2026) |
| Internal SOTA reported | 93.9% (Claude Mythos, [Anthropic](/wiki/anthropic), 2026) |
| Status | Deprecated by [OpenAI](/wiki/openai) for frontier evaluation on February 23, 2026 |
| Resources |
| Website | [Official page](https://www.swebench.com/verified.html) |
| Announcement | [OpenAI blog](https://openai.com/index/introducing-swe-bench-verified/) |
| Deprecation post | [OpenAI blog (Feb 2026)](https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/) |
| Dataset | [Hugging Face](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified) |
| Annotation guide | [SWE-b Annotation Instructions (PDF)](https://cdn.openai.com/introducing-swe-bench-verified/swe-b-annotation-instructions.pdf) |
| License | MIT |
| Predecessor | [SWE-bench](/wiki/swe-bench) |

**SWE-bench Verified** is a 500-problem, human-validated subset of the [SWE-bench](/wiki/swe-bench) software engineering [benchmark](/wiki/benchmark), released on August 13, 2024 by [OpenAI](/wiki/openai)'s Preparedness team together with the original Princeton SWE-bench authors. Each task asks an AI agent to read a real GitHub issue from one of 12 popular open-source Python repositories, edit the codebase, and produce a patch that is graded automatically against hidden unit tests. What sets Verified apart from the parent benchmark is that a pool of 93 contracted software developers reviewed the candidate tasks, with three annotators assigned to each, to confirm that the problem statement is unambiguous, the unit tests fairly grade a correct solution, and the fix is achievable in the harness time budget.[1] From late 2024 through early 2026 it was the single most-cited coding benchmark in frontier model launches, until OpenAI itself deprecated it for frontier evaluation on February 23, 2026 over test flaws and training-data contamination.[2][3]

Where the original SWE-bench's 2,294 instances were noisy enough that small score differences were hard to interpret, Verified produced a clean leaderboard signal that top labs raced to climb. The first published number was 33.2% for [GPT-4o](/wiki/gpt4) paired with the Agentless scaffold in August 2024, which roughly doubled Agentless's 16% on the full benchmark once the noisiest tasks were removed.[1] The public state-of-the-art then rose to 49% by October 2024, past 80% in November 2025, and to 88.6% for [Claude](/wiki/claude) Opus 4.8 in May 2026, with [Anthropic](/wiki/anthropic)'s internal Claude Mythos posting 93.9%, prompting widespread agreement that Verified had saturated.[3][4][29] OpenAI's deprecation post argued that gains in this range were no longer meaningful: more than 59% of the hardest unsolved tasks in their audit had broken or unfair tests, and every frontier model tested could reproduce verbatim portions of the dataset.[3]

Despite the deprecation, SWE-bench Verified remains widely used as a sanity check, an instructional benchmark, and a reference point against which newer evaluations like [SWE-bench Pro](/wiki/swe-bench), SWE-Lancer, and SWE-bench Multimodal are calibrated. Its design choices, especially the 93-annotator review process and the FAIL_TO_PASS / PASS_TO_PASS grading scheme, set the template that successor benchmarks have refined rather than replaced.[1][3]

## What is SWE-bench Verified used for?

SWE-bench Verified shares its core mechanics with the original [SWE-bench](/wiki/swe-bench): each task pairs a real GitHub issue with the codebase state immediately before the human-written fix was merged, plus a set of unit tests that distinguish the buggy state from the fixed state. An AI agent is given the issue text and the repository, must produce a code patch, and is graded on whether the patch flips the failing tests to passing without breaking the previously passing tests. What distinguishes Verified is the curation layer on top.

The core differences from SWE-bench are quality, scale, and intended use. The original benchmark's test set contains 2,294 issue-PR pairs mined from 12 repositories and checked only by automated test validation. Verified shrinks the set to 500 instances that human annotators confirmed are well specified, fairly tested, and solvable. The result is a benchmark where a passing patch can be confidently interpreted as a real fix rather than a coincidence with a thin or buggy test suite.[1]

Verified was released alongside the OpenAI Preparedness Framework's broader effort to measure dangerous autonomous capability in frontier models. The Preparedness team viewed software engineering ability as a leading indicator of model autonomy, and Verified was created so that the autonomy score would not be polluted by ambiguous tasks. The benchmark therefore plays a dual role in the AI safety and capabilities literature: it is both a standard product comparison and an input to OpenAI's risk assessments.[1][2]

## Origin and motivation

### What problem did the original SWE-bench have?

When [SWE-bench](/wiki/swe-bench) launched in October 2023, the headline result was that even the strongest available model, [Claude](/wiki/claude) 2 with BM25 retrieval, resolved only 1.96% of tasks. Through 2024, agent scaffolds like SWE-agent and commercial products like [Devin](/wiki/devin) drove that number above 13%, but the leaderboard was already showing strange behavior. Some tasks were trivially passable through patterns that did not actually fix the underlying bug. Others were essentially impossible because the test suite checked for behavior the issue had never specified.

A later analysis by the SWE-bench+ team at OpenLM.ai estimated that roughly 32.67% of "successful" patches on the original benchmark involved solution leakage in the issue text or comments, and that approximately 60% of resolved instances showed some form of leakage when broader cues were considered. After filtering, the SWE-agent + GPT-4 score dropped from 12.47% to 3.97%, a roughly threefold reduction. Independent reviewers had also flagged tests that rejected functionally correct solutions because they checked for the exact wording or formatting used in the gold patch.[1][7]

OpenAI's Preparedness team needed a more reliable measurement to feed into its Model Autonomy risk evaluations. The team also wanted a benchmark whose score could be cited in product launches without each lab having to caveat that the underlying tasks were noisy. Both goals pointed in the same direction: human-validate a subset of SWE-bench and publish a clean signal.

### Who created SWE-bench Verified and when?

In early 2024, OpenAI Preparedness coordinated with the original SWE-bench authors at Princeton (Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan) on a curation effort that shipped on August 13, 2024. The goals were stated plainly in OpenAI's announcement, which described the three things human annotators were asked to ensure: "sample descriptions are well-specified and not too underspecified or otherwise unfair; ... unit tests correctly cover the intended solution; ... development environments can be reliably set up."[1] The same post summarized the curation result, that "we worked with 93 software developers experienced in Python to manually screen SWE-bench samples for quality, and created SWE-bench Verified, a subset of the original test set ... consisting of 500 samples verified to be non-problematic."[1]

Verified inherits the upstream benchmark's MIT license and uses the same Docker harness and grading scheme, so existing SWE-bench tooling continues to work. The two teams also coordinated on the official `princeton-nlp/SWE-bench_Verified` Hugging Face dataset and on hosting a separate Verified leaderboard at swebench.com.[1][2]

## Methodology

### Annotator pool

OpenAI contracted 93 software developers experienced in Python to perform the annotation. Each annotator had to demonstrate working familiarity with the relevant repository ecosystems (web frameworks, scientific computing, data visualization). The pool was drawn from a vendor that supplied technical contractors to OpenAI for related Preparedness evaluations, and annotators were paid for their time rather than incentivized by quality bonuses, which the team argued kept ratings calibrated.[1][2]

### Sampling and triple review

From the 2,294-instance SWE-bench test set, OpenAI selected 1,699 random samples to label. Every sample was reviewed by three independent annotators using a structured rubric. The triple-review design was meant to surface disagreement on borderline cases without leaning on any single reviewer's judgment.[1][2]

### Annotation rubric

Annotators rated each task along four axes. The first three were scored on a 0 to 3 severity scale, where 0 or 1 indicates a minor concern and 2 or 3 means the sample should be discarded; the fourth was a flag for any other blocking issue:[1][8]

| Axis | Question being answered | Discard condition |
| --- | --- | --- |
| Problem clarity | Is the GitHub issue well specified, with enough information for a developer to know what behavior is expected? | Severity 2 or 3 from any reviewer |
| Test fairness | Do the FAIL_TO_PASS and PASS_TO_PASS tests check the right thing without rejecting valid alternative solutions? | Severity 2 or 3 from any reviewer |
| Difficulty plausibility | Could an experienced developer realistically resolve the task within the time budget the harness allows? | Severity 2 or 3 from any reviewer |
| Other major issues | Any further blocker (broken environment, dependency drift, ambiguous spec) that would invalidate evaluation | Any reviewer flag |

Alongside the discard rubric, annotators estimated how long an experienced developer would need to decide on and implement the fix, given the cleaned issue text. Those estimates feed the difficulty distribution discussed below.[1][2]
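The discard conditions in the rubric table can be read as a simple "any reviewer, any axis" rule. The following is a minimal sketch of that rule, not OpenAI's annotation tooling: the `Review` structure, the axis names, and the `should_discard` helper are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field

# Axis names are illustrative labels for the rubric rows, not official field names.
SEVERITY_AXES = ("problem_clarity", "test_fairness", "difficulty_plausibility")
DISCARD_THRESHOLD = 2  # severity 2 or 3 from any reviewer discards the sample


@dataclass
class Review:
    """One annotator's ratings for one task (hypothetical structure)."""
    severities: dict = field(default_factory=dict)  # axis name -> severity in 0..3
    other_major_issue: bool = False                 # the "other major issues" flag


def should_discard(reviews):
    """Return True if any single reviewer trips any discard condition."""
    for review in reviews:
        if review.other_major_issue:
            return True
        for axis in SEVERITY_AXES:
            if review.severities.get(axis, 0) >= DISCARD_THRESHOLD:
                return True
    return False


# Three reviewers per task, as in the triple-review design.
lenient = Review({"problem_clarity": 0, "test_fairness": 1, "difficulty_plausibility": 0})
strict = Review({"problem_clarity": 0, "test_fairness": 2, "difficulty_plausibility": 0})
print(should_discard([lenient, lenient, lenient]))  # no reviewer reached severity 2
print(should_discard([lenient, lenient, strict]))   # one reviewer is enough to discard
```

Because a single severe rating is sufficient, the rule is deliberately conservative: it trades dataset size for confidence in the tasks that remain.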

### Filtering outcomes

The annotation pass produced striking quality numbers. About 38.3% of the 1,699 candidates were flagged for underspecified problem statements, and 61.1% were flagged for unit tests that may unfairly mark valid solutions as incorrect. In total, roughly 68.3% of the reviewed samples were judged inadequate and removed, which leaves about 539 candidates that passed review. The 500 instances of SWE-bench Verified were drawn from that remainder.[1][3][7]

| Filtering stage | Count | Notes |
| --- | --- | --- |
| Original SWE-bench test set | 2,294 | Source for the candidate pool |
| Random sample for human review | 1,699 | Each reviewed by three annotators |
| Tasks failing problem-clarity rubric | ~651 (38.3%) | At least one reviewer flagged severity 2 or 3 |
| Tasks failing test-fairness rubric | ~1,038 (61.1%) | Categories overlap; many tasks failed multiple criteria |
| Tasks discarded for any reason | ~1,160 (68.3%) | Combined effect of all rubric axes |
| Final SWE-bench Verified | 500 | Released August 13, 2024 |
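The approximate counts in the table follow from applying the published percentages to the 1,699 reviewed candidates. The short calculation below reproduces that arithmetic; because the inputs are rounded percentages, each result may differ by one from a count obtained another way. The variable names are illustrative.

```python
REVIEWED = 1_699  # candidates sent for human review

# Percentages as reported in the filtering discussion.
flag_rates = {
    "underspecified problem statement": 0.383,
    "unfair unit tests": 0.611,
    "discarded for any reason": 0.683,
}

# Convert each rate to an approximate task count.
counts = {label: round(REVIEWED * rate) for label, rate in flag_rates.items()}

# Candidates that survived every rubric axis.
survivors = REVIEWED - counts["discarded for any reason"]

for label, n in counts.items():
    print(f"{label}: ~{n}")
print(f"passed review: ~{survivors}")
```

The number that passed review comes out slightly above 500, so the final dataset is a subset of the surviving candidates rather than all of them.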

### Difficulty distribution

Using the annotators' time estimates, the SWE-bench Verified team published a difficulty breakdown. Most tasks are short, which has been both a feature (cheap to evaluate) and a criticism (does not capture multi-day projects).[1][9]

| Difficulty | Estimated time to solve | Number of tasks | Share of dataset |
| --- | --- | --- | --- |
| Easy | < 15 minutes | 196 | 39.2% |
| Medium | 15 minutes to 1 hour | 259 | 51.8% |
| Hard | 1 to 4 hours | 42 | 8.4% |
| Very hard | > 4 hours | 3 | 0.6% |

A companion analysis by Epoch AI confirmed the skew, finding that 91% of Verified tasks could be completed in under an hour and that the median gold patch changes only a handful of lines of code. Easy fixes average around 5 changed lines, while the longer tasks average closer to 50 changed lines. Only 14.2% of instances require edits to more than one file.[9][10]
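As a quick consistency check, the shares in the difficulty table and the 91% under-an-hour figure can be recomputed from the per-bucket counts. This is only arithmetic on the numbers quoted above; the dictionary keys are informal labels.

```python
# Task counts per estimated-time bucket, as listed in the difficulty table.
difficulty_counts = {
    "< 15 minutes": 196,
    "15 minutes to 1 hour": 259,
    "1 to 4 hours": 42,
    "> 4 hours": 3,
}

total = sum(difficulty_counts.values())

# Percentage share of the dataset for each bucket.
shares = {bucket: 100 * n / total for bucket, n in difficulty_counts.items()}

# Tasks an experienced developer is estimated to finish in under an hour.
under_one_hour = shares["< 15 minutes"] + shares["15 minutes to 1 hour"]

print(f"total tasks: {total}")
for bucket, pct in shares.items():
    print(f"{bucket}: {pct:.1f}%")
print(f"under one hour: {under_one_hour:.1f}%")
```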

## How does SWE-bench Verified differ from the original SWE-bench?

Verified inherits the construction pipeline of [SWE-bench](/wiki/swe-bench) but differs in size, quality, and reporting role. The table below summarizes the key differences.

| Property | Original SWE-bench | SWE-bench Verified |
| --- | --- | --- |
| Release date | October 10, 2023 | August 13, 2024 |
| Lead organization | Princeton University | [OpenAI](/wiki/openai) Preparedness with Princeton |
| Number of tasks | 2,294 | 500 |
| Source repositories | 12 Python projects | Same 12 Python projects |
| Curation | None beyond automated test validation | 93 contracted annotators, triple review on 1,699 candidates |
| Discard rate | 0% (full set) | ~68.3% of reviewed candidates discarded |
| First published baseline | 1.96% ([Claude](/wiki/claude) 2, BM25 retrieval) | 33.2% ([GPT-4o](/wiki/gpt4) + Agentless scaffold) |
| Current public SOTA | Reported only on subsets | 88.6% ([Claude](/wiki/claude) Opus 4.8, May 2026) |
| Status | Maintained as historical reference | Deprecated by [OpenAI](/wiki/openai) for frontier evals (Feb 2026) |
| Recommended successor | [SWE-bench Pro](/wiki/swe-bench), SWE-bench Live | [SWE-bench Pro](/wiki/swe-bench) |

The core mechanical difference is curation. Same repositories, same harness, same FAIL_TO_PASS / PASS_TO_PASS grading. What changed was the confidence one could place in any given score. On Verified, an 80% resolve rate plausibly means the agent solved 400 problems whose fixes a panel of reviewers agreed on. On the full benchmark, an 80% resolve rate would have included tasks where the test suite let multiple non-equivalent patches pass, or where the issue text omitted information the agent had to infer from the gold patch.

## Repository and task structure

Like the parent benchmark, Verified draws from 12 Python open-source repositories. Each task includes the issue title and body, the codebase at the pre-fix commit, the dependency manifest, and the FAIL_TO_PASS and PASS_TO_PASS test sets needed to grade a candidate patch.

| Repository | Domain | Approximate share of Verified |
| --- | --- | --- |
| Django | Web framework | ~45% |
| SymPy | Symbolic mathematics | ~15% |
| Sphinx | Documentation generator | ~10% |
| [Matplotlib](/wiki/matplotlib) | Data visualization | ~8% |
| [scikit-learn](/wiki/scikit-learn) | [Machine learning](/wiki/machine_learning) library | ~7% |
| Flask | Web microframework | ~5% |
| Requests | HTTP library | ~3% |
| Pytest | Testing framework | ~2% |
| Astropy | Astronomy tools | ~2% |
| Xarray | N-D labeled arrays | ~1% |
| Seaborn | Statistical visualization | ~1% |
| Pylint | Code analysis | ~1% |

Django dominates because it has both a long history of high-quality issue tracking and a large, well-tested codebase that supplies many candidate tasks. The five largest repositories account for roughly 85% of the dataset, which mirrors the distribution in the parent SWE-bench. The skew has been criticized for over-indexing the benchmark on Django-flavored web framework idioms, though most of the strongest agent submissions report consistent performance across the repository mix.[1][9]
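The per-task contents described at the start of this section can be pictured as a small record. The sketch below is illustrative only: the field names are modeled on the public dataset but should be treated as assumptions, the test lists are assumed to arrive as JSON-encoded strings, the dependency manifest is omitted, and the example rows are made up rather than taken from the benchmark.

```python
import json
from collections import Counter
from dataclasses import dataclass


@dataclass
class Task:
    """Illustrative container for one benchmark task (field names are assumptions)."""
    instance_id: str
    repo: str
    base_commit: str          # codebase state just before the human fix
    problem_statement: str    # issue title and body
    fail_to_pass: list        # tests that must flip from failing to passing
    pass_to_pass: list        # tests that must keep passing


def parse_task(row):
    """Build a Task from a raw row whose test lists are JSON strings (assumed)."""
    return Task(
        instance_id=row["instance_id"],
        repo=row["repo"],
        base_commit=row["base_commit"],
        problem_statement=row["problem_statement"],
        fail_to_pass=json.loads(row["FAIL_TO_PASS"]),
        pass_to_pass=json.loads(row["PASS_TO_PASS"]),
    )


def repo_shares(tasks):
    """Percentage of tasks per repository, largest first."""
    counts = Counter(task.repo for task in tasks)
    total = sum(counts.values())
    return {repo: 100 * n / total for repo, n in counts.most_common()}


# Made-up rows for illustration only; these are not real benchmark instances.
example_rows = [
    {"instance_id": "example__web-1", "repo": "example/web", "base_commit": "abc123",
     "problem_statement": "Form validation crashes on empty input.",
     "FAIL_TO_PASS": '["tests/test_forms.py::test_empty_input"]',
     "PASS_TO_PASS": '["tests/test_forms.py::test_valid_input"]'},
    {"instance_id": "example__web-2", "repo": "example/web", "base_commit": "def456",
     "problem_statement": "URL resolver ignores trailing slash.",
     "FAIL_TO_PASS": '["tests/test_urls.py::test_trailing_slash"]',
     "PASS_TO_PASS": '[]'},
    {"instance_id": "example__math-1", "repo": "example/math", "base_commit": "0a1b2c",
     "problem_statement": "Simplification returns the wrong sign.",
     "FAIL_TO_PASS": '["tests/test_simplify.py::test_sign"]',
     "PASS_TO_PASS": '["tests/test_simplify.py::test_identity"]'},
]

tasks = [parse_task(row) for row in example_rows]
shares = repo_shares(tasks)
for repo, pct in shares.items():
    print(f"{repo}: {pct:.1f}%")
```

Run over the real dataset, a tally like `repo_shares` is how a repository-share table such as the one above would be produced.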

## Evaluation framework

### Containerized execution

Verified uses the same Docker harness as the broader SWE-bench family. Each task is associated with a specific environment image that fixes the Python version and the exact dependency versions present at the issue's commit. This containerization is what lets different labs compare scores: without pinned environments, NumPy or pytest version drift alone could swing scores by several percentage points.[2][11]

### How is a task graded?

A task is marked resolved if and only if every FAIL_TO_PASS test passes after applying the agent's patch and every PASS_TO_PASS test continues to pass. A patch that fixes the bug but breaks an unrelated regression test counts as a failure. A patch that does not apply cleanly because of whitespace drift also counts as a failure. The strictness of the grading is part of why Verified produces interpretable signals: an agent that achieves 80% has met both the bug-fix and the no-regression bars on 400 separate tasks.[1][2][11]
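The grading rule is compact enough to express directly. The sketch below assumes a hypothetical result format, a dictionary from test name to a status string, and a boolean recording whether the patch applied; the official harness has its own log parsing and status values, which are not reproduced here.

```python
def is_resolved(patch_applied, test_results, fail_to_pass, pass_to_pass):
    """Apply the all-or-nothing rule: every listed test must pass.

    test_results maps test name -> status string; "PASSED" is an assumed
    status label. A test missing from the results counts as not passing.
    """
    if not patch_applied:
        return False  # a patch that does not apply cleanly is a failure
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_results.get(name) == "PASSED" for name in required)


def resolve_rate(outcomes):
    """Percentage of tasks resolved, given one boolean per task."""
    return 100 * sum(outcomes) / len(outcomes)


# Hypothetical test names for a single task.
f2p = ["test_bug_is_fixed"]
p2p = ["test_existing_behavior"]

fixed = {"test_bug_is_fixed": "PASSED", "test_existing_behavior": "PASSED"}
regressed = {"test_bug_is_fixed": "PASSED", "test_existing_behavior": "FAILED"}

print(is_resolved(True, fixed, f2p, p2p))       # bug fixed, nothing broken
print(is_resolved(True, regressed, f2p, p2p))   # bug fixed but a regression test broke
print(is_resolved(False, fixed, f2p, p2p))      # patch did not apply
```

Treating a missing test result as a failure mirrors the strictness described above: the rule gives no partial credit.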

### Reporting conventions

The primary metric is **% Resolved**, the share of the 500 tasks for which the agent's patch passes the grading rule. Secondary metrics that became common in the late-2024 to early-2026 reporting cycle include:

- **Pass@k**: success rate when the agent gets k independent attempts.
- **Cost per resolved task**: dollar cost of the model and tools required to resolve a task on average.
- **Wall clock per task**: median seconds the agent runs before producing a patch.
- **Variance across reruns**: standard deviation of resolve rate across 3 to 5 evaluation seeds, since stochastic agents can swing several points run-to-run.
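The secondary metrics above can be computed with a few lines of standard-library Python. This is a sketch under stated assumptions: the `pass_at_k` function uses the combinatorial estimator common in the code-generation literature, which individual leaderboards may or may not use, and the function names and sample numbers are hypothetical.

```python
from math import comb
from statistics import mean, stdev


def pass_at_k(n, c, k):
    """Estimate pass@k for one task from n sampled attempts, c of them successful.

    Computes 1 - C(n - c, k) / C(n, k): the chance that a random size-k
    subset of the n attempts contains at least one success.
    """
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if n - c < k:
        return 1.0  # fewer than k failures, so every subset has a success
    return 1.0 - comb(n - c, k) / comb(n, k)


def cost_per_resolved(total_cost_usd, resolved_count):
    """Average dollar cost per resolved task for a full benchmark run."""
    if resolved_count == 0:
        return float("inf")
    return total_cost_usd / resolved_count


def rerun_spread(resolve_rates):
    """Mean and sample standard deviation of resolve rate across seeds."""
    return mean(resolve_rates), stdev(resolve_rates)


# Hypothetical numbers for illustration.
print(pass_at_k(n=10, c=3, k=1))
print(cost_per_resolved(total_cost_usd=800.0, resolved_count=400))
print(rerun_spread([79.8, 80.6, 81.2]))
```

Reporting the spread alongside the headline number makes it easier to judge whether a one- or two-point difference between agents exceeds run-to-run noise.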

From mid-2025 onward, several leaderboards (notably swebench.com, llm-stats.com, and Scale's labs.scale.com) added cost columns next to the headline accuracy figure, partly in response to criticism that high-cost reasoning agents were inflating reported scores without delivering proportional value.[4][5][12]

## Historical leaderboard

Verified scores climbed faster than almost any other coding benchmark. The progression below lists notable reported scores at major checkpoints from launch through 2026; most rows were the public state of the art when posted, and a few record other labs' best results for comparison.

| Date | Reported % Resolved | Model / agent | Organization | Notes |
| --- | --- | --- | --- | --- |
| Aug 2024 | 33.2% | [GPT-4o](/wiki/gpt4) + Agentless | [OpenAI](/wiki/openai) | First published baseline at launch |
| Aug 2024 | ~30% | [Devin](/wiki/devin) v0 | Cognition Labs | Topped early Verified leaderboard from Cognition |
| Sep 2024 | 45.0% | Previous SOTA agent (composite scaffolds) | Various | Pre-Claude 3.5 Sonnet plateau |
| Oct 2024 | 49.0% | [Claude](/wiki/claude) 3.5 Sonnet (new) | [Anthropic](/wiki/anthropic) | First single-model crossing of 49%[13] |
| Dec 2024 | 53.0% | OpenAI o1 + scaffold | [OpenAI](/wiki/openai) | Reasoning models begin to contribute |
| Feb 2025 | 62.3% | [Claude](/wiki/claude) 3.7 Sonnet | [Anthropic](/wiki/anthropic) | Extended thinking helps coding |
| Feb 2025 | 70.3% | [Claude](/wiki/claude) 3.7 Sonnet (extended thinking + custom scaffold) | [Anthropic](/wiki/anthropic) | First reported 70%+ Verified score |
| May 2025 | 64.93% | [Claude](/wiki/claude) Sonnet 4 | [Anthropic](/wiki/anthropic) | New Sonnet baseline |
| Aug 2025 | 74.5% | [Claude](/wiki/claude) Opus 4.1 | [Anthropic](/wiki/anthropic) | Headline number for many product launches |
| Aug 2025 | ~75% | [GPT-5](/wiki/gpt) (Codex scaffold) | [OpenAI](/wiki/openai) | OpenAI's first GPT-5 reporting on Verified |
| Nov 2025 | 80.9% | [Claude](/wiki/claude) Opus 4.5 | [Anthropic](/wiki/anthropic) | First confirmed 80%+ score |
| Feb 2026 | 85.0% | GPT-5.3 Codex | [OpenAI](/wiki/openai) | OpenAI's last headline Verified release |
| Feb 2026 | 80.6% | Gemini 3.1 Pro | Google | Google's tied best |
| Feb 2026 | 80.2% | MiniMax M2.5 | MiniMax | Open-weight tied score |
| Apr 2026 | 87.6% | [Claude](/wiki/claude) Opus 4.7 (1M context) | [Anthropic](/wiki/anthropic) | Public SOTA as of April 2026, after OpenAI's February deprecation |
| May 2026 | 88.6% | [Claude](/wiki/claude) Opus 4.8 | [Anthropic](/wiki/anthropic) | Highest generally-available score (released May 28, 2026) |
| 2026 | 93.9% | Claude Mythos | [Anthropic](/wiki/anthropic) | Internal cybersecurity model; not generally available |

The trajectory looks like a textbook capability ramp. From August 2024 to May 2026, the public state-of-the-art among generally-available models rose from 33.2% to 88.6%, an increase of more than 55 percentage points in under two years. Three forces drove the gain: stronger base models (Claude 3.5 to 4.8, GPT-4o to GPT-5.3), better reasoning (extended thinking, chain-of-thought, self-verification), and more capable agent scaffolds (Agentless, SWE-agent, OpenHands, Claude Code).

### What is the highest SWE-bench Verified score?

As of June 2026, the highest score by a generally-available model is 88.6%, posted by [Anthropic](/wiki/anthropic)'s [Claude](/wiki/claude) Opus 4.8 (released May 28, 2026), which edged out the April 2026 mark of 87.6% set by Claude Opus 4.7.[29][30] Anthropic's internal Claude Mythos model is reported at 93.9%, but it is not generally available and is excluded from the public leaderboard.[3][29] The public top of the leaderboard is dominated by Anthropic: the steel.dev aggregation snapshot (updated June 12, 2026) lists Claude Opus 4.8 (88.6%), Claude Opus 4.7 (87.6%), Claude Opus 4.5 (80.9%), and Claude Opus 4.6 (80.8%) above the strongest non-Anthropic entries, which cluster near 80% (Gemini 3.1 Pro 80.6%, MiniMax M2.5 80.2%, GPT-5.2 80.0%).[30] Because OpenAI deprecated Verified for frontier evaluation in February 2026, these figures should be read as the closing state of the benchmark's competitive era rather than a live race.

### Early Verified leaderboard and the role of Cognition Devin

In the days immediately after the August 13, 2024 release, the publicly visible Verified leaderboard was topped by Cognition Labs' [Devin](/wiki/devin) agent. Cognition had spent the first half of 2024 building Devin around the original SWE-bench full set and was in a strong position to evaluate against the curated subset on day one. Devin's early Verified results were in the high 20s to low 30s, comparable to the GPT-4o + Agentless baseline OpenAI published in the same week. This brief period mattered for the benchmark's perception: it showed that the leaderboard would be contested by both research labs and commercial agent vendors, and it set the template for marketing-quality SWE-bench reporting that subsequent product launches followed.[14][15]

### Top-15 snapshot (April 2026)

The table below is a snapshot of the public SWE-bench Verified leaderboard from April 2026, sourced from the public swebench.com leaderboard, llm-stats.com aggregation, and lab-reported scores. Internal-only models like Claude Mythos are excluded.[4][5]

| Rank | Model | Organization | % Resolved | Notes |
| --- | --- | --- | --- | --- |
| 1 | [Claude](/wiki/claude) Opus 4.7 | [Anthropic](/wiki/anthropic) | 87.6% | 1M context, released April 16, 2026 |
| 2 | GPT-5.3 Codex | [OpenAI](/wiki/openai) | 85.0% | OpenAI's final Verified release before deprecation |
| 3 | [Claude](/wiki/claude) Opus 4.5 | [Anthropic](/wiki/anthropic) | 80.9% | First confirmed public 80%+ result |
| 4 | [Claude](/wiki/claude) Opus 4.6 | [Anthropic](/wiki/anthropic) | 80.8% | Mid-Q1 2026 release |
| 5 | Gemini 3.1 Pro | Google | 80.6% | Google DeepMind |
| 6 | MiniMax M2.5 | MiniMax | 80.2% | Open-weight |
| 7 | GPT-5.2 | [OpenAI](/wiki/openai) | 80.0% | Pre-Codex variant |
| 8 | [Claude](/wiki/claude) Sonnet 4.6 | [Anthropic](/wiki/anthropic) | 79.6% | Smaller, cheaper Anthropic model |
| 9 | Qwen3.6 Plus | Alibaba | 78.8% | Released April 2026 |
| 10 | Gemini 3 Flash | Google | 78.0% | Cheaper Google variant |
| 11 | MiMo-V2-Pro | Xiaomi | 78.0% | Open-source 1T-param model |
| 12 | GLM-5 | Zhipu AI | 77.8% | Open-source, 744B parameters |
| 13 | Muse Spark | Meta | 77.4% | Meta Superintelligence Labs flagship |
| 14 | [Claude](/wiki/claude) Sonnet 4.5 | [Anthropic](/wiki/anthropic) | 77.2% | Late-2025 release |
| 15 | Kimi K2.5 | Moonshot AI | 76.8% | Open-source |

In May 2026, after this snapshot, [Anthropic](/wiki/anthropic) released [Claude](/wiki/claude) Opus 4.8, which posted 88.6% on Verified and 69.2% on the recommended successor [SWE-bench Pro](/wiki/swe-bench), taking the top generally-available position.[29][30]

## Top scoring approaches and frameworks

The agent scaffolds that dominated the Verified leaderboard tended to combine an interactive shell or editor with a search tool, a test runner, and some form of self-verification. Several reference designs are worth singling out.

### Agentless

Agentless, developed by researchers at the University of Illinois Urbana-Champaign, eschews an autonomous agent loop in favor of a fixed three-stage pipeline: localize the bug, repair it, and validate the fix. OpenAI used Agentless paired with [GPT-4o](/wiki/gpt4) for the original 33.2% Verified baseline because the simple pipeline made it easy to attribute scores to model capability rather than scaffold engineering. On the cleaned task set the scaffold scored roughly double its 16% on the full SWE-bench, and it remained the most popular reference for cheap, reproducible Verified runs through 2025.[1][16]

### SWE-agent

SWE-agent, the Princeton team's own agent, was described in a paper published at NeurIPS 2024 that introduced the Agent-Computer Interface (ACI): a curated set of shell commands that make repository navigation and editing easier for language models. SWE-agent scaffolds powered many of the open-source Verified entries, including the Live-SWE-agent variant that reached 79.2% with [Claude](/wiki/claude) Opus 4.5 in late 2025.[17]

### OpenHands (formerly OpenDevin)

OpenHands consolidated several research scaffolds into a community-driven open-source platform supporting browser, shell, and editor tools. It became the de facto open-source counterweight to closed agent products like [Devin](/wiki/devin) and was widely used in academic Verified submissions. The framework supports multiple LLM backends and ships with reusable plugins for retrieval, multi-attempt sampling, and verifier-based selection.[18]

### Aider

[Aider](/wiki/aider), Paul Gauthier's open-source command-line coding assistant, runs benchmarks on Verified to compare different models on whole-file edits. It is mostly used as an interactive tool, but its public benchmark page played an outsized role in popularizing cost-adjusted reporting and the practice of comparing the same scaffold across many models.[19]

### Claude Code

[Claude Code](/wiki/claude_code), Anthropic's terminal-based agent introduced in early 2025, became the reference scaffold for Anthropic's own Verified numbers. Its bash tool, file editor, and computer-use tool let the underlying [Claude](/wiki/claude) model do most of the heavy lifting with a relatively shallow surrounding harness. Claude Code's design influenced how other labs approached agent engineering, especially the move toward giving the model more direct shell access rather than wrapping it in heavy intermediation.[20]

### RA-Aid, Moatless, and AutoCodeRover

A second tier of agents has held steady positions in the mid-leaderboard. RA-Aid is a research agent built on top of OpenHands with a focus on retrieval-augmented planning. Moatless Tools is a lightweight scaffold popular with budget-conscious researchers. AutoCodeRover, developed at the National University of Singapore, uses spectrum-based fault localization to narrow the search space before patch generation. Each contributed important ideas (decoupled localization, deterministic patches, coverage-guided edits) that later commercial agents absorbed.[21][22]

### Other agents commonly seen on the leaderboard

| Agent / framework | Developer | Approach |
| --- | --- | --- |
| Amazon Q Developer Agent | Amazon | Enterprise agent with AWS tooling |
| [Atlassian Rovo Dev](/wiki/rovo_dev) | Atlassian | Agentic coding inside Jira and Bitbucket |
| [Cursor](/wiki/cursor) Composer | Anysphere | IDE-based agent with human-in-the-loop edits |
| Codex | [OpenAI](/wiki/openai) | Cloud-based agent in sandboxed environments |
| Augment Code | Augment | Context-aware agent for large codebases |
| iSWE-Agent | IBM | Java-focused agent used on Multi-SWE-bench |
| CodeR | Independent | Multi-agent design with role specialization |

### Patterns common to the strongest submissions

Independent of any specific framework, the highest-scoring Verified entries through 2025 and 2026 tended to share a few patterns. They run multiple candidate patches and use a verifier (often an LLM-as-judge with regression testing) to pick the best one. They allow generous tool use (file search, AST queries, language servers, test runners) instead of restricting the agent to a fixed action set. They feed the test output back into the model in a tight loop so the agent can self-correct. And they explicitly handle whitespace and formatting drift in patch generation, which avoided the silent grading failures that plagued earlier submissions.
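The first of these patterns, sampling several candidate patches and letting a verifier choose, can be sketched as a small selection function. This is a generic best-of-n illustration, not the implementation of any named agent: `passes_regression` and `judge_score` are hypothetical callables standing in for a regression-test run and an LLM-as-judge score.

```python
from typing import Callable, Optional, Sequence


def select_patch(
    candidates: Sequence[str],
    passes_regression: Callable[[str], bool],
    judge_score: Callable[[str], float],
) -> Optional[str]:
    """Pick one patch from several candidates.

    Step 1: drop candidates that break the regression tests.
    Step 2: among the survivors, return the one the judge scores highest.
    Returns None when no candidate survives step 1.
    """
    viable = [patch for patch in candidates if passes_regression(patch)]
    if not viable:
        return None
    return max(viable, key=judge_score)


# Stub verifier and judge for illustration; a real system would run tests
# in the task's container and query a model for the score.
candidate_patches = ["patch_a", "patch_b", "patch_c"]
regression_ok = {"patch_a": True, "patch_b": False, "patch_c": True}
scores = {"patch_a": 0.4, "patch_b": 0.9, "patch_c": 0.7}

chosen = select_patch(
    candidate_patches,
    passes_regression=lambda patch: regression_ok[patch],
    judge_score=lambda patch: scores[patch],
)
print(chosen)
```

Filtering on regression tests before scoring means a highly rated patch that breaks existing behavior, like `patch_b` in the stub data, is never selected.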

## Why did OpenAI deprecate SWE-bench Verified?

On February 23, 2026, OpenAI's Mia Glaese and Olivia Watkins published "Why SWE-bench Verified no longer measures frontier coding capabilities." The post argued that the benchmark had reached the end of its useful life as a frontier evaluation and recommended SWE-bench Pro as the new community standard. OpenAI's core conclusion was blunt: "improvements on SWE-bench Verified no longer reflect meaningful improvements in models' real-world software development abilities," and "instead, they increasingly reflect how much the model was exposed to the benchmark at training time."[3]

### Findings cited in the deprecation post

OpenAI's case rested on three pillars:

1. **Broken specifications dominate the unsolved tail.** OpenAI audited 138 problems that the lab's models repeatedly failed to solve, representing 27.6% of the dataset. They reported that about 59.4% of the audited problems had flawed test cases. In forty-nine problems (35.5% of the audit) the tests were too narrow, rejecting functionally correct submissions; in twenty-six (18.8%) they were too wide, requiring features that were never mentioned in the issue.[3][23]
2. **Training-data contamination is now pervasive.** Every frontier model OpenAI tested could reproduce verbatim portions of the gold patch or problem statement when prompted with only a Verified Task ID. Models implicated in the audit included GPT-5.2, [Claude](/wiki/claude) Opus 4.5, and Gemini 3 Flash. In one example, GPT-5.2's chain-of-thought traces revealed knowledge of unspecified test requirements that could only have come from training data exposure.[3]
3. **Saturation has compressed the meaningful signal.** With public scores in the high 80s and an internal Anthropic score of 93.9%, the headroom remaining on Verified was within the noise of the harness. Marginal improvements no longer reflect generalizable capability; they reflect how much of the dataset the model was exposed to during training.[3]

### Recommendation

OpenAI recommended migrating to [SWE-bench Pro](/wiki/swe-bench), which uses 1,865 longer tasks across more diverse public, held-out, and commercial codebases. Pro tasks are explicitly chosen to require 1 to 4 hours of human effort or more, in contrast with Verified's bias toward sub-hour fixes. Initial Pro scores from frontier models ran 20 to 40 percentage points below their Verified scores, which OpenAI argued was the kind of measurement headroom that frontier evaluation requires. A gap of similar size persisted into mid-2026: [Claude](/wiki/claude) Opus 4.8 scored 88.6% on Verified but only 69.2% on SWE-bench Pro, a difference of 19.4 points.[3][24][29]

### Industry response

The deprecation post was discussed extensively on the Latent Space podcast in the same week, with Glaese and Watkins arguing that the move was about measurement validity rather than a critique of Verified's design. Anthropic, Google DeepMind, and Meta did not immediately stop reporting Verified numbers, but most subsequent product launches paired the Verified score with a Pro score. Several outlets (CodeSOTA, blockchain.news, Latent Space, marc0.dev) wrote retrospectives within weeks framing the deprecation as a generational shift in how the field measures coding ability.[3][6][23][24][25]

## Successors and complementary benchmarks

OpenAI's deprecation accelerated work on a family of successor benchmarks that aim to restore measurement validity. The most influential are summarized below.

| Successor | Released by | Released | Size | Why it matters |
| --- | --- | --- | --- | --- |
| [SWE-bench Pro](/wiki/swe-bench) | Scale AI | 2025 | 1,865 long-horizon tasks across public, held-out, and commercial codebases | OpenAI's recommended successor; longer tasks, less contamination |
| SWE-bench Multimodal | Princeton SWE-bench team | October 2024 | 619 JavaScript tasks with embedded images | Tests visual grounding (UI screenshots, error images, diagrams) |
| SWE-bench Multilingual | Princeton SWE-bench team | 2025 | 300 tasks across 9 languages (C, C++, Go, Java, JS, TS, PHP, Ruby, Rust) | Breaks the Python monoculture |
| Multi-SWE-bench | ByteDance Seed | 2025 (NeurIPS 2025 D&B) | 1,632 instances across 7 languages | Independent multilingual effort |
| SWE-bench Live | Microsoft Research | May 2025 | 1,565+ instances; updated monthly | Anti-contamination via post-cutoff issues |
| SWE-Lancer | [OpenAI](/wiki/openai) | February 2025 | 1,400+ Upwork tasks ($1M payouts) | Dollars-earned metric; freelance simulation |
| SWE-rebench | Independent | 2025 | Continuously updated Python set | Decontamination focus |
| SWE-bench+ | OpenLM.ai | October 2024 | Filtered subset of original SWE-bench | Removed leaked instances |

### SWE-Lancer

[OpenAI](/wiki/openai) launched SWE-Lancer in February 2025 to evaluate models on real freelance software engineering tasks sourced from Upwork postings for the Expensify codebase. The benchmark covers more than 1,400 tasks with real-world payouts totaling roughly $1 million, ranging from $50 bug fixes to $32,000 feature implementations. Tasks are split between individual contributor (IC) work and managerial decisions over technical proposals. Initial reported scores were modest: Claude 3.5 Sonnet completed 26.2% of IC tasks and 44.9% of managerial tasks, while OpenAI's o1 reached 20.3% and 46.3% respectively. The dollars-earned metric resonated outside the academic community and made SWE-Lancer popular with industry analysts.[26]

### SWE-bench Multimodal

Introduced in October 2024 (arXiv:2410.03859), SWE-bench Multimodal extends the harness to JavaScript repositories with image-bearing issues. The dataset contains 619 task instances drawn from 17 user-facing JavaScript repositories, with 862 images embedded across the problem statements. Image categories include code screenshots (194 instances), diagrams (107), error messages (54), digital art (38), maps (35), and data visualizations (28). Multimodal scores have lagged Verified scores by a wide margin because the task requires both visual grounding and code reasoning, and because JavaScript test frameworks (Jest, Mocha, Playwright) make patch validation more involved than it is for the Python set.[27]

## Limitations

### Test suite weaknesses

OpenAI's deprecation audit confirmed earlier independent findings that Verified's test suites are not always trustworthy. An empirical study published as arXiv:2503.15223 reported that more than 15% of Verified instances have incomplete test patches that allow incorrect or partial solutions to pass. Specifically, 12.50% of passing patches were judged functionally or semantically incorrect, and 9.82% were incomplete. Analyses built on tools such as UTBoost and PatchDiff suggested that leaderboard scores may be inflated by 6 to 7 percentage points because of inadequate tests. OpenAI's own audit reported even higher rates of broken specifications among the hardest unsolved tasks.[3][28]
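
To see why an incomplete test patch lets a partial fix through, it helps to look at the grading rule itself. The following is a simplified sketch of FAIL_TO_PASS / PASS_TO_PASS grading, not the harness's actual code; the function name, test names, and task are hypothetical.

```python
from typing import Dict, List


def is_resolved(results: Dict[str, bool],
                fail_to_pass: List[str],
                pass_to_pass: List[str]) -> bool:
    """Return True only if every FAIL_TO_PASS and PASS_TO_PASS test passed.

    `results` maps a test name to whether it passed after the candidate
    patch was applied; a test missing from `results` counts as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(results.get(name, False) for name in required)


# Hypothetical task: the issue asks for correct handling of both empty and
# None input, but the test patch only added a test for the empty case.
fail_to_pass = ["test_empty_input"]
pass_to_pass = ["test_regular_input"]

# A partial patch that fixes only the empty case still satisfies the grader,
# because no listed test exercises the None case.
partial_patch = {"test_empty_input": True, "test_regular_input": True}
partial_resolved = is_resolved(partial_patch, fail_to_pass, pass_to_pass)

# A patch that breaks previously passing behavior is rejected.
regressing_patch = {"test_empty_input": True, "test_regular_input": False}
regression_resolved = is_resolved(regressing_patch, fail_to_pass, pass_to_pass)

print(partial_resolved, regression_resolved)
```

The grader can only be as strict as the listed tests: anything the test patch does not exercise is invisible to the resolve rate, which is how incorrect or incomplete patches end up counted as solved.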

### Data contamination

Verified inherits the parent benchmark's contamination problem. More than 94% of SWE-bench issues were filed before the knowledge cutoff dates of major pre-trained language models. Subsequent audits, including the OpenLM.ai SWE-bench+ analysis and OpenAI's February 2026 study, demonstrated that frontier models could regenerate parts of the gold patch or problem statement when prompted with only a Task ID. Even SWE-bench Live's monthly refresh did not entirely solve the problem because tasks rotate into model training data on a similar timescale.[3][7]
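
A pre-cutoff share like the 94% figure can be computed from issue creation dates. The sketch below is a minimal illustration, assuming each task record carries a creation date; the sample dates and the cutoff are made up for the example and are not drawn from the dataset or from any model's documentation.

```python
from datetime import date
from typing import Iterable


def share_before_cutoff(created: Iterable[date], cutoff: date) -> float:
    """Percentage of issues created strictly before the cutoff date."""
    dates = list(created)
    if not dates:
        raise ValueError("no issue dates supplied")
    earlier = sum(1 for d in dates if d < cutoff)
    return 100 * earlier / len(dates)


# Hypothetical creation dates and knowledge cutoff, for illustration only.
sample_dates = [
    date(2019, 5, 1),
    date(2021, 3, 14),
    date(2022, 8, 30),
    date(2023, 1, 9),
    date(2024, 2, 2),
]
assumed_cutoff = date(2023, 10, 1)

exposed_share = share_before_cutoff(sample_dates, assumed_cutoff)
print(f"{exposed_share:.1f}% of sample issues predate the assumed cutoff")
```

A date comparison of this kind only bounds potential exposure. It does not show that a model memorized a task, which is why the audits cited above also probed models directly with Task IDs.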

### Repository selection bias

The 12 source repositories are all open-source Python projects with strong test cultures. Many real-world codebases have sparse tests, proprietary dependencies, or architectural patterns not represented in this set. As a result, high Verified scores do not necessarily predict performance on arbitrary production codebases. Django's dominance in particular skews the dataset toward web-framework code, which is over-represented relative to the broader software ecosystem.[9][10]

### Task-length skew

Epoch AI's analysis showed that the majority of Verified tasks are relatively simple. About 91% can be completed by a human in under one hour and 39.2% in under 15 minutes. The benchmark therefore primarily measures an agent's ability to fix straightforward bugs rather than tackle architectural changes or large feature implementations. This skew is a major reason successor benchmarks like SWE-bench Pro and SWE-Lancer reset the difficulty floor.[9][10]

### Python-only coverage

Verified is Python-only. Performance on Verified does not reliably generalize to JavaScript, Java, C++, Go, or Rust. SWE-bench Multilingual, Multi-SWE-bench, and SWE-bench Live partially address this gap, but Verified's outsized leaderboard role meant the field's headline metric was Python-bound through early 2026.[3][27]

### Cost and reproducibility

A full Verified evaluation is resource-intensive: at minimum 120 GB of free disk space, 16 GB of RAM, and 8 CPU cores, with 32 GB of RAM recommended for parallel execution. Cloud evaluation through Modal removes the local hardware burden but introduces direct dollar cost. A reasoning-heavy agent run with [Claude](/wiki/claude) Opus 4.7 and 200K thinking tokens per task can cost $10 to $20 per attempt, which makes top-of-leaderboard reproductions expensive even for well-funded labs.[2][11]
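
The per-attempt figure scales quickly across the whole dataset. As a back-of-the-envelope sketch, assuming one attempt per task over all 500 tasks and counting model spend only (the helper name and the attempts parameter are illustrative):

```python
def run_cost(tasks: int, cost_per_attempt: float, attempts_per_task: int = 1) -> float:
    """Estimated model spend in dollars for one pass over the benchmark."""
    return tasks * cost_per_attempt * attempts_per_task


VERIFIED_TASKS = 500

low_estimate = run_cost(VERIFIED_TASKS, 10.0)   # $10 per attempt
high_estimate = run_cost(VERIFIED_TASKS, 20.0)  # $20 per attempt

print(f"Single pass: ${low_estimate:,.0f} to ${high_estimate:,.0f}")
```

Under these assumptions a single pass costs between $5,000 and $10,000, before any retries, repeated seeds, or infrastructure charges.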

### Saturation

The most consequential limitation is the one OpenAI cited in February 2026: with public scores in the high 80s and Anthropic's internal Claude Mythos at 93.9%, the gap between frontier models and the score ceiling is now within harness noise. Verified can still discriminate between weaker models, but the lab competition that defined the late-2024 to late-2025 era has effectively ended.[3]
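
One way to put a rough number on that noise is the sampling error of a resolve rate measured on 500 tasks. The sketch below assumes tasks are independent pass/fail trials and uses a normal approximation; it captures sampling variability only, not run-to-run harness nondeterminism, and the function name is illustrative.

```python
import math


def resolve_rate_margin(rate: float, n_tasks: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval.

    `rate` is a resolve rate between 0 and 1; the result is expressed in
    percentage points. z = 1.96 corresponds to a 95% interval.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    standard_error = math.sqrt(rate * (1.0 - rate) / n_tasks)
    return 100 * z * standard_error


# Public score quoted above, on the 500-task dataset.
margin_public = resolve_rate_margin(0.886, 500)
print(f"88.6% +/- {margin_public:.1f} percentage points")
```

At an 88.6% resolve rate the 95% margin is roughly 2.8 percentage points, so a one- or two-point lead between two frontier models is not distinguishable from sampling variation on this dataset alone.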

## See also

- [Global-MMLU](/wiki/global_mmlu)
- [FormulaOne](/wiki/formulaone)
- [MMLU-Redux](/wiki/mmlu_redux)
- [PutnamBench](/wiki/putnambench)
- [ScienceAgentBench](/wiki/scienceagentbench)
- [SWE-bench](/wiki/swe-bench)
- [HumanEval](/wiki/humaneval)
- [LiveCodeBench](/wiki/livecodebench)
- [SWE-agent](/wiki/swe-agent)
- [Devin](/wiki/devin)
- [Aider](/wiki/aider)
- [Cursor](/wiki/cursor)
- [Claude Code](/wiki/claude_code)
- [Claude](/wiki/claude)
- [GPT-4](/wiki/gpt4)
- [Anthropic](/wiki/anthropic)
- [OpenAI](/wiki/openai)
- [Code generation](/wiki/ai_code_generation)
- [Benchmark](/wiki/benchmark)

## References

[1] OpenAI. (August 13, 2024). "Introducing SWE-bench Verified." https://openai.com/index/introducing-swe-bench-verified/

[2] SWE-bench Verified official page. https://www.swebench.com/verified.html

[3] OpenAI. (February 23, 2026). "Why SWE-bench Verified no longer measures frontier coding capabilities." Mia Glaese and Olivia Watkins. https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/

[4] llm-stats.com. SWE-bench Verified leaderboard (April 2026). https://llm-stats.com/benchmarks/swe-bench-verified

[5] Marco Patzelt. (April 2026). "SWE-Bench Verified Leaderboard, Claude Opus 4.7 Leads." https://www.marc0.dev/en/leaderboard

[6] TokenMix. (April 2026). "SWE-Bench 2026: Claude Opus 4.7 Wins 87.6% vs GPT-5.3 85.0%." https://tokenmix.ai/blog/swe-bench-2026-claude-opus-4-7-wins

[7] OpenLM.ai. (October 2024). "SWE-Bench+: Enhanced Coding Benchmark for LLMs." arXiv:2410.06992. https://openlm.ai/swe-bench/

[8] OpenAI. (2024). "SWE-b Annotation Instructions." Annotator-facing PDF. https://cdn.openai.com/introducing-swe-bench-verified/swe-b-annotation-instructions.pdf

[9] Epoch AI. "SWE-bench Verified." https://epoch.ai/benchmarks/swe-bench-verified

[10] Epoch AI. (2024). "What skills does SWE-bench Verified evaluate?" https://epoch.ai/blog/what-skills-does-swe-bench-verified-evaluate

[11] SWE-bench Docker setup and cloud evaluation guide. https://www.swebench.com/SWE-bench/guides/docker_setup/

[12] Scale Labs. SWE-Bench Pro Leaderboard (Public). https://labs.scale.com/leaderboard/swe_bench_pro_public

[13] Anthropic. (October 2024). "Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet." https://www.anthropic.com/research/swe-bench-sonnet

[14] Cognition Labs. (2024). "SWE-bench Technical Report." https://cognition.ai/blog/swe-bench-technical-report

[15] SWE-bench leaderboard archive. https://www.swebench.com/

[16] "Agentless: Demystifying LLM-based Software Engineering Agents." UIUC and Princeton, 2024.

[17] Yang, J. et al. (2024). "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering." NeurIPS 2024. https://swe-agent.com/

[18] OpenHands (formerly OpenDevin) project page. https://github.com/All-Hands-AI/OpenHands

[19] Aider benchmarks. https://aider.chat/docs/leaderboards/

[20] Anthropic. (2025). Claude Code documentation. https://docs.anthropic.com/en/docs/claude-code

[21] AutoCodeRover. "Autonomous Program Improvement." arXiv:2404.05427.

[22] Moatless Tools repository. https://github.com/aorwall/moatless-tools

[23] blockchain.news. (2026). "OpenAI Abandons SWE-bench Verified After Finding 59% of Failed Tests Were Flawed." https://blockchain.news/news/openai-abandons-swe-bench-verified-contamination-flawed-tests

[24] Latent Space. (2026). "The End of SWE-Bench Verified, with Mia Glaese & Olivia Watkins." https://www.latent.space/p/swe-bench-dead

[25] CodeSOTA. (2026). "Is SWE-bench Verified Contaminated? OpenAI Shifts to SWE-bench Pro." https://www.codesota.com/news/swe-bench-contamination-debate

[26] OpenAI. (February 2025). "Introducing the SWE-Lancer Benchmark." https://openai.com/index/swe-lancer/

[27] Yang, J. et al. (2024). "SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?" arXiv:2410.03859.

[28] "Are 'Solved Issues' in SWE-bench Really Solved Correctly? An Empirical Study." arXiv:2503.15223.

[29] MacRumors. (May 28, 2026). "Anthropic Launches Claude Opus 4.8 With Gains in Coding and Honesty." https://www.macrumors.com/2026/05/28/anthropic-claude-opus-4-8/

[30] Steel.dev. "SWE-bench Verified Leaderboard 2026: Latest Coding Agent Scores" (snapshot updated June 12, 2026). https://leaderboard.steel.dev/leaderboards/swe-bench-verified/

## External links

- [SWE-bench Verified official page](https://www.swebench.com/verified.html)
- [Introducing SWE-bench Verified (OpenAI, August 2024)](https://openai.com/index/introducing-swe-bench-verified/)
- [Why SWE-bench Verified no longer measures frontier coding capabilities (OpenAI, February 2026)](https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/)
- [SWE-bench Verified dataset on Hugging Face](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified)
- [SWE-b Annotation Instructions (PDF)](https://cdn.openai.com/introducing-swe-bench-verified/swe-b-annotation-instructions.pdf)
- [SWE-Bench Verified Leaderboard (llm-stats.com)](https://llm-stats.com/benchmarks/swe-bench-verified)
- [Epoch AI SWE-bench Verified analysis](https://epoch.ai/benchmarks/swe-bench-verified)
- [SWE-agent project](https://swe-agent.com/)
- [Latent Space, The End of SWE-Bench Verified](https://www.latent.space/p/swe-bench-dead)