SWE-bench Verified
Last reviewed
Apr 28, 2026
Sources
28 citations
Review status
Source-backed
Revision
v3 · 5,842 words
| SWE-bench Verified | |
|---|---|
| Overview | |
| Full name | Software Engineering Benchmark, Verified subset |
| Abbreviation | SWE-bench Verified |
| Description | A 500-task, human-validated subset of SWE-bench for evaluating AI agents on real-world GitHub issue resolution |
| Release date | 2024-08-13 |
| Latest version | 1.0 |
| Authors | OpenAI Preparedness team in collaboration with the Princeton SWE-bench team |
| Organization | OpenAI, Princeton University |
| Technical Details | |
| Type | Code generation, bug fixing, software engineering |
| Modality | Text, code |
| Task format | GitHub issue resolution with unit-test grading |
| Number of tasks | 500 (filtered from 1,699 candidates drawn from the original 2,294-instance SWE-bench) |
| Repositories | 12 popular Python projects (Django, SymPy, scikit-learn, Sphinx, Matplotlib, pytest, xarray, astropy, pylint, requests, seaborn, Flask) |
| Evaluation metric | Resolve rate (% Resolved); FAIL_TO_PASS and PASS_TO_PASS test grading |
| Languages | Python |
| Performance | |
| Initial baseline (Aug 2024) | 33.2% (GPT-4o + Agentless) |
| Public SOTA | 87.6% (Claude Opus 4.7, April 2026) |
| Internal SOTA reported | 93.9% (Claude Mythos Preview, Anthropic, 2026) |
| Status | Deprecated by OpenAI for frontier evaluation on February 23, 2026 |
| Resources | |
| Website | Official page |
| Announcement | OpenAI blog |
| Deprecation post | OpenAI blog (Feb 2026) |
| Dataset | Hugging Face |
| Annotation guide | SWE-b Annotation Instructions (PDF) |
| License | MIT |
| Predecessor | SWE-bench |
SWE-bench Verified is a human-validated subset of the SWE-bench software engineering benchmark, released on August 13, 2024 by OpenAI's Preparedness team in collaboration with the original Princeton SWE-bench authors. It contains 500 GitHub issues drawn from 12 popular open-source Python repositories, with every task individually reviewed by annotators from a pool of 93 contracted software developers to confirm that the problem description is unambiguous, the unit tests fairly grade correct solutions, and the issue is solvable in the time budget the harness allows. The Verified subset became the dominant reporting target for frontier coding models from late 2024 through early 2026, before OpenAI itself deprecated it on February 23, 2026 due to test flaws and training-data contamination.[1][2][3]
Where the original SWE-bench's 2,294 instances were noisy enough that small score differences were hard to interpret, Verified produced a clean leaderboard signal that top labs raced to climb. The first published number was 33.2% for GPT-4o paired with the Agentless scaffold in August 2024.[1] By April 2026, Claude Opus 4.7 had reached 87.6% on the public leaderboard, and Anthropic's internal Claude Mythos Preview had posted 93.9%, prompting widespread agreement that Verified had saturated.[4][5][6] OpenAI's deprecation post argued that gains above the high 80s were no longer meaningful: more than 59% of the hardest unsolved tasks in their audit had broken or unfair tests, and every frontier model tested could reproduce verbatim portions of the dataset.[3]
Despite the deprecation, SWE-bench Verified remains widely used as a sanity check, an instructional benchmark, and a reference point against which newer evaluations like SWE-bench Pro, SWE-Lancer, and SWE-bench Multimodal are calibrated. Its design choices, especially the 93-annotator review process and the FAIL_TO_PASS / PASS_TO_PASS grading scheme, set the template that successor benchmarks have refined rather than replaced.[1][3]
SWE-bench Verified shares its core mechanics with the original SWE-bench: each task pairs a real GitHub issue with the codebase state immediately before the human-written fix was merged, plus a set of unit tests that distinguish the buggy state from the fixed state. An AI agent is given the issue text and the repository, must produce a code patch, and is graded on whether the patch flips the failing tests to passing without breaking the previously passing tests. What distinguishes Verified is the curation layer on top.
The core differences from SWE-bench are quality, scale, and intended use. The original benchmark sampled all 2,294 issue-PR pairs the team could mine from 12 repositories. Verified shrinks the set to 500 instances that an experienced human annotator confirmed are well-specified, fairly tested, and solvable. The result is a benchmark where a passing patch can be confidently interpreted as a real fix rather than a coincidence with a thin or buggy test suite.[1]
Verified was released alongside the OpenAI Preparedness Framework's broader effort to measure dangerous autonomous capability in frontier models. The Preparedness team viewed software engineering ability as a leading indicator of model autonomy, and Verified was created so that the autonomy score would not be polluted by ambiguous tasks. The benchmark therefore plays a dual role in the AI safety and capabilities literature: it is both a standard product comparison and an input to OpenAI's risk assessments.[1][2]
When SWE-bench launched in October 2023, the headline result was that even the strongest available model, Claude 2 with BM25 retrieval, resolved only 1.96% of tasks. Through 2024, agent scaffolds like SWE-agent and commercial products like Devin drove that number above 13%, but the leaderboard was already showing strange behavior. Some tasks were trivially passable through patterns that did not actually fix the underlying bug. Others were essentially impossible because the test suite checked for behavior the issue had never specified.
A later analysis by the SWE-bench+ team at OpenLM.ai estimated that roughly 32.67% of "successful" patches on the original benchmark involved solution leakage in the issue text or comments, and that approximately 60% of resolved instances showed some form of leakage when broader cues were considered. After filtering, the SWE-agent + GPT-4 score dropped from 12.47% to 3.97%, a roughly threefold reduction. Independent reviewers had also flagged tests that rejected functionally correct solutions because they checked for the exact wording or formatting used in the gold patch.[1][7]
OpenAI's Preparedness team needed a more reliable measurement to feed into its Model Autonomy risk evaluations. The team also wanted a benchmark whose score could be cited in product launches without each lab having to caveat that the underlying tasks were noisy. Both goals pointed in the same direction: human-validate a subset of SWE-bench and publish a clean signal.
In early 2024, OpenAI Preparedness coordinated with the original SWE-bench authors at Princeton (Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan) on a curation effort. The goals were stated plainly in OpenAI's August 2024 announcement: to ensure problem statements are well-specified, to ensure unit tests do not unfairly mark valid solutions as incorrect, and to ensure development environments are stable enough that an agent's failures reflect agent limitations rather than infrastructure quirks.[1]
Verified inherits the upstream benchmark's MIT license and uses the same Docker harness and grading scheme, so existing SWE-bench tooling continues to work. The two teams also coordinated on the official princeton-nlp/SWE-bench_Verified Hugging Face dataset and on hosting a separate Verified leaderboard at swebench.com.[1][2]
OpenAI contracted 93 software developers experienced in Python to perform the annotation. Each annotator had to demonstrate working familiarity with the relevant repository ecosystems (web frameworks, scientific computing, data visualization). The pool was drawn from a vendor that supplied technical contractors to OpenAI for related Preparedness evaluations, and annotators were paid for their time rather than incentivized by quality bonuses, which the team argued kept ratings calibrated.[1][2]
From the 2,294-instance SWE-bench test set, OpenAI selected 1,699 random samples to label. Every sample was reviewed by three independent annotators using a structured rubric. The triple-review design was meant to surface disagreement on borderline cases without leaning on any single reviewer's judgment.[1][2]
Annotators rated each task along four axes. Three are scored on a 0 to 3 severity scale, where a score of 0 or 1 indicates a minor concern and a score of 2 or 3 means the sample should be discarded; the fourth is a flag for any other blocking issue:[1][8]
| Axis | Question being answered | Discard condition |
|---|---|---|
| Problem clarity | Is the GitHub issue well specified, with enough information for a developer to know what behavior is expected? | Severity 2 or 3 from any reviewer |
| Test fairness | Do the FAIL_TO_PASS and PASS_TO_PASS tests check the right thing without rejecting valid alternative solutions? | Severity 2 or 3 from any reviewer |
| Difficulty plausibility | Could an experienced developer realistically resolve the task within the time budget the harness allows? | Severity 2 or 3 from any reviewer |
| Other major issues | Any further blocker (broken environment, dependency drift, ambiguous spec) that would invalidate evaluation | Any reviewer flag |
Alongside the discard rubric, annotators estimated how long an experienced developer would need to decide on and implement the fix, given the cleaned issue text. Those estimates feed the difficulty distribution discussed below.[1][2]
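The discard rule in the rubric table can be sketched in a few lines. This is a minimal illustration, not the annotation tooling OpenAI used: the `Review` structure, the axis names, and the `should_discard` helper are all hypothetical, and the sketch assumes the first three axes carry a 0 to 3 severity while the fourth is a plain flag, as the table describes.

```python
from dataclasses import dataclass

# Illustrative labels for the three severity-scored rubric axes (assumed names).
SEVERITY_AXES = ("problem_clarity", "test_fairness", "difficulty_plausibility")


@dataclass
class Review:
    """One annotator's ratings for one task (hypothetical structure)."""
    severities: dict[str, int]       # axis name -> severity in 0..3
    other_major_issue: bool = False  # fourth axis: a plain blocker flag


def should_discard(reviews: list[Review]) -> bool:
    """Discard when any reviewer scores any axis 2 or 3, or raises the flag."""
    for review in reviews:
        if review.other_major_issue:
            return True
        # A missing axis is treated as severity 0 in this sketch.
        if any(review.severities.get(axis, 0) >= 2 for axis in SEVERITY_AXES):
            return True
    return False


clean = Review({"problem_clarity": 0, "test_fairness": 1, "difficulty_plausibility": 0})
strict = Review({"problem_clarity": 0, "test_fairness": 2, "difficulty_plausibility": 0})
print(should_discard([clean, clean, clean]))   # False
print(should_discard([clean, clean, strict]))  # True
```

Because a single severe rating is enough to discard, the rule is deliberately conservative: one reviewer's doubt outweighs two reviewers' approval.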
The annotation pass produced striking quality numbers. About 38.3% of the 1,699 candidates were flagged for underspecified problem statements, and 61.1% were flagged for unit tests that may unfairly mark valid solutions as incorrect. In total, roughly 68.3% of the original SWE-bench samples were judged inadequate and removed. The remaining 500 instances form SWE-bench Verified.[1][3][7]
| Filtering stage | Count | Notes |
|---|---|---|
| Original SWE-bench test set | 2,294 | Source for the candidate pool |
| Random sample for human review | 1,699 | Each reviewed by three annotators |
| Tasks failing problem-clarity rubric | ~651 (38.3%) | At least one reviewer flagged severity 2 or 3 |
| Tasks failing test-fairness rubric | ~1,038 (61.1%) | Categories overlap; many tasks failed multiple criteria |
| Tasks discarded for any reason | ~1,160 (68.3%) | Combined effect of all rubric axes |
| Final SWE-bench Verified | 500 | Released August 13, 2024 |
Using the annotators' time estimates, the SWE-bench Verified team published a difficulty breakdown. Most tasks are short, which has been both a feature (cheap to evaluate) and a criticism (does not capture multi-day projects).[1][9]
| Difficulty | Estimated time to solve | Number of tasks | Share of dataset |
|---|---|---|---|
| Easy | < 15 minutes | 196 | 39.2% |
| Medium | 15 minutes to 1 hour | 259 | 51.8% |
| Hard | 1 to 4 hours | ~42 | ~8.4% |
| Very hard | > 4 hours | 3 | 0.6% |
A companion analysis by Epoch AI confirmed the skew, finding that 91% of Verified tasks could be completed in under an hour and that the median gold patch changes only a handful of lines of code. Easy fixes average around 5 changed lines, while the longer tasks average closer to 50 changed lines. Only 14.2% of instances require edits to more than one file.[9][10]
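The shares in the difficulty table, and the 91% under-an-hour figure, follow directly from the task counts. The short check below recomputes them; the bucket names are illustrative, and the count for the hard bucket is the approximate value given in the table.

```python
# Task counts per difficulty bucket, copied from the difficulty table.
counts = {"easy": 196, "medium": 259, "hard": 42, "very_hard": 3}

total = sum(counts.values())

# Percentage share of each bucket, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Easy plus medium tasks are the ones estimated at under one hour.
under_one_hour = round(100 * (counts["easy"] + counts["medium"]) / total, 1)

print(total)           # 500
print(shares)
print(under_one_hour)  # 91.0
```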
Verified inherits the construction pipeline of SWE-bench but differs in size, quality, and reporting role. The table below summarizes the key differences.
| Property | Original SWE-bench | SWE-bench Verified |
|---|---|---|
| Release date | October 10, 2023 | August 13, 2024 |
| Lead organization | Princeton University | OpenAI Preparedness with Princeton |
| Number of tasks | 2,294 | 500 |
| Source repositories | 12 Python projects | Same 12 Python projects |
| Curation | None beyond automated test validation | 93 contracted annotators, triple review on 1,699 candidates |
| Discard rate | 0% (full set) | ~68.3% of reviewed candidates discarded |
| First published baseline | 1.96% (Claude 2, BM25 retrieval) | 33.2% (GPT-4o + Agentless scaffold) |
| Current public SOTA (Apr 2026) | Reported only on subsets | 87.6% (Claude Opus 4.7) |
| Status | Maintained as historical reference | Deprecated by OpenAI for frontier evals (Feb 2026) |
| Recommended successor | SWE-bench Pro, SWE-bench Live | SWE-bench Pro |
The core mechanical difference is curation. Same repositories, same harness, same FAIL_TO_PASS / PASS_TO_PASS grading. What changed was the confidence one could place in any given score. On Verified, an 80% resolve rate plausibly means the agent solved 400 problems whose fixes a panel of reviewers agreed on. On the full benchmark, an 80% resolve rate would have included tasks where the test suite let multiple non-equivalent patches pass, or where the issue text omitted information the agent had to infer from the gold patch.
Like the parent benchmark, Verified draws from 12 Python open-source repositories. Each task includes the issue title and body, the codebase at the pre-fix commit, the dependency manifest, and the FAIL_TO_PASS and PASS_TO_PASS test sets needed to grade a candidate patch.
| Repository | Domain | Approximate share of Verified |
|---|---|---|
| Django | Web framework | ~45% |
| SymPy | Symbolic mathematics | ~15% |
| Sphinx | Documentation generator | ~10% |
| Matplotlib | Data visualization | ~8% |
| scikit-learn | Machine learning library | ~7% |
| Flask | Web microframework | ~5% |
| Requests | HTTP library | ~3% |
| Pytest | Testing framework | ~2% |
| Astropy | Astronomy tools | ~2% |
| Xarray | N-D labeled arrays | ~1% |
| Seaborn | Statistical visualization | ~1% |
| Pylint | Code analysis | ~1% |
Django dominates because it has both a long history of high-quality issue tracking and a large, well-tested codebase that supplies many candidate tasks. The five largest repositories account for roughly 85% of the dataset, which mirrors the distribution in the parent SWE-bench. The skew has been criticized for over-indexing the benchmark on Django-flavored web framework idioms, though most of the strongest agent submissions report consistent performance across the repository mix.[1][9]
Verified uses the same Docker harness as the broader SWE-bench family. Each task is associated with a specific environment image that fixes the Python version and the exact dependency versions present at the issue's commit. This containerization is what lets different labs compare scores: without pinned environments, NumPy or pytest version drift alone could swing scores by several percentage points.[2][11]
A task is marked resolved if and only if every FAIL_TO_PASS test passes after applying the agent's patch and every PASS_TO_PASS test continues to pass. A patch that fixes the bug but breaks an unrelated regression test counts as a failure. A patch that does not apply cleanly because of whitespace drift also counts as a failure. The strictness of the grading is part of why Verified produces interpretable signals: an agent that achieves 80% has met both the bug-fix and the no-regression bars on 400 separate tasks.[1][2][11]
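The grading rule is simple enough to state as code. The sketch below is an illustration rather than the harness's actual implementation: `is_resolved` and `resolve_rate` are hypothetical helper names, and test results are assumed to arrive as a mapping from test name to pass/fail.

```python
def is_resolved(test_results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """Resolved only if every FAIL_TO_PASS and every PASS_TO_PASS test passes.

    A test missing from the results (for example because the patch did not
    apply and nothing ran) is counted as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_results.get(name, False) for name in required)


def resolve_rate(outcomes: list[bool]) -> float:
    """Share of tasks resolved, as a percentage."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)


# A patch that fixes the bug but breaks a regression test is still a failure.
results = {"test_bug": True, "test_existing": False}
print(is_resolved(results, ["test_bug"], ["test_existing"]))  # False

# 400 resolved tasks out of 500 gives an 80% resolve rate.
print(resolve_rate([True] * 400 + [False] * 100))  # 80.0
```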
The primary metric is % Resolved, the share of the 500 tasks for which the agent's patch passes the grading rule. Secondary metrics that became common in the late-2024 to early-2026 reporting cycle include:
From mid-2025 onward, several leaderboards (notably swebench.com, llm-stats.com, and Scale's labs.scale.com) added cost columns next to the headline accuracy figure, partly in response to criticism that high-cost reasoning agents were inflating reported scores without delivering proportional value.[4][5][12]
Verified scores climbed faster than almost any other coding benchmark. The progression below tracks the public state-of-the-art at major checkpoints from launch through early 2026.
| Date | Public best % Resolved | Model / agent | Organization | Notes |
|---|---|---|---|---|
| Aug 2024 | 33.2% | GPT-4o + Agentless | OpenAI | First published baseline at launch |
| Aug 2024 | ~30% | Devin v0 | Cognition Labs | Cognition's agent topped the early public Verified leaderboard |
| Sep 2024 | 45.0% | Previous SOTA agent (composite scaffolds) | Various | Pre-Claude 3.5 Sonnet plateau |
| Oct 2024 | 49.0% | Claude 3.5 Sonnet (new) | Anthropic | First single-model crossing of 49%[13] |
| Dec 2024 | 53.0% | OpenAI o1 + scaffold | OpenAI | Reasoning models begin to contribute |
| Feb 2025 | 62.3% | Claude 3.7 Sonnet | Anthropic | Extended thinking helps coding |
| Feb 2025 | 70.3% | Claude 3.7 Sonnet (extended thinking + custom scaffold) | Anthropic | First reported 70%+ Verified score |
| May 2025 | 64.93% | Claude Sonnet 4 | Anthropic | New Sonnet baseline |
| Aug 2025 | 74.5% | Claude Opus 4.1 | Anthropic | Headline number for many product launches |
| Aug 2025 | ~75% | GPT-5 (Codex scaffold) | OpenAI | OpenAI's first GPT-5 reporting on Verified |
| Nov 2025 | 80.9% | Claude Opus 4.5 | Anthropic | First confirmed 80%+ score |
| Feb 2026 | 85.0% | GPT-5.3 Codex | OpenAI | OpenAI's last headline Verified release |
| Feb 2026 | 80.6% | Gemini 3.1 Pro | Google | Google's tied best |
| Feb 2026 | 80.2% | MiniMax M2.5 | MiniMax | Open-weight tied score |
| Apr 2026 | 87.6% | Claude Opus 4.7 (1M context) | Anthropic | Public SOTA as of April 2026, after OpenAI's deprecation |
| Apr 2026 | 93.9% | Claude Mythos Preview | Anthropic | Internal cybersecurity model; not generally available |
The trajectory looks like a textbook capability ramp. From August 2024 to April 2026, the public state-of-the-art rose from roughly 33% to 87.6%, an increase of more than 54 percentage points in twenty months. Three forces drove the gain in roughly equal proportion: stronger base models (Claude 3.5 to 4.7, GPT-4o to GPT-5.3), better reasoning (extended thinking, chain-of-thought, self-verification), and more capable agent scaffolds (Agentless, SWE-agent, OpenHands, Claude Code).
In the days immediately after the August 13, 2024 release, the publicly visible Verified leaderboard was topped by Cognition Labs' Devin agent. Cognition had spent the first half of 2024 building Devin around the original SWE-bench full set and was in a strong position to evaluate against the curated subset on day one. Devin's early Verified results were in the high 20s to low 30s, comparable to the GPT-4o + Agentless baseline OpenAI published in the same week. This brief period mattered for the benchmark's perception: it showed that the leaderboard would be contested by both research labs and commercial agent vendors, and it set the template for marketing-quality SWE-bench reporting that subsequent product launches followed.[14][15]
The table below is a snapshot of the public SWE-bench Verified leaderboard from April 2026, sourced from the public swebench.com leaderboard, llm-stats.com aggregation, and lab-reported scores. Internal-only models like Claude Mythos Preview are excluded.[4][5]
| Rank | Model | Organization | % Resolved | Notes |
|---|---|---|---|---|
| 1 | Claude Opus 4.7 | Anthropic | 87.6% | 1M context, released April 16, 2026 |
| 2 | GPT-5.3 Codex | OpenAI | 85.0% | OpenAI's final Verified release before deprecation |
| 3 | Claude Opus 4.5 | Anthropic | 80.9% | First confirmed public 80%+ result |
| 4 | Claude Opus 4.6 | Anthropic | 80.8% | Mid-Q1 2026 release |
| 5 | Gemini 3.1 Pro | Google DeepMind | 80.6% | |
| 6 | MiniMax M2.5 | MiniMax | 80.2% | Open-weight |
| 7 | GPT-5.2 | OpenAI | 80.0% | Pre-Codex variant |
| 8 | Claude Sonnet 4.6 | Anthropic | 79.6% | Smaller, cheaper Anthropic model |
| 9 | Qwen3.6 Plus | Alibaba | 78.8% | Released April 2026 |
| 10 | Gemini 3 Flash | Google | 78.0% | Cheaper Google variant |
| 11 | MiMo-V2-Pro | Xiaomi | 78.0% | Open-source 1T-param model |
| 12 | GLM-5 | Zhipu AI | 77.8% | Open-source, 744B parameters |
| 13 | Muse Spark | Meta | 77.4% | Meta Superintelligence Labs flagship |
| 14 | Claude Sonnet 4.5 | Anthropic | 77.2% | Late-2025 release |
| 15 | Kimi K2.5 | Moonshot AI | 76.8% | Open-source |
The agent scaffolds that dominated the Verified leaderboard tended to combine an interactive shell or editor with a search tool, a test runner, and some form of self-verification. Several reference designs are worth singling out.
Agentless, developed jointly by researchers at the University of Illinois Urbana-Champaign and Princeton, eschews an autonomous agent loop in favor of a fixed three-stage pipeline: localize the bug, repair it, and validate the fix. OpenAI used Agentless paired with GPT-4o for the original 33.2% Verified baseline because the simple pipeline made it easy to attribute scores to model capability rather than scaffold engineering. Agentless remained the most popular reference for cheap, reproducible Verified runs through 2025.[1][16]
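The shape of such a fixed pipeline can be sketched as follows. This is not Agentless's code: `run_fixed_pipeline` and its three callables are hypothetical stand-ins, and the point is only that the three stages run in a set order with no agent loop choosing the next action.

```python
from typing import Callable, Optional


def run_fixed_pipeline(issue_text: str,
                       localize: Callable[[str], list[str]],
                       repair: Callable[[str, list[str]], list[str]],
                       validate: Callable[[str], bool]) -> Optional[str]:
    """Run localize -> repair -> validate once, in that order."""
    # Stage 1: narrow the issue down to suspect files or functions.
    suspect_locations = localize(issue_text)
    # Stage 2: generate candidate patches for those locations.
    candidate_patches = repair(issue_text, suspect_locations)
    # Stage 3: return the first candidate that passes validation, if any.
    for patch in candidate_patches:
        if validate(patch):
            return patch
    return None


# Toy stand-ins for the three stages (purely illustrative).
chosen = run_fixed_pipeline(
    "TypeError in parse()",
    localize=lambda issue: ["parser.py"],
    repair=lambda issue, locations: ["patch-a", "patch-b"],
    validate=lambda patch: patch == "patch-b",
)
print(chosen)  # patch-b
```

In practice each stage would be backed by model calls and test runs; keeping the control flow fixed is what makes scores easier to attribute to the model than to the scaffold.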
SWE-agent, the official Princeton agent paper published at NeurIPS 2024, introduced the Agent-Computer Interface (ACI): a curated set of shell commands that make repository navigation and editing easier for language models. SWE-agent's scaffolds powered many of the open-source Verified entries, including the Live-SWE-agent variant that reached 79.2% with Claude Opus 4.5 in late 2025.[17]
OpenHands consolidated several research scaffolds into a community-driven open-source platform supporting browser, shell, and editor tools. It became the de facto open-source counterweight to closed agent products like Devin and was widely used in academic Verified submissions. The framework supports multiple LLM backends and ships with reusable plugins for retrieval, multi-attempt sampling, and verifier-based selection.[18]
Aider, Paul Gauthier's open-source command-line coding assistant, runs benchmarks on Verified to compare different models on whole-file edits. It is mostly used as an interactive tool, but its public benchmark page played an outsized role in popularizing cost-adjusted reporting and the practice of comparing the same scaffold across many models.[19]
Claude Code, Anthropic's terminal-based agent introduced in early 2025, became the reference scaffold for Anthropic's own Verified numbers. Its bash tool, file editor, and computer-use tool let the underlying Claude model do most of the heavy lifting with a relatively shallow surrounding harness. Claude Code's design influenced how other labs approached agent engineering, especially the move toward giving the model more direct shell access rather than wrapping it in heavy intermediation.[20]
A second tier of agents has held steady positions in the mid-leaderboard. RA-Aid is a research agent built on top of OpenHands with a focus on retrieval-augmented planning. Moatless Tools is a lightweight scaffold popular with budget-conscious researchers. AutoCodeRover, developed at the National University of Singapore, uses spectrum-based fault localization to narrow the search space before patch generation. Each contributed important ideas (decoupled localization, deterministic patches, coverage-guided edits) that later commercial agents absorbed.[21][22]
| Agent / framework | Developer | Approach |
|---|---|---|
| Amazon Q Developer Agent | Amazon | Enterprise agent with AWS tooling |
| Atlassian Rovo Dev | Atlassian | Agentic coding inside Jira and Bitbucket |
| Cursor Composer | Anysphere | IDE-based agent with human-in-the-loop edits |
| Codex | OpenAI | Cloud-based agent in sandboxed environments |
| Augment Code | Augment | Context-aware agent for large codebases |
| iSWE-Agent | IBM | Java-focused agent used on Multi-SWE-bench |
| CodeR | Independent | Multi-agent design with role specialization |
Independent of any specific framework, the highest-scoring Verified entries through 2025 and 2026 tended to share a few patterns. They run multiple candidate patches and use a verifier (often an LLM-as-judge with regression testing) to pick the best one. They allow generous tool use (file search, AST queries, language servers, test runners) instead of restricting the agent to a fixed action set. They feed the test output back into the model in a tight loop so the agent can self-correct. And they explicitly handle whitespace and formatting drift in patch generation, which avoided the silent grading failures that plagued earlier submissions.
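The first of these patterns, sampling several patches and letting a verifier choose, can be sketched as below. The function and parameter names are hypothetical, and the two callables stand in for whatever regression runner and judge a given scaffold uses; no specific leaderboard entry is being reproduced.

```python
from typing import Callable, Optional, Sequence


def select_patch(candidates: Sequence[str],
                 passes_regression: Callable[[str], bool],
                 judge_score: Callable[[str], float]) -> Optional[str]:
    """Pick one patch from several candidates.

    Candidates that break the regression tests are dropped first; the judge
    then ranks the survivors and the highest-scoring one is returned.
    """
    survivors = [patch for patch in candidates if passes_regression(patch)]
    if not survivors:
        return None
    return max(survivors, key=judge_score)


# Toy example: "ccc" fails regression testing, and the judge prefers longer patches.
best = select_patch(
    ["a", "bb", "ccc"],
    passes_regression=lambda patch: patch != "ccc",
    judge_score=lambda patch: float(len(patch)),
)
print(best)  # bb
```

Filtering on regression tests before consulting the judge keeps the cheaper, more objective check in front of the more expensive and more subjective one.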
On February 23, 2026, OpenAI's Mia Glaese and Olivia Watkins published "Why SWE-bench Verified no longer measures frontier coding capabilities." The post argued that the benchmark had reached the end of its useful life as a frontier evaluation and recommended SWE-bench Pro as the new community standard.[3]
OpenAI's case rested on three pillars:
OpenAI recommended migrating to SWE-bench Pro, which uses 1,865 longer tasks across more diverse public, held-out, and commercial codebases. Pro tasks are explicitly chosen to require 1 to 4 hours of human effort or more, in contrast with Verified's bias toward sub-hour fixes. Initial Pro scores from frontier models ran 20 to 40 percentage points below their Verified scores, which OpenAI argued was the kind of measurement headroom that frontier evaluation requires.[3][24]
The deprecation post was discussed extensively on the Latent Space podcast in the same week, with Glaese and Watkins arguing that the move was about measurement validity rather than a critique of Verified's design. Anthropic, Google DeepMind, and Meta did not immediately stop reporting Verified numbers, but most subsequent product launches paired the Verified score with a Pro score. Several outlets (CodeSOTA, blockchain.news, Latent Space, marc0.dev) wrote retrospectives within weeks framing the deprecation as a generational shift in how the field measures coding ability.[3][6][23][24][25]
OpenAI's deprecation accelerated work on a family of successor benchmarks that aim to restore measurement validity. The most influential are summarized below.
| Successor | Released by | Released | Size | Why it matters |
|---|---|---|---|---|
| SWE-bench Pro | Scale AI | 2025 | 1,865 long-horizon tasks across public, held-out, and commercial codebases | OpenAI's recommended successor; longer tasks, less contamination |
| SWE-bench Multimodal | Princeton SWE-bench team | October 2024 | 619 JavaScript tasks with embedded images | Tests visual grounding (UI screenshots, error images, diagrams) |
| SWE-bench Multilingual | Princeton SWE-bench team | 2025 | 300 tasks across 9 languages (C, C++, Go, Java, JS, TS, PHP, Ruby, Rust) | Breaks the Python monoculture |
| Multi-SWE-bench | ByteDance Seed | 2025 (NeurIPS 2025 D&B) | 1,632 instances across 7 languages | Independent multilingual effort |
| SWE-bench Live | Microsoft Research | May 2025 | 1,565+ instances; updated monthly | Anti-contamination via post-cutoff issues |
| SWE-Lancer | OpenAI | February 2025 | 1,400+ Upwork tasks ($1M payouts) | Dollars-earned metric; freelance simulation |
| SWE-rebench | Independent | 2025 | Continuously updated Python set | Decontamination focus |
| SWE-bench+ | OpenLM.ai | October 2024 | Filtered subset of original SWE-bench | Removed leaked instances |
OpenAI launched SWE-Lancer in February 2025 to evaluate models on real freelance software engineering tasks scraped from Upwork and verified by Expensify. The benchmark covers more than 1,400 tasks with payouts totaling roughly $1 million in real dollars, ranging from $50 bug fixes to $32,000 feature implementations. Tasks split between independent contributor (IC) work and managerial decisions over technical proposals. Initial reported scores were modest: Claude 3.5 Sonnet completed 26.2% of IC tasks and 44.9% of managerial tasks, while OpenAI's o1 reached 20.3% and 46.3% respectively. The dollars-earned metric resonated outside the academic community and made SWE-Lancer popular with industry analysts.[26]
Introduced in October 2024 (arXiv:2410.03859), SWE-bench Multimodal extends the harness to JavaScript repositories with image-bearing issues. The dataset contains 619 task instances drawn from 17 user-facing JavaScript repositories, with 862 images embedded across the problem statements. Image categories include code screenshots (194 instances), diagrams (107), error messages (54), digital art (38), maps (35), and data visualizations (28). Multimodal scores have lagged Verified scores by a wide margin because the task requires both visual grounding and code reasoning, and JavaScript test frameworks (Jest, Mocha, Playwright) make patch validation more involved than the Python set.[27]
OpenAI's deprecation audit confirmed earlier independent findings that Verified's test suites are not always trustworthy. An empirical study published as arXiv:2503.15223 reported that more than 15% of Verified instances have incomplete test patches that allow incorrect or partial solutions to pass. Specifically, 12.50% of passing patches were judged functionally or semantically incorrect, and 9.82% were incomplete. Frameworks like UTBoost and PatchDiff suggested leaderboard scores may be inflated by 6 to 7 percentage points due to test inadequacies. OpenAI's own audit reported even higher rates of broken specifications among the hardest unsolved tasks.[3][28]
Verified inherits the parent benchmark's contamination problem. More than 94% of SWE-bench issues were filed before the knowledge cutoff dates of major pre-trained language models. Subsequent audits, including the OpenLM.ai SWE-bench+ analysis and OpenAI's February 2026 study, demonstrated that frontier models could regenerate parts of the gold patch or problem statement when prompted with only a Task ID. Even SWE-bench Live's monthly refresh did not entirely solve the problem because tasks rotate into model training data on a similar timescale.[3][7]
The 12 source repositories are all open-source Python projects with strong test cultures. Many real-world codebases have sparse tests, proprietary dependencies, or architectural patterns not represented in this set. As a result, high Verified scores do not necessarily predict performance on arbitrary production codebases. The Django dominance in particular gives the dataset a web framework flavor that is over-represented compared with the broader software ecosystem.[9][10]
Epoch AI's analysis showed that the majority of Verified tasks are relatively simple. About 91% can be completed by a human in under one hour and 39.2% in under 15 minutes. The benchmark therefore primarily measures an agent's ability to fix straightforward bugs rather than tackle architectural changes or large feature implementations. This skew is the single biggest reason successor benchmarks like SWE-bench Pro and SWE-Lancer reset the difficulty floor.[9][10]
Verified is Python-only. Performance on Verified does not reliably generalize to JavaScript, Java, C++, Go, or Rust. SWE-bench Multilingual, Multi-SWE-bench, and SWE-bench Live partially address this gap, but Verified's outsized leaderboard role meant the field's headline metric was Python-bound through early 2026.[3][27]
A full Verified evaluation is resource-intensive: at minimum 120 GB of free disk space, 16 GB of RAM, and 8 CPU cores, with 32 GB recommended for parallel execution. Cloud evaluation through Modal removes the local hardware burden but introduces direct dollar cost. A reasoning-heavy agent run with Claude Opus 4.7 and 200K thinking tokens per task can spend $10 to $20 per attempt, which makes top-of-leaderboard reproductions expensive even for well-funded labs.[2][11]
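To see why reproductions are expensive, the per-attempt figure can be scaled to the full 500-task set. The calculation below is a rough sketch that assumes one attempt per task by default and uses the $10 to $20 range quoted above; `full_run_cost` is an illustrative helper, and compute or storage costs are ignored.

```python
NUM_TASKS = 500  # size of SWE-bench Verified


def full_run_cost(cost_per_attempt_usd: float, attempts_per_task: int = 1) -> float:
    """Estimated model spend for one pass over all tasks, in US dollars."""
    return NUM_TASKS * cost_per_attempt_usd * attempts_per_task


low = full_run_cost(10.0)
high = full_run_cost(20.0)
print(low, high)  # 5000.0 10000.0

# Sampling several candidate patches per task multiplies the bill accordingly.
print(full_run_cost(20.0, attempts_per_task=5))  # 50000.0
```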
The most consequential limitation is the one OpenAI cited in February 2026: with public scores in the high 80s and Anthropic's internal Claude Mythos Preview at 93.9%, the gap between frontier models and the score ceiling is now within harness noise. Verified can still discriminate between weaker models, but the lab competition that defined the late-2024 to late-2025 era has effectively ended.[3]