| SWE-bench | |
|---|---|
| Overview | |
| Full name | Software Engineering Benchmark |
| Abbreviation | SWE-bench |
| Description | A benchmark for evaluating large language models and AI agents on real-world software engineering tasks from GitHub |
| Release date | 2023-10-10 |
| Latest variants | SWE-bench Live (monthly refresh), SWE-bench Pro (Scale AI) |
| Authors | Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan |
| Organization | Princeton University, University of Chicago, Stanford University |
| Technical Details | |
| Type | Software Engineering, Code Generation, Bug Fixing |
| Modality | Text, Code |
| Task format | Issue resolution, Code editing |
| Number of tasks | 2,294 |
| Total examples | 2,294 (Full), 500 (Verified), 300 (Lite), 619 (Multimodal), 1,565 (Live), 1,865 (Pro) |
| Evaluation metric | % Resolved, Test Pass Rate |
| Domains | Software Engineering, Python Programming, Open Source Development |
| Languages | Python (Original/Verified/Lite), JavaScript (Multimodal), 9+ languages (Multilingual/Live/Pro) |
| Performance | |
| Baseline (Oct 2023) | 1.96% (Claude 2 with BM25) |
| SOTA (Verified) | 87.6% (Claude Opus 4.7) |
| SOTA date | 2026-04 |
| Saturated | Yes (per OpenAI February 2026) |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | MIT License |
**SWE-bench** (Software Engineering Benchmark) is a comprehensive benchmark designed to evaluate large language models and AI agents on their ability to solve real-world software engineering tasks. Released on October 10, 2023, by researchers at Princeton University, SWE-bench tests whether AI systems can autonomously resolve genuine GitHub issues from popular open-source Python repositories. The benchmark was published as a conference paper at the International Conference on Learning Representations (ICLR) 2024, where it received an oral presentation, one of the highest distinctions at that venue.[1][2]
SWE-bench has become the de facto standard for evaluating AI-powered software engineering capabilities. With over 2 million downloads from Hugging Face and adoption by leading AI research organizations worldwide, it has shaped how the industry measures progress in autonomous coding. The benchmark's focus on real bug reports and feature requests from production codebases sets it apart from earlier coding benchmarks like HumanEval, which test isolated function-level problems.[1][3]
Progress on SWE-bench has been extraordinary. The original 2023 paper reported a peak resolution rate of 1.96% for Claude 2 with BM25 retrieval, leading the authors to conclude that frontier models could only solve the simplest issues. By April 2026, Anthropic's Claude Opus 4.7 reached 87.6% on SWE-bench Verified, a roughly 45-fold improvement in two and a half years.[18][19] The pace of gains, combined with later evidence of leakage and contamination, prompted OpenAI to formally deprecate SWE-bench Verified for frontier evaluations in February 2026 and recommend SWE-bench Pro instead.[20][21]
Before SWE-bench, AI code generation benchmarks largely consisted of self-contained programming challenges. HumanEval, introduced by OpenAI in 2021, asked models to complete standalone Python functions. MBPP (Mostly Basic Python Programming) followed a similar pattern. While useful for measuring raw code synthesis ability, these benchmarks did not capture the complexity of professional software engineering, where developers must navigate large codebases, understand interconnected modules, read issue trackers, and ensure that changes do not break existing functionality.
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan at Princeton University recognized this gap. Their core insight was that open-source repositories on GitHub contain a natural record of software engineering tasks in the form of issues and pull requests. Each merged pull request that resolves an issue represents a verified solution to a real problem, complete with test cases that validate correctness. By collecting these issue-PR pairs and replaying them in controlled environments, the team constructed a benchmark that mirrors actual developer workflows.[1]
The paper posed a direct question in its title: "Can Language Models Resolve Real-World GitHub Issues?" At the time of release, the answer was sobering. The best model, Claude 2 by Anthropic, resolved only 1.96% of tasks when given files retrieved through BM25, a standard information retrieval algorithm. Even with oracle retrieval, where the model received the exact files that needed editing, performance topped out at just 4.80%.[1] Fine-tuned versions of CodeLlama-7B and CodeLlama-13B, branded SWE-Llama, performed comparably or slightly worse, despite being trained on the SWE-bench-train companion set. The picture in late 2023 was clear: frontier LLMs understood Python syntax but struggled to operate inside production repositories.
The SWE-bench dataset draws from 12 popular open-source Python repositories chosen for their maturity, active maintenance, and comprehensive test suites. The repositories span diverse software domains including web frameworks, scientific computing, data visualization, testing tools, and utility libraries.[1]
| Repository | Domain | Task Instances | Percentage |
|---|---|---|---|
| django/django | Web framework | 850 | 37.1% |
| sympy/sympy | Symbolic mathematics | 386 | 16.8% |
| scikit-learn/scikit-learn | Machine learning | 229 | 10.0% |
| sphinx-doc/sphinx | Documentation generator | 187 | 8.2% |
| matplotlib/matplotlib | Data visualization | 184 | 8.0% |
| pytest-dev/pytest | Testing framework | 119 | 5.2% |
| pydata/xarray | Labeled array data | 110 | 4.8% |
| astropy/astropy | Astronomy library | 95 | 4.1% |
| pylint-dev/pylint | Code linter | 57 | 2.5% |
| psf/requests | HTTP library | 44 | 1.9% |
| mwaskom/seaborn | Statistical visualization | 22 | 1.0% |
| pallets/flask | Web microframework | 11 | 0.5% |
| Total | | 2,294 | 100% |
Django accounts for the largest share (37.1%) because it has an extensive issue tracker and long development history. Flask contributes the fewest instances (11) due to its smaller codebase and less frequent issue activity. The 12 repositories together contain hundreds of thousands of files; even after retrieval, an agent typically operates inside a working tree of 5,000 to 50,000 source files.[1]
Each task instance in SWE-bench is derived from a pull request that resolves one or more GitHub issues. The construction pipeline follows these steps (a sketch of the test classification appears after the list):[1]
1. Pull request collection: The team scraped merged pull requests from each repository that explicitly referenced resolved issues.
2. Test identification: For each PR, the pipeline identifies "FAIL_TO_PASS" tests, which fail on the pre-fix codebase but pass after the fix is applied. These tests validate that the PR actually addresses the described issue.
3. Regression test extraction: The pipeline also identifies "PASS_TO_PASS" tests, which pass both before and after the fix. These ensure that a proposed solution does not introduce regressions.
4. Environment snapshotting: Each task records the exact repository commit state before the fix, along with the dependency versions and Python version needed to reproduce the environment.
5. Validation: Every task instance is validated by running both FAIL_TO_PASS and PASS_TO_PASS tests against the gold patch (the actual PR diff) to confirm that the tests behave as expected.
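The two test categories in steps 2 and 3 can be derived mechanically by running the repository's test suite before and after applying the gold patch. A minimal sketch of that classification logic, assuming a pre-checked-out working tree and a run_test helper (neither taken from the official harness):

```python
import subprocess
from pathlib import Path

def run_test(repo_dir: Path, test_id: str) -> bool:
    """Run one test by its pytest node ID; True means it passed."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-x", test_id],
        cwd=repo_dir, capture_output=True,
    )
    return result.returncode == 0

def classify_tests(repo_dir: Path, gold_patch: str, candidate_tests: list[str]) -> dict:
    """Split candidate tests into FAIL_TO_PASS and PASS_TO_PASS.
    Assumes repo_dir is checked out at the pre-fix base commit."""
    before = {t: run_test(repo_dir, t) for t in candidate_tests}

    # Apply the human-written fix (the gold patch) and re-run the same tests.
    subprocess.run(["git", "apply", "-"], input=gold_patch.encode(),
                   cwd=repo_dir, check=True)
    after = {t: run_test(repo_dir, t) for t in candidate_tests}

    return {
        # Tests that demonstrate the bug: failing before the fix, passing after.
        "FAIL_TO_PASS": [t for t in candidate_tests if not before[t] and after[t]],
        # Regression guards: passing both before and after the fix.
        "PASS_TO_PASS": [t for t in candidate_tests if before[t] and after[t]],
    }
```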
The team also released SWE-bench-train, a companion training set comprising approximately 19,000 non-testing task instances drawn from 37 repositories, giving researchers a larger pool for fine-tuning experiments without contaminating the evaluation set.[1]
A notable property of the construction is that every issue is paired with a real human solution. This gold patch defines what "correct" means at the file level (which files were changed, how many lines were added or removed) and at the behavioral level (which tests now pass). Because the human solution exists, SWE-bench can score agents end-to-end without relying on stylistic similarity metrics like BLEU; the unit tests are the judge.
In the original evaluation setup, the benchmark uses two approaches for providing code context to the model (an illustrative retrieval sketch follows the list):[1]
- BM25 retrieval: The Pyserini BM25 retriever selects relevant files from the repository based on the issue text, under a context budget of 27,000 tokens (measured with OpenAI's cl100k_base tokenizer). In roughly 40% of instances, BM25 retrieves a superset of the files that actually need editing; in nearly half of cases, it fails to retrieve any of the needed files.
- Oracle retrieval: The model receives the exact files modified in the gold patch, providing an upper bound on performance when file localization is perfect.
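For illustration, a BM25 retrieval pass with a fixed token budget can be approximated with off-the-shelf packages; the sketch below uses rank_bm25 and tiktoken rather than the Pyserini setup from the paper, and packs the 27,000-token budget greedily:

```python
import tiktoken
from rank_bm25 import BM25Okapi

def retrieve_context(issue_text: str, files: dict[str, str],
                     budget_tokens: int = 27_000) -> dict[str, str]:
    """Rank repository files against the issue text with BM25 and pack
    the highest-scoring files into a fixed cl100k_base token budget."""
    paths = list(files)
    corpus = [files[p].split() for p in paths]        # naive whitespace tokenization
    bm25 = BM25Okapi(corpus)
    scores = bm25.get_scores(issue_text.split())

    enc = tiktoken.get_encoding("cl100k_base")        # tokenizer used for the budget
    selected, used = {}, 0
    for _, path in sorted(zip(scores, paths), reverse=True):
        cost = len(enc.encode(files[path]))
        if used + cost > budget_tokens:
            continue                                  # skip files that would overflow the budget
        selected[path] = files[path]
        used += cost
    return selected
```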
Modern agent-based approaches have largely moved beyond static retrieval, instead allowing the AI to interactively browse the repository, search for definitions, and navigate the codebase. Tools like ripgrep, language-server queries, and AST-aware code search have become standard components of agent toolkits, and modern context windows of 200K to 1M tokens have made it practical to load substantial slices of a repository at once.
SWE-bench uses Docker containers to ensure reproducible and isolated evaluation. The evaluation harness builds images in three layers:[3][10]
- Base images containing the operating system and common tooling shared by all tasks.
- Environment images that add the repository-specific dependency stack (Python version, pinned packages) for a given repository and version.
- Instance images that check out the exact pre-fix commit for a single task on top of the matching environment image.
This layered approach minimizes build time while ensuring that each task runs in an environment identical to the one where the original bug was observed and fixed. Without containerization, comparing scores across labs would be nearly impossible because Python dependency resolution is famously fragile, and a single difference in NumPy or pandas versions can change which tests pass.
The standard evaluation flow consists of these steps:[3]
1. Check out the repository at the task's recorded base commit inside the instance container.
2. Apply the agent's proposed patch with git apply or equivalent tooling.
3. Run the instance's FAIL_TO_PASS and PASS_TO_PASS tests against the patched codebase.
4. Mark the instance as resolved only if every FAIL_TO_PASS test passes and no PASS_TO_PASS test regresses.
A patch that addresses the issue but breaks an unrelated test counts as a failure, which discourages over-aggressive refactors. A patch that does not apply cleanly (because of formatting drift or whitespace) also counts as a failure; submissions are expected to produce patches that are syntactically valid against the exact source state recorded in the task.
The primary metric is % Resolved, the percentage of task instances where the agent's patch causes all FAIL_TO_PASS tests to pass without breaking any PASS_TO_PASS tests; the community also tracks secondary metrics such as per-task cost, discussed below.[3]
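Concretely, the headline metric reduces to a per-instance boolean and an average. A minimal sketch, assuming each instance dictionary carries FAIL_TO_PASS and PASS_TO_PASS test lists as in the published dataset and that run_test is supplied by the caller:

```python
def is_resolved(instance: dict, run_test) -> bool:
    """A task counts as resolved only if every FAIL_TO_PASS test now passes
    and no PASS_TO_PASS test regresses. The model patch is assumed to have
    been applied already; patches that fail to apply count as unresolved."""
    fail_to_pass_ok = all(run_test(t) for t in instance["FAIL_TO_PASS"])
    pass_to_pass_ok = all(run_test(t) for t in instance["PASS_TO_PASS"])
    return fail_to_pass_ok and pass_to_pass_ok

def percent_resolved(per_instance: list[bool]) -> float:
    """Headline metric: percentage of task instances fully resolved."""
    return 100.0 * sum(per_instance) / len(per_instance)
```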
Cost reporting has become an increasingly important secondary metric. Two agents at 70% resolution rate are not equivalent if one consumes $0.50 per task and the other $15. Recent agent papers commonly publish per-instance dollar cost alongside the headline accuracy figure.
In January 2025, the SWE-bench team introduced cloud-based evaluation through an integration with Modal, removing the need for local Docker infrastructure. Researchers can run evaluations entirely in the cloud by installing the swebench[modal] package and setting the --modal true flag.[10]
For leaderboard submissions, the team released sb-cli, a command-line tool that standardizes the submission process. After authenticating with sb login, researchers submit predictions using sb submit --predictions <path>, and the evaluation runs on centralized infrastructure to ensure consistent and reproducible results.[10]
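Predictions are keyed by instance ID and carry the generated diff. A minimal sketch of writing a predictions file in the JSONL layout commonly used by the harness (the exact field names are an assumption here and should be checked against the current documentation):

```python
import json

def write_predictions(path: str, patches: dict[str, str], model_name: str) -> None:
    """Write one JSON object per line: instance ID, model name, and unified diff."""
    with open(path, "w") as f:
        for instance_id, diff in patches.items():
            record = {
                "instance_id": instance_id,          # e.g. "django__django-11099"
                "model_name_or_path": model_name,
                "model_patch": diff,                 # unified diff, applied with git apply
            }
            f.write(json.dumps(record) + "\n")

# Hypothetical usage; the patch content is illustrative, not a real solution:
# write_predictions("preds.jsonl", {"django__django-11099": "diff --git a/..."}, "my-agent")
```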
The SWE-bench team and outside contributors have produced a family of related datasets that share the same evaluation harness but differ in size, language coverage, contamination resistance, and difficulty. The table below summarizes the major variants.
| Variant | Released | Size | Languages | Notes |
|---|---|---|---|---|
| SWE-bench (Full) | Oct 2023 | 2,294 | Python | Original benchmark; 12 repositories |
| SWE-bench Lite | Mar 2024 | 300 | Python | Cheaper subset; quick iteration |
| SWE-bench Verified | Aug 2024 | 500 | Python | Human-curated by 93 reviewers; OpenAI collaboration |
| SWE-bench Multimodal | Oct 2024 | 619 | JavaScript | Issues with images; UI bugs |
| SWE-bench Multilingual | 2025 | 300 | 9 languages | C, C++, Go, Java, JS, TS, PHP, Ruby, Rust |
| SWE-bench Live | May 2025 | 1,565 (and growing) | 8+ languages | Monthly refresh; anti-contamination |
| Multi-SWE-bench | 2025 | 1,632 | 7 languages | ByteDance Seed; NeurIPS 2025 D&B |
| SWE-bench Pro | 2025 | 1,865 | Multiple | Scale AI; commercial codebases |
| SWE-bench+ | Oct 2024 | Filtered subset | Python | OpenLM.ai; leakage removed |
| SWE-rebench | 2025 | Continuously updated | Python | Decontamination focus |
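The Python-centric variants are distributed as Hugging Face datasets sharing a common instance schema. A minimal loading sketch with the datasets library; the dataset identifiers and field names follow the commonly published princeton-nlp releases but should be verified against the current hub listings:

```python
from datasets import load_dataset

# Each instance pairs an issue ("problem_statement") with the repository state
# ("repo", "base_commit") and the test lists used for grading.
verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

example = verified[0]
print(example["instance_id"], example["repo"], example["base_commit"])
print(example["problem_statement"][:200])  # the GitHub issue text shown to the agent
```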
SWE-bench Lite is a curated subset of 300 instances selected for more efficient evaluation. The instances focus on self-contained, functional bug fixes that can be resolved with targeted code changes, making the subset well-suited for rapid prototyping and iterative development of new agent architectures. SWE-bench Lite has become popular in the research community because a full evaluation run can be completed in a fraction of the time required for the full benchmark.[5]
Despite its name, Lite is not trivial. As of April 2026, the leading score on SWE-bench Lite was 62.7% by Claude Opus 4.6, with MiniMax M2.5 in second at 56.3%, well short of the 80%+ scores recorded on the larger Verified set. The gap is partly because Lite was constructed from the original 2,294-instance pool and includes some of the same noisy task descriptions that the Verified curation effort later filtered out.[22]
Released on August 13, 2024, in collaboration with OpenAI Preparedness, SWE-bench Verified contains 500 instances that were individually reviewed by 93 experienced software developers. Each task was checked to confirm that the problem description was clear, the solution was unambiguous, the test coverage was adequate, and the difficulty was reasonable. By filtering out noisy or poorly specified tasks, Verified provides a more reliable signal of agent capability and has become the most widely cited variant on leaderboards.[4]
The curation effort screened 1,699 candidate problems, with each problem reviewed by three experts independently. About 38.3% of samples were flagged for underspecified problem statements, and 61.1% were flagged for unit tests that may unfairly mark valid solutions as incorrect. In total, roughly 68.3% of original SWE-bench samples were removed because of underspecification, unfair tests, or other issues, leaving the curated 500. The first official baseline reported by OpenAI on Verified was 33.2% for GPT-4o paired with the Agentless scaffold, a substantial jump compared with the few-percent scores typical on the full benchmark.[4][23]
According to an analysis by Epoch AI, 39% of SWE-bench Verified tasks are "trivial changes" requiring fewer than 15 minutes of human effort, while 52% are "small changes" estimated at 15 minutes to one hour. Only about 8% fall into the "1 to 4 hour" range, and just three instances were estimated to require more than four hours. Quick fixes average around 5 changed lines of code, while the longer tasks average roughly 50 changed lines.[11]
Verified became the default reporting standard for nearly every frontier coding model release between late 2024 and early 2026. Anthropic's Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude Sonnet 4.5, Claude Opus 4.5, Claude Opus 4.6, and Claude Opus 4.7 all reported headline numbers on Verified, as did OpenAI's GPT-4o, o1, o3, GPT-5, and GPT-5.3 Codex; Google's Gemini 2.5 Pro and Gemini 3 family; DeepSeek-V4; MiniMax M2 and M2.5; Moonshot's Kimi K2; Alibaba's Qwen series; and Meta's Muse Spark.[18]
Introduced in October 2024 (arXiv:2410.03859), SWE-bench Multimodal extends the benchmark to tasks where the issue report contains visual elements. The dataset comprises 619 task instances drawn from 17 user-facing JavaScript repositories, covering domains such as web UI design, data visualization, digital art, and mapping.[6]
Across all task instances, there are 862 images embedded in problem statements. These include code screenshots (194 instances), diagrams (107), error messages (54), digital art (38), maps (35), and data visualizations (28). The variant tests whether AI systems can ground visual cues to specific codebase entities, for example recognizing a rendering bug from a screenshot and tracing it to the responsible CSS or JavaScript code.[6]
Multimodal scores have lagged Verified by a wide margin because the task requires both vision and code reasoning, and because the JavaScript repositories use different testing frameworks (Jest, Mocha, Playwright) than the Python set, making patch validation harder.
SWE-bench Multilingual extends the evaluation paradigm beyond Python to 9 programming languages: C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, and Rust. It contains 300 tasks across 42 repositories and follows the same collection strategy, dataset format, and evaluation protocol as the original benchmark. The deliberately smaller size ensures that evaluations remain quick to run.[12]
Developed by ByteDance Seed and accepted to the NeurIPS 2025 Datasets and Benchmarks track, Multi-SWE-bench is a separate multilingual effort containing 1,632 high-quality instances across Java, TypeScript, JavaScript, Go, Rust, C, and C++. The instances were carefully annotated from 2,456 candidates by 68 expert annotators, giving it broader coverage than SWE-bench Multilingual while maintaining high quality standards.[13]
SWE-bench Live, developed by Microsoft Research, addresses data contamination concerns by restricting the dataset to issues created after January 2024. Because these issues postdate the training cutoffs of most models in circulation when the benchmark launched, they provide a contamination-free evaluation signal. The platform now contains 1,565 task instances spanning 164 repositories across Python, C, C++, C#, Java, Go, JavaScript and TypeScript, and Rust, with both Linux and Windows runners. The dataset is updated monthly, adding 50 newly verified, high-quality issues each cycle.[7][24]
A lite subset of SWE-bench Live samples 50 instances per month from October 2024 to March 2025, yielding a compact set of 300 instances that balances recency, diversity, and evaluation efficiency. To guard against test flakiness, the validation process is repeated multiple times and only instances with consistent results across all runs are retained.[7]
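The flakiness filter amounts to repeating validation and keeping only instances whose outcomes are identical across runs. A minimal sketch, assuming a caller-supplied run_validation function:

```python
def is_stable(run_validation, n_runs: int = 3) -> bool:
    """Keep an instance only if repeated validation runs agree exactly.

    run_validation() is assumed to execute the instance's FAIL_TO_PASS and
    PASS_TO_PASS tests once and return a frozenset of the tests that passed.
    """
    outcomes = {run_validation() for _ in range(n_runs)}
    return len(outcomes) == 1  # any disagreement marks the instance as flaky
```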
Introduced by Scale AI in 2025, SWE-bench Pro is designed to be a more rigorous benchmark that better reflects real-world software engineering difficulty. It expands to 1,865 long-horizon tasks across public, held-out, and commercial codebases. Tasks are explicitly chosen to require 1 to 4 hours of human effort or more, in contrast with Verified's bias toward sub-hour fixes. Top models score around 23% on Pro's public set in early reports and roughly 56% to 64% by April 2026, well below their Verified scores.[14][20]
The gap between Verified and Pro is widely cited as evidence that Verified scores are no longer informative for frontier evaluation. Models that score around 80% on Verified routinely drop to the 23%-to-46% range on Pro, with even the strongest April 2026 systems landing in the mid-60s.[20][25]
SWE-bench+ is a third-party effort by OpenLM.ai that filters the original benchmark to remove instances showing signs of solution leakage. Its analyses, discussed in the criticism section below, helped seed the broader contamination conversation. SWE-rebench is a separate contamination-resistant evaluation platform that publishes its own continuously updated leaderboard and uses different sampling strategies than SWE-bench Live.[16][26]
Performance on SWE-bench has improved dramatically since the benchmark's release, reflecting rapid advances in AI coding capabilities. The headline figures for the Verified subset (where consistent comparison is possible from August 2024 onward) are summarized below.
| Date | Best % Resolved | Leading model / agent | Key milestone |
|---|---|---|---|
| October 2023 | 1.96% (Full) | Claude 2 (BM25 retrieval) | Benchmark release |
| March 2024 | 13.86% (Full) | Devin (Cognition Labs) | First commercial agent above 10% |
| April 2024 | 12.47% (Full) | SWE-agent + GPT-4 | Open-source agent baseline |
| August 2024 | 33.20% (Verified) | GPT-4o + Agentless | Verified launches |
| October 2024 | 49.0% (Verified) | Claude 3.5 Sonnet (new) | Anthropic crosses 49% |
| December 2024 | 53.0% (Verified) | OpenAI o1 + scaffold | Reasoning models gain ground |
| February 2025 | 62.3% (Verified) | Claude 3.7 Sonnet | Extended thinking helps |
| May 2025 | 64.93% (Verified) | Claude Sonnet 4 | 60% barrier broken cleanly |
| August 2025 | 74.5% (Verified) | Claude Opus 4.1 | New SOTA |
| November 2025 | 80.9% (Verified) | Claude Opus 4.5 | First confirmed 80%+ score |
| February 2026 | 85.0% (Verified) | GPT-5.3 Codex | OpenAI peaks before deprecation |
| April 2026 | 87.6% (Verified) | Claude Opus 4.7 | Current public SOTA |
The trajectory is striking. In just over thirty months, the best score rose from 1.96% to 87.6%, a roughly 45-fold improvement. The pace was driven by three concurrent advances: stronger base models, better agent scaffolds, and richer tool use. Each leap on the table above can usually be attributed to one of those three.
The public SWE-bench Verified leaderboard remains in flux because many labs have stopped submitting after OpenAI's February 2026 deprecation announcement. The April 2026 snapshot below combines submissions to swebench.com, llm-stats.com, and lab-reported scores. Anthropic's Claude Mythos Preview, an internal cybersecurity-focused model that the company has stated will not be made generally available, posted 93.9% but is excluded from the table because it is not a public release.[18][19][27]
| Rank | Model / agent | Organization | % Resolved | Date |
|---|---|---|---|---|
| 1 | Claude Opus 4.7 | Anthropic | 87.6% | 2026-04 |
| 2 | GPT-5.3 Codex | OpenAI | 85.0% | 2026-03 |
| 3 | Claude Opus 4.5 | Anthropic | 80.9% | 2026-03 |
| 4 | Claude Opus 4.6 | Anthropic | 80.8% | 2026-03 |
| 5 | Gemini 3.1 Pro | Google | 80.6% | 2026-02 |
| 5 | DeepSeek-V4-Pro-Max | DeepSeek | 80.6% | 2026-02 |
| 7 | MiniMax M2.5 | MiniMax | 80.2% | 2026-02 |
| 7 | Kimi K2.6 | Moonshot AI | 80.2% | 2026-03 |
| 9 | GPT-5.2 | OpenAI | 80.0% | 2026-02 |
| 10 | Claude Sonnet 4.6 | Anthropic | 79.6% | 2026-03 |
| 11 | DeepSeek-V4-Flash-Max | DeepSeek | 79.0% | 2026-02 |
| 12 | Qwen3.6 Plus | Alibaba | 78.8% | 2026-04 |
| 13 | MiMo-V2-Pro | Xiaomi | 78.0% | 2026-03 |
| 13 | Gemini 3 Flash | Google | 78.0% | 2026-02 |
| 15 | GLM-5 | Zhipu AI | 77.8% | 2026-04 |
The corresponding April 2026 snapshot for SWE-bench Pro (public set):
| Rank | Model / agent | Organization | % Resolved |
|---|---|---|---|
| 1 | Claude Opus 4.7 | Anthropic | 64.3% |
| 2 | GPT-5.4 (xHigh) | OpenAI | 59.1% |
| 3 | GPT-5.3 Codex | OpenAI | 56.8% |
| 4 | GPT-5.2 Codex | OpenAI | 56.4% |
| 5 | GPT-5.2 | OpenAI | 55.6% |
And for SWE-bench Lite:
| Rank | Model / agent | Organization | % Resolved |
|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 62.7% |
| 2 | MiniMax M2.5 | MiniMax | 56.3% |
| 3 | Claude Sonnet 4.6 | Anthropic | 55.0% |
| 4 | GPT-5.3 Codex | OpenAI | 53.7% |
| 5 | Gemini 3.1 Pro | Google | 52.0% |
The gap between Verified scores (around 80% to 88%) and Pro scores (around 56% to 64%) highlights that harder, less-contaminated benchmarks still present significant challenges. Lite numbers run lower than Verified because the Lite subset retains some of the noisy task descriptions that the Verified curation effort filtered out, so even an oracle-level agent cannot push much higher than the mid-60s without leniency from the harness.
Publishing top-of-leaderboard accuracy without cost has become controversial. A reasoning-heavy agent that runs Claude Opus 4.7 with 200K thinking tokens per task can spend $10 to $20 per attempt; a faster Sonnet-class scaffold may achieve 65% to 70% at well under $1 per task. Several papers in 2025 and 2026 have proposed Pareto front reporting, plotting accuracy against dollar cost rather than ranking purely by accuracy, and Scale AI's leaderboard added cost columns in early 2026 to encourage more honest comparisons.[20][25]
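A Pareto front of this kind is straightforward to compute from (cost, accuracy) pairs. A minimal sketch with made-up entries rather than real leaderboard figures:

```python
def pareto_front(entries: list[dict]) -> list[dict]:
    """Keep entries that no other entry dominates (cheaper or equal cost AND
    equal or higher accuracy, with at least one strict improvement)."""
    front = []
    for e in entries:
        dominated = any(
            o["cost_usd"] <= e["cost_usd"]
            and o["resolved_pct"] >= e["resolved_pct"]
            and (o["cost_usd"] < e["cost_usd"] or o["resolved_pct"] > e["resolved_pct"])
            for o in entries
        )
        if not dominated:
            front.append(e)
    return sorted(front, key=lambda e: e["cost_usd"])

# Hypothetical agents, not real leaderboard figures:
agents = [
    {"name": "fast-scaffold", "cost_usd": 0.80, "resolved_pct": 67.0},
    {"name": "reasoning-heavy", "cost_usd": 14.00, "resolved_pct": 86.0},
    {"name": "mid-tier", "cost_usd": 5.00, "resolved_pct": 66.0},  # dominated by fast-scaffold
]
print([a["name"] for a in pareto_front(agents)])  # ['fast-scaffold', 'reasoning-heavy']
```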
SWE-bench has spurred the development of diverse agent architectures for autonomous software engineering. Most modern submissions follow the same broad pattern of an LLM driving a shell or editor, but the harness around it varies widely.
SWE-agent, developed by the same Princeton NLP group behind SWE-bench, is the official open-source baseline agent. Published at NeurIPS 2024, it introduced the concept of an Agent-Computer Interface (ACI): a set of custom shell commands designed to make repository navigation, code viewing, and editing easier for language models.[8]
The architecture works as follows: the agent receives the issue text and interacts with the repository through a small set of purpose-built commands for searching across files or within a file, opening and scrolling through files in a windowed viewer, and making edits that are checked by a linter before being accepted. Command outputs are truncated and formatted into concise observations, and the loop continues until the agent issues a submit command, at which point its accumulated diff becomes the candidate patch.
SWE-agent supports multiple LLM backends including GPT-4, Claude, and open-source models. When paired with Claude Opus 4.5, the Live-SWE-agent scaffold achieves 79.2% on SWE-bench Verified.[8] The paper's enduring contribution is the argument that agent performance depends as much on the interface design (which commands, what feedback, how truncation works) as on the underlying model.
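At its core, the ACI pattern is a loop in which the model emits one command per turn and receives a formatted observation back. A heavily simplified sketch, with llm_call and execute_command assumed rather than taken from the SWE-agent codebase:

```python
def agent_loop(issue_text: str, llm_call, execute_command, max_turns: int = 50):
    """Minimal ACI-style loop: the model sees the issue plus a running transcript,
    replies with one shell/editor command per turn, and ends by submitting a patch."""
    transcript = [f"ISSUE:\n{issue_text}"]
    for _ in range(max_turns):
        command = llm_call("\n".join(transcript))   # e.g. "search_dir validate_email" or an edit command
        if command.strip() == "submit":
            return execute_command("git diff")      # the accumulated diff becomes the submission
        observation = execute_command(command)      # output is truncated/formatted before being fed back
        transcript.append(f"> {command}\n{observation}")
    return None                                     # ran out of turns without submitting
```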
Devin, announced by Cognition Labs in March 2024, was one of the first commercial AI software engineering agents to gain widespread attention. On its initial SWE-bench evaluation, Devin resolved 13.86% of tasks unassisted (79 out of 570 tested), far exceeding the previous best of 1.96% (unassisted) and 4.80% (assisted with oracle retrieval). Notably, 72% of Devin's successful resolutions took over 10 minutes, indicating that its ability to iterate, run tests, and refine solutions contributed to its performance.[15]
Devin's launch demo, which showed the agent shipping a Bun benchmark, debugging a YOLO model, and posting on Upwork, generated extraordinary press coverage and helped Cognition raise more than $175 million. Subsequent independent evaluations were more skeptical: an October 2024 review by AI explainability lab Answer.AI found that Devin completed three out of twenty real-world tasks, with several runs ending in malformed PRs. Even so, the moment marked the start of public competition over SWE-bench scores as a marketing channel.
Aider is a popular open-source command-line coding assistant by Paul Gauthier that pairs a chat interface with whole-file edits. It uses tree-sitter to map repository structure and offers a benchmark mode that runs SWE-bench instances. While Aider is designed primarily as an interactive tool rather than a fully autonomous agent, its leaderboard publication helped popularize cost-adjusted reporting and the practice of comparing the same scaffold across many models.
OpenHands, formerly OpenDevin, is a community-driven open-source agent framework that consolidated several research scaffolds into a shared platform. By integrating browser, shell, and editor tools and supporting multiple LLM backends, OpenHands has been used to reproduce and extend submissions from Anthropic, Mistral, and academic labs. It powers a number of mid-tier entries on the public leaderboard.
Anthropic has reported the strongest sustained results on SWE-bench Verified across the Claude 3.5, 4, 4.5, 4.6, and 4.7 generations. The lab's Claude Code terminal agent, launched in early 2025, was tuned partly with SWE-bench-style harnesses and is the reference scaffold for Anthropic's reported numbers. By late 2025, Claude Code's bash tool, file editor, and computer use capabilities allowed it to resolve a wide range of issues with relatively shallow scaffolding, with much of the heavy lifting performed by the model itself.
| Agent | Developer | Approach |
|---|---|---|
| Amazon Q Developer Agent | Amazon | Enterprise-integrated agent with AWS tooling |
| Atlassian Rovo Dev | Atlassian | Agentic coding within Jira/Bitbucket ecosystem |
| Cursor AI | Anysphere | IDE-based agent with human-in-the-loop editing |
| Codex | OpenAI | Cloud-based agent running in sandboxed environments |
| Augment Code | Augment | Context-aware agent for large enterprise codebases |
| Moatless Tools | Independent | Lightweight scaffold popular for low-cost runs |
| Agentless | Princeton/UIUC | Pipeline-based; no agent loop, used for early Verified baselines |
| AutoCodeRover | NUS/Stanford | Spectrum-based fault localization plus targeted edits |
| CodeR | Independent | Multi-agent design with role specialization |
| iSWE-Agent | IBM | Specialized agent for Java issue resolution on Multi-SWE-bench |
Beyond commercial agents, academic research has explored several directions represented in the table above, including pipeline approaches that dispense with the agent loop entirely (Agentless), spectrum-based fault localization combined with targeted edits (AutoCodeRover), and multi-agent designs with role specialization (CodeR).
SWE-bench occupies a particular slice of the coding evaluation landscape. The table below contrasts it with other widely cited benchmarks.
| Benchmark | Granularity | Repo context | Agent loop | Languages | Typical task length |
|---|---|---|---|---|---|
| HumanEval | Function | None | No | Python | < 1 minute |
| MBPP | Function | None | No | Python | < 1 minute |
| CodeContests | Algorithm | Problem only | No | Multiple | Minutes |
| LiveCodeBench | Algorithm (timed) | Problem only | No | Python | Minutes |
| BigCodeBench | Function with libraries | Single file | No | Python | Minutes |
| RepoBench | Line completion | Repository | No | Python, Java | Seconds |
| CrossCodeEval | Cross-file completion | Repository | No | Python, Java, C#, TS | Seconds |
| SWE-bench | Issue resolution | Full repository | Yes | Python | 5 to 60 minutes |
| SWE-bench Pro | Issue resolution | Full repository | Yes | Multiple | 1 to 4+ hours |
| SWE-Lancer | Freelance project | Full repository, real money | Yes | Multiple | Minutes to days |
The contrast with HumanEval is the cleanest illustration of how the field has matured. HumanEval asks: "Given this docstring, write the function body." SWE-bench asks: "Given this issue, find the relevant code in a 50,000-file repository, understand why the test is failing, write a multi-file patch, and verify it passes the test suite without breaking anything else." The first is solved by autocomplete-quality code generation. The second requires planning, navigation, and self-correction, which is why it became the canonical agent benchmark.
The original 2,294-instance set had two well-known problems. First, the test suites for some issues did not actually verify the bug being reported, so a model could pass without truly fixing the issue. Second, some issue descriptions were so vague that even the original human author had needed back-and-forth in the comments to clarify the requirement. These problems combined to inject noise into model rankings and made small score differences hard to interpret.
The Verified curation, undertaken in collaboration with OpenAI Preparedness, addressed both. Ninety-three software developers reviewed each candidate task and labeled it on four axes: problem clarity, solution unambiguity, test coverage, and difficulty plausibility. Tasks failing any axis were removed. The result was a 500-instance set where a passing patch could be confidently interpreted as a real fix, not a lucky overfit to thin tests.
This curation is what made Verified the standard reporting target. From late 2024 through 2025, every frontier model release at Anthropic, OpenAI, Google DeepMind, DeepSeek, Moonshot, Alibaba, Meta, and Microsoft Research reported a Verified number alongside its other coding benchmarks, and product launches for Cursor, Claude Code, Cognition Devin, Amazon Q, and others highlighted Verified scores in their first-day marketing.
On February 23, 2026, OpenAI's Mia Glaese and Olivia Watkins published a blog post announcing that the lab would no longer report SWE-bench Verified scores for frontier model releases and would instead recommend SWE-bench Pro as the new community standard.[20][21] The decision was motivated by three findings.
First, the unsolved tasks contained a high proportion of broken specifications. OpenAI conducted a careful audit of 138 problems that were repeatedly missed by frontier models and found that more than 60% were unsolvable as stated. Forty-nine tests were too narrow, rejecting functionally correct submissions, while twenty-six tests were too wide, requiring features that were never mentioned in the issue. Roughly 59.4% of the hardest unsolved problems had flawed test cases, meaning further accuracy gains were no longer reliably measuring model improvements.[20][21][28]
Second, the lab discovered that every frontier model it tested could reproduce verbatim portions of the gold patch or problem statement when prompted with only a Verified Task ID. Models implicated in the audit included GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash, all of which appeared to have memorized parts of the dataset. In one striking example, GPT-5.2's chain-of-thought traces revealed knowledge of unspecified test requirements, suggesting the test patches had appeared somewhere in training data.[20][21]
Third, the benchmark was simply saturating. With public scores in the high 80s and internal scores in the 90s, the headroom remaining on Verified did not justify continued reporting, especially when the upper-bound score appeared to be limited by harness errors rather than capability gaps.
OpenAI's recommendation was to migrate to SWE-bench Pro, which uses 1,865 longer tasks across more diverse codebases and showed substantially less contamination evidence in their tests. Several other labs (Google DeepMind, Anthropic, Meta) continued to publish Verified scores after the OpenAI announcement but began including Pro numbers as well, and the joint reporting pattern is likely to persist through 2026 as the community converges on a successor.
Over 94% of SWE-bench issues were filed before the knowledge cutoff dates of major pre-trained language models. This raises the risk that models may have encountered the issues, discussions, or even the solution code during pre-training. While SWE-bench Live and SWE-bench Pro attempt to address this by using newer issues, the original benchmark and its Verified subset remain potentially contaminated, as confirmed by OpenAI's February 2026 audit.[7][16][20]
Research by the SWE-bench+ team (OpenLM.ai) found that approximately 32.67% of successful patches on the original benchmark involved solution leakage, where the issue report or its comments contained the solution code or strong hints pointing toward it. In a broader analysis, roughly 60% of resolved instances showed some form of direct or indirect solution leakage. When these problematic instances were filtered out, SWE-agent + GPT-4's resolution rate dropped from 12.47% to 3.97%, a substantial reduction.[16]
An empirical study found that over 15% of SWE-bench Verified instances have incomplete test patches that allow incorrect or partial solutions to pass the evaluation harness. Specifically, 12.50% of passing patches were found to be functionally or semantically incorrect, and 9.82% were incomplete, addressing only part of the issue or lacking necessary error handling. Advanced analysis frameworks like UTBoost and PatchDiff revealed that leaderboard scores may be inflated by 6 to 7 percentage points due to these test inadequacies.[17] OpenAI's later audit reported even higher rates of broken specifications among the hardest unsolved tasks.[20]
The original SWE-bench and its Verified and Lite subsets are limited to Python repositories. This means that performance on SWE-bench does not necessarily generalize to other languages commonly used in industry, such as Java, C++, TypeScript, or Go. SWE-bench Multilingual, Multi-SWE-bench, and SWE-bench Live partially address this gap but remain less widely adopted than the Python-focused variants.[12][13][24]
The 12 repositories in SWE-bench, while popular and well-maintained, represent a narrow slice of the software ecosystem. They are all open-source Python projects with strong test cultures. Many real-world codebases have sparse test coverage, proprietary dependencies, or architectural patterns not represented in these repositories. As a result, high SWE-bench scores may not predict performance on arbitrary production codebases.[11]
Epoch AI's analysis showed that the majority of SWE-bench Verified tasks are relatively simple: 91% can be completed by a human in under one hour, and the median gold patch changes only a handful of lines of code. This means that the benchmark primarily measures an agent's ability to fix straightforward bugs rather than tackle complex architectural challenges or large feature implementations.[11]
Running a full SWE-bench evaluation is resource-intensive, requiring at minimum 120 GB of free disk space, 16 GB of RAM, and 8 CPU cores (with 32 GB RAM recommended for parallel execution). The Docker image build process and per-instance container setup add significant overhead. Cloud evaluation through Modal reduces the hardware burden but introduces monetary costs.[3][10]
With multiple labs reporting Verified scores in the 80s and Anthropic posting an internal 93.9% with Claude Mythos Preview, the benchmark has clearly entered the saturation regime. Saturation does not mean coding is solved; it means the gap between the best models and the score ceiling has shrunk to within the noise of the harness. OpenAI's deprecation post acknowledged this directly, framing the move to Pro as a response to saturation rather than a critique of the benchmark's design.[20][27]
SWE-bench has had a substantial influence on AI and software engineering research, and it has become a standard evaluation metric across the AI industry.
The benchmark directly influenced the development of several high-profile AI coding products. Devin's launch in March 2024 explicitly highlighted its SWE-bench performance, helping attract $175 million in funding for Cognition Labs. Amazon, Atlassian, GitHub, and other companies have built agent products with SWE-bench as a primary evaluation target. The competitive pressure created by the public leaderboard accelerated progress, with the top score rising from under 2% to over 87% in roughly two and a half years, and prompted Anthropic to design Claude Code with SWE-bench-style harnesses in mind.[15]
The success and eventual deprecation of SWE-bench Verified prompted a wave of successor benchmarks aimed at restoring measurement validity for frontier coding evaluation.
Launched by OpenAI in February 2025, SWE-Lancer evaluates models on more than 1,400 real freelance software engineering tasks scraped from Upwork and verified by Expensify, with payouts totaling roughly $1 million in real dollars. Tasks range from $50 bug fixes to $32,000 feature implementations and split between independent contributor (IC) work and managerial decisions over technical proposals. Initial reported scores were modest: Claude 3.5 Sonnet completed 26.2% of IC tasks and 44.9% of managerial tasks, while OpenAI's o1 reached 20.3% and 46.3% respectively. The benchmark introduced the unusual practice of mapping AI capability to dollars earned, a metric that resonates outside the academic community.[29]
Subsequent variants focus on harder Upwork tasks that took the original human freelancers more than ten hours, with stricter test harnesses and human review of edge cases. By April 2026, top frontier scores on the Diamond split had climbed into the 30s but still trailed the headline numbers reported on Verified.
LiveSWEBench and SWE-rebench follow the same general philosophy as SWE-bench Live: refresh the dataset with newer issues to outpace training cutoffs. They differ in repository selection, harness design, and update cadence. Both have published their own leaderboards through 2025 and 2026.
Several companies have built proprietary internal versions of SWE-bench using their own codebases. Anthropic, Google, and Meta have all alluded to such sets in technical reports. Scale AI's SWE-bench Pro includes a private split that customers can run against frontier models to generate scores that are not subject to public leakage. The pattern points toward a future where public benchmarks function as broad capability checkpoints while private, contamination-resistant evaluations drive enterprise procurement.
The SWE-bench team and independent researchers continue to extend the benchmark to additional programming languages. SWE-bench Multilingual already covers 9 languages, and Multi-SWE-bench adds more annotated instances for Java, TypeScript, and other ecosystems. Future work may extend to the C/CUDA extensions underlying Python's machine learning stack, mobile development languages (Swift, Kotlin), and additional systems programming languages.[12][13]
SWE-bench Pro and similar efforts aim to move beyond simple bug fixes toward tasks that require deeper reasoning: large refactoring operations, security vulnerability remediation, performance optimization across multiple modules, and feature implementation that touches dozens of files. These harder distributions provide a more meaningful signal as top agents approach saturation on the Verified set.[14]
SWE-bench Live's monthly update cadence represents a shift toward continuous benchmarking that keeps pace with model training cutoffs. This approach may become the standard for preventing data contamination, with evaluation sets that always contain fresh, unseen issues.[7]
Future benchmarks may extend beyond isolated issue resolution to evaluate agents across the full software development lifecycle: writing design documents, creating pull requests, responding to code review feedback, managing CI/CD pipelines, and triaging incoming bug reports. SWE-Lancer's freelance simulation is one early step in this direction.
Current benchmarks measure fully autonomous performance, but most practical deployments involve human-AI collaboration. Future evaluation frameworks may measure how effectively an agent assists a human developer rather than replacing them entirely, capturing metrics like time saved, suggestion acceptance rate, and code quality improvement.
SWE-bench complements and is often compared with other code generation and software engineering benchmarks:
| Benchmark | Focus | Size | Languages |
|---|---|---|---|
| HumanEval | Function-level code completion | 164 problems | Python |
| MBPP | Basic Python programming | 974 problems | Python |
| CodeContests | Competitive programming | 13,328 problems | Multiple |
| DS-1000 | Data science coding | 1,000 problems | Python |
| RepoEval | Repository-level code completion | 1,600 problems | Python |
| CrossCodeEval | Cross-file code completion | 9,928 problems | Python, Java, C#, TypeScript |
| LiveCodeBench | Contamination-resistant competitive coding | Continuously updated | Python |
| SWE-bench+ | SWE-bench with leakage removed | Filtered subset | Python |
| SWE-rebench | Dynamic, decontaminated evaluation | Continuously updated | Python |
| SWE-Lancer | Freelance software work, dollar-weighted | 1,400+ tasks | Multiple |
| GAIA | General AI assistants | 466 questions | Multiple |
| AgentBench | LLM agent capabilities | 8 environments | Multiple |