SWE-bench
Last reviewed
May 8, 2026
Sources
38 citations
Review status
Source-backed
Revision
v6 · 9,258 words
| SWE-bench | |
|---|---|
| Overview | |
| Full name | Software Engineering Benchmark |
| Abbreviation | SWE-bench |
| Description | A benchmark for evaluating large language models and AI agents on real-world software engineering tasks from GitHub |
| Release date | 2023-10-10 |
| Latest variant | SWE-bench Live (monthly refresh), SWE-bench Pro (Scale AI) |
| Authors | Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan |
| Organization | Princeton University, University of Chicago, Stanford University |
| Technical details | |
| Type | Software Engineering, Code Generation, Bug Fixing |
| Modality | Text, Code |
| Task format | Issue resolution, Code editing |
| Number of tasks | 2,294 |
| Total examples | 2,294 (Full), 500 (Verified), 300 (Lite), 619 (Multimodal), 1,565 (Live), 1,865 (Pro) |
| Evaluation metric | % Resolved, Test Pass Rate |
| Domains | Software Engineering, Python Programming, Open Source Development |
| Languages | Python (Original/Verified/Lite), JavaScript (Multimodal), 9+ languages (Multilingual/Live/Pro) |
| Performance | |
| Baseline (Oct 2023) | 1.96% (Claude 2 with BM25) |
| SOTA (Verified) | 87.6% (Claude Opus 4.7) |
| SOTA date | 2026-04 |
| Saturated | Yes (per OpenAI February 2026) |
| Resources | |
| Website | Official website |
| Paper | arXiv:2310.06770 |
| GitHub | Repository |
| Dataset | Hugging Face download |
| License | MIT License |
SWE-bench (Software Engineering Benchmark) is an evaluation framework that tests whether AI systems can resolve real-world software engineering tasks drawn from actual GitHub issues and pull requests. It was introduced in October 2023 by Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan, with most of the team based at Princeton University and collaborators at the University of Chicago and Stanford University. The original dataset contains 2,294 task instances collected from 12 popular open-source Python repositories, including Django, SymPy, scikit-learn, and Matplotlib.
Unlike earlier code generation benchmarks such as HumanEval and MBPP, which test whether a model can write a single function from a docstring, SWE-bench asks an AI agent to read a real bug report or feature request, navigate a codebase with thousands of files, identify the relevant locations, write a patch, and have that patch pass the project's hidden test suite. The benchmark is graded automatically by running the patch against unit tests, so the AI either resolves the issue or it does not. There is no partial credit for nice-looking code that fails the tests.
The SWE-bench paper, titled "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?", was published at the International Conference on Learning Representations (ICLR) 2024 with an oral presentation, one of the highest distinctions at that venue. The benchmark immediately became a fixed point on the AI capability map: by late 2024 every frontier coding model release reported a SWE-bench number, and by 2025 the SWE-bench Verified subset had become the de facto standard line in nearly every model release announcement from Anthropic, OpenAI, Google, DeepSeek, Moonshot, Alibaba, and Meta. With over 2 million downloads from Hugging Face and adoption by leading AI research organizations worldwide, the benchmark has shaped how the industry measures progress in autonomous coding.
Progress on SWE-bench has been one of the steepest capability climbs in AI benchmark history. The original 2023 paper reported a peak resolution rate of 1.96% for Claude 2 with BM25 retrieval, leading the authors to conclude that frontier models could only solve the simplest issues. By April 2026, Claude Opus 4.7 reached 87.6% on SWE-bench Verified, a roughly 45-fold improvement in two and a half years. The pace of gains, combined with later evidence of training-data leakage, prompted OpenAI to formally deprecate Verified for frontier evaluations in February 2026 and to recommend the harder SWE-bench Pro variant instead.
Before SWE-bench, AI code generation benchmarks were almost entirely function-level. HumanEval, introduced by OpenAI in 2021, asked models to complete 164 short standalone Python functions. MBPP (Mostly Basic Python Programming) followed a similar pattern with about 1,000 simple problems. Both were saturating fast: by 2023 frontier models were scoring above 80% on HumanEval, and the field needed a harder question.
Anyone who had used an AI coding assistant on a real software project in 2023 already knew the gap was huge. Writing a function from a clean docstring is one thing; opening a 50,000-file repository, reading a bug report that might or might not include a stack trace, hunting down the responsible code, and shipping a fix that passes a project's existing test suite is another. The skills do not transfer cleanly. A model that scores 90% on HumanEval might fail on the simplest real PR in a Django ticket queue.
The Princeton team's insight was that open-source repositories on GitHub already contain a complete, naturally occurring record of software engineering work in the form of issues paired with the merged pull requests that resolve them. Each such pair is essentially a software engineering problem with both a known good solution (the merged diff) and an automatic grading rubric (the test cases the merged PR added or modified). By collecting these pairs and replaying them in controlled environments, the team could build a benchmark that mirrors what professional developers actually do, with no manual labeling and no synthetic problems.
The paper posed the question directly in its title: can large language models resolve real-world GitHub issues? At the time of release, the answer was sobering. Claude 2, the strongest model the team tested, resolved 1.96% of tasks when given files retrieved through BM25, a standard information retrieval algorithm. With oracle retrieval, where the model received the exact files that needed editing, performance topped out at 4.80%. Fine-tuned variants of CodeLlama-7B and CodeLlama-13B (branded SWE-Llama) performed comparably or slightly worse despite training on the SWE-bench-train companion set. The picture in late 2023 was clear: frontier LLMs understood Python syntax but struggled to operate inside production repositories.
The SWE-bench team chose 12 popular, actively maintained, well-tested Python repositories spanning different software domains. The repositories were picked for size, maintenance quality, and the existence of comprehensive test suites that could serve as automated graders.
| Repository | Domain | Task instances | Share of dataset |
|---|---|---|---|
| django/django | Web framework | 850 | 37.1% |
| sympy/sympy | Symbolic mathematics | 386 | 16.8% |
| scikit-learn/scikit-learn | Machine learning | 229 | 10.0% |
| sphinx-doc/sphinx | Documentation generator | 187 | 8.2% |
| matplotlib/matplotlib | Data visualization | 184 | 8.0% |
| pytest-dev/pytest | Testing framework | 119 | 5.2% |
| pydata/xarray | Labeled array data | 110 | 4.8% |
| astropy/astropy | Astronomy library | 95 | 4.1% |
| pylint-dev/pylint | Code linter | 57 | 2.5% |
| psf/requests | HTTP library | 44 | 1.9% |
| mwaskom/seaborn | Statistical visualization | 22 | 1.0% |
| pallets/flask | Web microframework | 11 | 0.5% |
| Total | | 2,294 | 100% |
Django alone accounts for 37% of all tasks, and the top three repositories (Django, SymPy, and scikit-learn) account for nearly 64%. This concentration matters for evaluation: a model's SWE-bench score is disproportionately shaped by its ability to work with Django's codebase, conventions, and testing patterns. The skew is even more pronounced in SWE-bench Verified, where Django climbs to 46.2% of tasks. A model that is excellent at Django but mediocre everywhere else can post a misleadingly high score. Flask contributes the fewest instances (11) due to its smaller codebase and less frequent issue activity. The 12 repositories together contain hundreds of thousands of files; even after retrieval, an agent typically operates inside a working tree of 5,000 to 50,000 source files.
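The effect of this skew can be illustrated with a small arithmetic sketch. The aggregate score is an instance-weighted mean of per-repository resolution rates, so a large Django share lets Django performance dominate. Only the 46.2% Django share of Verified comes from the text; the per-repository rates and the two model profiles below are hypothetical.

```python
def aggregate_score(shares, rates):
    """Instance-weighted mean of per-group resolution rates."""
    assert abs(sum(shares.values()) - 1.0) < 1e-9, "shares must sum to 1"
    return sum(shares[group] * rates[group] for group in shares)


# Django's 46.2% share of Verified is from the text; all rates are hypothetical.
shares = {"django": 0.462, "other": 0.538}
django_specialist = {"django": 0.80, "other": 0.40}
generalist = {"django": 0.55, "other": 0.55}

specialist_score = aggregate_score(shares, django_specialist)
generalist_score = aggregate_score(shares, generalist)
print(f"specialist {specialist_score:.1%}, generalist {generalist_score:.1%}")
```

Under these assumed rates the Django specialist posts the higher headline number even though it is weaker on more than half of the benchmark.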
Each task instance is derived from a pull request that resolves one or more GitHub issues. The construction pipeline filtered roughly 90,000 PRs scraped from these 12 repositories down to 2,294 high-quality instances through successive filtering stages.
The team also released SWE-bench-train, a companion training set with about 19,000 non-testing instances drawn from 37 repositories, which gives researchers a larger pool for fine-tuning experiments without contaminating the evaluation set.
A key property of the construction is that every issue is paired with a real human solution. This gold patch defines what "correct" means at the file level (which files were changed, how many lines were added or removed) and at the behavioral level (which tests now pass). Because a known-good human solution exists, SWE-bench can score agents end-to-end without relying on stylistic similarity metrics like BLEU. The unit tests are the judge.
SWE-bench uses Docker containers to ensure reproducible and isolated evaluation. Without containerization, comparing scores across labs would be effectively impossible because Python dependency resolution is famously fragile, and a single difference in NumPy or pandas versions can flip which tests pass.
The harness builds images in three layers: a base image with shared dependencies, about 60 environment images covering different repository version combinations, and per-task instance images with task-specific dependency pins. For each task, the standard evaluation flow applies the generated patch inside the task's container with git apply or equivalent tooling, then runs the task's tests.
A patch that addresses the issue but breaks an unrelated test counts as a failure, which discourages over-aggressive refactors. A patch that does not apply cleanly because of formatting drift or whitespace also counts as a failure.
The primary metric is % Resolved, the percentage of task instances where the agent's patch causes all FAIL_TO_PASS tests to pass without breaking any PASS_TO_PASS tests. The community also tracks several secondary metrics alongside it.
Cost reporting has become an increasingly important secondary metric. Two agents at 70% resolution rate are not equivalent if one consumes $0.50 per task and the other $15. Recent agent papers commonly publish per-instance dollar cost alongside the headline accuracy figure.
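The resolution criterion and the cost comparison described above can be sketched in a few lines. This is a minimal illustration and not the harness's own code: the record layout, the field names, and the "PASSED" status string are assumptions made for the example.

```python
def is_resolved(test_results, fail_to_pass, pass_to_pass):
    """A task is resolved only if every FAIL_TO_PASS test now passes
    and no PASS_TO_PASS test has been broken."""
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test missing from the results (e.g. the patch did not apply) counts as failed.
    return all(test_results.get(name) == "PASSED" for name in required)


def summarize(runs):
    """Return (% resolved, mean dollar cost per task) for a list of run records."""
    resolved = sum(
        is_resolved(r["results"], r["fail_to_pass"], r["pass_to_pass"]) for r in runs
    )
    pct = 100.0 * resolved / len(runs)
    mean_cost = sum(r["cost_usd"] for r in runs) / len(runs)
    return pct, mean_cost


# Hypothetical records; field names are illustrative, not the harness schema.
runs = [
    {"results": {"test_fix": "PASSED", "test_old": "PASSED"},
     "fail_to_pass": ["test_fix"], "pass_to_pass": ["test_old"], "cost_usd": 0.40},
    # Fixes the reported bug but breaks an existing test: not resolved.
    {"results": {"test_fix": "PASSED", "test_old": "FAILED"},
     "fail_to_pass": ["test_fix"], "pass_to_pass": ["test_old"], "cost_usd": 0.60},
]
pct, mean_cost = summarize(runs)
print(f"{pct:.1f}% resolved at ${mean_cost:.2f} per task")
```

There is no partial credit in this scheme: the second record passes the target test but still scores zero.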
In January 2025 the SWE-bench team added cloud-based evaluation through Modal, removing the need for local Docker infrastructure. Researchers can run evaluations entirely in the cloud by installing the swebench[modal] package and setting the --modal true flag.
For leaderboard submissions, the team released sb-cli, a command-line tool that standardizes the submission process. After authenticating with sb login, researchers submit predictions using sb submit --predictions <path>, and the evaluation runs on centralized infrastructure to ensure consistent and reproducible results.
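Whichever evaluation route is used, predictions are supplied as a file of per-task records. The sketch below writes such a file as JSON Lines. The three key names follow the prediction format as documented in the SWE-bench repository, to the best of this article's understanding, and should be confirmed against the current harness documentation; the model label and the patch text are placeholders.

```python
import json
import tempfile
from pathlib import Path


def write_predictions(predictions, path):
    """Write one JSON object per line, one line per task instance."""
    with open(path, "w", encoding="utf-8") as f:
        for prediction in predictions:
            f.write(json.dumps(prediction) + "\n")


predictions = [
    {
        "instance_id": "django__django-14725",  # a Verified task ID cited in this article
        "model_name_or_path": "my-agent-v0",    # hypothetical label
        # Placeholder text, not a real patch.
        "model_patch": "diff --git a/example.py b/example.py\n...",
    }
]

out_path = Path(tempfile.mkdtemp()) / "preds.jsonl"
write_predictions(predictions, out_path)
print(out_path)
```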
The original paper used two retrieval modes for providing code to the model. BM25 retrieval uses the Pyserini BM25 retriever to select relevant files from the repository based on the issue text, with a 27,000-token context budget (measured with OpenAI's cl100k_base tokenizer). In roughly 40% of instances BM25 retrieves a superset of the files that actually need editing, but in nearly half of cases it fails to retrieve any of the needed files. Oracle retrieval gives the model the exact files modified in the gold patch, providing an upper bound on performance when file localization is perfect.
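The BM25 mode can be pictured as ranking files against the issue text and then packing the top results into a fixed context budget. The following is a self-contained sketch of that idea and not the paper's pipeline: it uses a plain Okapi BM25 scorer with common textbook parameters instead of Pyserini, counts whitespace-separated words instead of cl100k_base tokens, and the file paths and contents are invented.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Rank documents (name -> text) by Okapi BM25 score against the query."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: in how many files each term appears.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    query_terms = set(tokenize(query))
    scores = {}
    for name, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


def select_within_budget(ranked, docs, budget_tokens):
    """Greedily take top-ranked files until the token budget is spent."""
    chosen, used = [], 0
    for name in ranked:
        cost = len(tokenize(docs[name]))  # stand-in for a real tokenizer count
        if used + cost > budget_tokens:
            break
        chosen.append(name)
        used += cost
    return chosen


# Invented repository contents and issue text.
docs = {
    "forms/formsets.py": "class BaseFormSet formset validation extra forms edit",
    "db/models/query.py": "class QuerySet filter exclude annotate",
    "utils/text.py": "def slugify text normalize",
}
issue = "formset should disallow new forms when validation runs"
ranked = bm25_rank(issue, docs)
context = select_within_budget(ranked, docs, budget_tokens=10)
print(ranked, context)
```

Oracle retrieval corresponds to skipping the ranking step and passing in exactly the files touched by the gold patch.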
Modern agent-based approaches have largely moved beyond static retrieval. Today's agents interactively browse the repository, search for definitions, run shell commands, and navigate the codebase. Tools like ripgrep, language-server queries, and AST-aware code search have become standard, and modern context windows of 200K to 1M tokens make it practical to load substantial slices of a repository at once.
The SWE-bench team and outside contributors have produced a family of related datasets that share the same evaluation harness but differ in size, language coverage, contamination resistance, and difficulty. The table below summarizes the major variants.
| Variant | Released | Size | Languages | Notes |
|---|---|---|---|---|
| SWE-bench (Full) | Oct 2023 | 2,294 | Python | Original benchmark; 12 repositories |
| SWE-bench Lite | Mar 2024 | 300 | Python | Cheaper functional bug-fix subset |
| SWE-bench Verified | Aug 2024 | 500 | Python | Human-curated by 93 reviewers; OpenAI collaboration |
| SWE-bench Multimodal | Oct 2024 | 619 | JavaScript | Issues with images and screenshots |
| SWE-bench Multilingual | 2025 | 300 | 9 languages | C, C++, Go, Java, JS, TS, PHP, Ruby, Rust |
| SWE-bench Live | May 2025 | 1,565+ | 8+ languages | Monthly refresh; anti-contamination |
| Multi-SWE-bench | 2025 | 1,632 | 7 languages | ByteDance Seed; NeurIPS 2025 D&B |
| SWE-bench Pro | Sep 2025 | 1,865 | Multiple | Scale AI; long-horizon, commercial codebases |
| SWE-bench+ | Oct 2024 | Filtered subset | Python | OpenLM.ai; leakage removed |
| SWE-rebench | 2025 | Continuously updated | Python | Decontamination focus |
| Aider Polyglot | Jul 2024 | 225 | C++, Go, Java, JS, Python, Rust | Aider's instruction-following coding test |
SWE-bench Lite is a 300-instance subset created in March 2024 to make evaluation faster and more accessible. The instances focus on self-contained, functional bug fixes that can be resolved with targeted code changes, making it well-suited for rapid prototyping and iterative agent development. A full evaluation run on Lite takes a fraction of the time of the full benchmark.
Despite its name, Lite is not trivial. As of April 2026, the leading score on SWE-bench Lite was 62.7% by Claude Opus 4.6, with MiniMax M2.5 in second at 56.3%, well short of the 80%+ scores recorded on the larger Verified set. The gap exists because Lite was constructed from the original 2,294-instance pool and includes some of the same noisy task descriptions that the Verified curation effort later filtered out.
Released on August 13, 2024, in collaboration with OpenAI Preparedness, SWE-bench Verified contains 500 instances individually reviewed by 93 experienced software developers. Each task was checked against four criteria: clear problem description, unambiguous solution, adequate test coverage, and reasonable difficulty. By filtering out noisy tasks, Verified gives a cleaner signal of agent capability and quickly became the most widely cited variant.
The curation process screened 1,699 candidate problems with three independent expert reviews per problem. About 38.3% of samples were flagged for underspecified problem statements, and 61.1% were flagged for unit tests that might unfairly mark valid solutions as incorrect. In total, roughly 68.3% of the original samples were removed, leaving the curated 500. The first official baseline reported by OpenAI on Verified was 33.2% for GPT-4o paired with the Agentless scaffold, a substantial jump compared with the few-percent scores typical on the full benchmark.
According to an analysis by Epoch AI, 39% of SWE-bench Verified tasks are "trivial changes" requiring fewer than 15 minutes of human effort, 52% are "small changes" estimated at 15 minutes to one hour, only about 8% fall into the 1-to-4-hour range, and just three instances were estimated to require more than four hours. Quick fixes average around 5 changed lines of code, while the longer tasks average roughly 50 lines. This task-difficulty profile matters when reading the leaderboard. A model resolving 80% of Verified is solving mostly small, well-specified bugs, not architectural overhauls.
Verified became the default reporting standard for nearly every frontier coding model release between late 2024 and early 2026. Anthropic's Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude Sonnet 4.5, Claude Opus 4.5, Claude Opus 4.6, and Claude Opus 4.7 all reported headline numbers on Verified, as did OpenAI's GPT-4o, o1, o3, GPT-5, and GPT-5.4; Google's Gemini 2.5 Pro and Gemini 3 family; DeepSeek-R1 and DeepSeek-V4; MiniMax M2 and M2.5; Moonshot's Kimi K2; Alibaba's Qwen series; and Meta's Llama variants.
The original 2,294-instance set had two well-known problems. First, the test suites for some issues did not actually verify the bug being reported, so a model could pass without truly fixing the issue. Second, some issue descriptions were so vague that even the original human author had needed back-and-forth in the comments to clarify the requirement. These problems combined to inject noise into model rankings and made small score differences hard to interpret.
The Verified curation, undertaken in collaboration with OpenAI Preparedness, addressed both. Ninety-three software developers reviewed each candidate task and labeled it on four axes: problem clarity, solution unambiguity, test coverage, and difficulty plausibility. Tasks failing any axis were removed. The result was a 500-instance set where a passing patch could be confidently interpreted as a real fix, not a lucky overfit to thin tests.
This curation is what made Verified the standard reporting target. From late 2024 through 2025, every frontier model release at Anthropic, OpenAI, Google DeepMind, DeepSeek, Moonshot, Alibaba, Meta, and Microsoft Research reported a Verified number alongside its other coding benchmarks, and product launches for Cursor, Claude Code, Cognition Devin, Amazon Q, and others highlighted Verified scores in their first-day marketing.
Introduced in October 2024 (arXiv:2410.03859), SWE-bench Multimodal extends the benchmark to tasks where the issue report contains visual elements. The dataset comprises 619 task instances drawn from 17 user-facing JavaScript repositories covering web UI design, data visualization, digital art, and mapping. Across all task instances, there are 862 images embedded in problem statements, including code screenshots (194), diagrams (107), error messages (54), digital art (38), maps (35), and data visualizations (28). The variant tests whether AI systems can ground visual cues to specific codebase entities, for example recognizing a rendering bug from a screenshot and tracing it to the responsible CSS or JavaScript code.
Multimodal scores have lagged Verified by a wide margin because the task requires both vision and code reasoning, and because the JavaScript repositories use different testing frameworks (Jest, Mocha, Playwright) than the Python set, making patch validation harder.
SWE-bench Multilingual extends evaluation beyond Python to 9 programming languages: C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, and Rust. It contains 300 tasks across 42 repositories and follows the same collection strategy and evaluation protocol as the original benchmark. The deliberately smaller size keeps evaluations quick to run.
Developed by ByteDance Seed and accepted to the NeurIPS 2025 Datasets and Benchmarks track, Multi-SWE-bench is a separate multilingual effort containing 1,632 high-quality instances across Java, TypeScript, JavaScript, Go, Rust, C, and C++. The instances were carefully annotated from 2,456 candidates by 68 expert annotators, giving it broader coverage than SWE-bench Multilingual at higher annotation quality.
SWE-bench Live, developed by Microsoft Research and announced in May 2025, addresses data contamination by restricting the dataset to issues created after January 2024. Because these issues postdate the training cutoffs of most models in circulation when the benchmark launched, they provide a contamination-free signal. The platform now contains 1,565 task instances spanning 164 repositories across Python, C, C++, C#, Java, Go, JavaScript, TypeScript, and Rust, with both Linux and Windows runners. The dataset is updated monthly, adding 50 newly verified high-quality issues each cycle. A lite subset samples 50 instances per month from October 2024 to March 2025, yielding a compact 300-instance set that balances recency with evaluation efficiency. To guard against test flakiness, the validation process is repeated multiple times and only instances with consistent results across all runs are retained.
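The flakiness guard and the lite-subset arithmetic described above can be sketched as follows. This is an illustrative reading of the text and not SWE-bench Live's published code; the instance IDs, test names, and data layout are hypothetical.

```python
def keep_stable_instances(run_outcomes):
    """Keep instance IDs whose per-test outcomes are identical in every validation run.

    run_outcomes maps instance ID -> list of runs, each run a dict of test name -> status.
    """
    stable = []
    for instance_id, runs in run_outcomes.items():
        if runs and all(run == runs[0] for run in runs[1:]):
            stable.append(instance_id)
    return stable


# Hypothetical validation results from three repeated runs per instance.
run_outcomes = {
    "repo_a-101": [{"t1": "PASSED", "t2": "FAILED"}] * 3,
    "repo_b-202": [{"t1": "PASSED"}, {"t1": "FAILED"}, {"t1": "PASSED"}],  # flaky
}
stable = keep_stable_instances(run_outcomes)

# Lite subset size: 50 instances for each month from October 2024 to March 2025.
months = ["2024-10", "2024-11", "2024-12", "2025-01", "2025-02", "2025-03"]
lite_size = 50 * len(months)
print(stable, lite_size)
```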
Introduced by Scale AI in September 2025 (arXiv:2509.16941), SWE-bench Pro is designed as a more rigorous successor that better reflects real-world software engineering difficulty. It expands to 1,865 long-horizon tasks across public, held-out, and commercial codebases. The benchmark is partitioned into three subsets:
| Subset | Tasks | Repositories | Description |
|---|---|---|---|
| Public | 731 | 11 open-source (GPL-licensed) | Freely accessible for research |
| Held-out | 858 | 12 repositories | Reserved for leaderboard evaluation |
| Private (Commercial) | 276 | 18 proprietary startup codebases | Partnership with private companies |
The use of GPL-licensed code for the public set is a deliberate contamination-resistance strategy: the strong copyleft license acts as a legal deterrent against including the code in model training data. Tasks are explicitly chosen to require 1 to 4 hours of human effort or more, in contrast with Verified's bias toward sub-hour fixes. Pro also ships a standardized agent scaffold called SEAL (Standardized Evaluation for Agentic LLMs) so that scores reflect model capability rather than scaffolding tricks.
Top models scored around 23% on Pro's public set in early reports. By April 2026 the top score had climbed to the mid-60s (64.3% for Claude Opus 4.7), well below the same model's Verified score in the high 80s. The gap is widely cited as evidence that Verified scores are no longer informative for frontier evaluation.
SWE-bench+ is an OpenLM.ai effort that filters the original benchmark to remove instances showing signs of solution leakage. Its analyses, discussed in the criticism section below, helped seed the broader contamination conversation. SWE-rebench is a separate contamination-resistant evaluation platform that publishes its own continuously updated leaderboard and uses different sampling strategies than SWE-bench Live.
Although not part of the SWE-bench family proper, the Aider Polyglot benchmark is the closest cousin in spirit. It uses 225 of the hardest Exercism problems across C++, Go, Java, JavaScript, Python, and Rust to test instruction-following whole-file edits. Polyglot scores are commonly reported alongside SWE-bench numbers in model release announcements, particularly by labs that want to highlight multi-language coding ability without committing to the heavier infrastructure required for SWE-bench Pro.
The headline figures across SWE-bench variants from 2023 to 2026 trace one of the steepest capability climbs in any AI benchmark. The numbers below come from official model release announcements or the SWE-bench leaderboard, not press headlines.
| Date | Best % Resolved | Leading model / agent | Variant | Significance |
|---|---|---|---|---|
| Oct 2023 | 1.96% | Claude 2 (BM25 retrieval) | Full | Benchmark release; Princeton baseline |
| Mar 2024 | 13.86% | Devin (Cognition Labs) | Full (570-task sample) | First commercial agent above 10% |
| Apr 2024 | 12.47% | SWE-agent + GPT-4 | Full | Open-source agent baseline (NeurIPS 2024) |
| Aug 2024 | 33.2% | GPT-4o + Agentless | Verified | OpenAI launches Verified subset |
| Oct 2024 | 49.0% | Claude 3.5 Sonnet (new) | Verified | First model to cross 49% |
| Dec 2024 | 53.0% | OpenAI o1 + scaffold | Verified | Reasoning models gain ground |
| Jan 2025 | 49.2% | DeepSeek-R1 | Verified | First open-weights model in the 49% range |
| Feb 2025 | 70.3% | Claude 3.7 Sonnet | Verified | Extended thinking helps |
| May 2025 | 72.5% / 72.7% | Claude Opus 4 / Sonnet 4 | Verified | New SOTA from Anthropic |
| Aug 2025 | 74.9% | GPT-5 | Verified | OpenAI's first GPT-5 number |
| Aug 2025 | 74.5% | Claude Opus 4.1 | Verified | Focused upgrade |
| Sep 2025 | 77.2% | Claude Sonnet 4.5 | Verified | Sonnet tier passes Opus 4 |
| Nov 2025 | 80.9% | Claude Opus 4.5 | Verified | First confirmed 80%+ score |
| Nov 2025 | 76.2% | Gemini 3 Pro | Verified | Google's first Gemini 3 number |
| Feb 2026 | 85.0% | GPT-5.3 Codex | Verified | OpenAI peaks before deprecation |
| Apr 2026 | 87.6% | Claude Opus 4.7 | Verified | Current public SOTA |
The trajectory is striking. In just over thirty months the best score rose from 1.96% to 87.6%, roughly 45-fold. Three concurrent advances drove the climb: stronger base models, better agent scaffolding, and richer tool use. Each leap on the table can usually be attributed to one of those three factors. The Devin number in March 2024 came mostly from agent scaffolding (the underlying model was GPT-4 or Claude class). The Claude 3.5 Sonnet jump from 33% to 49% in October 2024 came largely from the model itself. The 70% to 80% climb across 2025 came from a mix of all three plus the rise of explicit reasoning chains in OpenAI o1, OpenAI o3, and Claude's extended-thinking modes.
The public SWE-bench Verified leaderboard is in flux because many labs stopped submitting after OpenAI's February 2026 deprecation announcement. The April 2026 snapshot below combines submissions to swebench.com, llm-stats.com, and lab-reported scores. Anthropic's Claude Mythos Preview, an internal cybersecurity-focused model that the company has stated will not be made generally available, posted 93.9% but is excluded from the table because it is not a public release.
| Rank | Model / agent | Organization | % Resolved | Date |
|---|---|---|---|---|
| 1 | Claude Opus 4.7 | Anthropic | 87.6% | 2026-04 |
| 2 | GPT-5.3 Codex | OpenAI | 85.0% | 2026-03 |
| 3 | Claude Opus 4.5 | Anthropic | 80.9% | 2026-03 |
| 4 | Claude Opus 4.6 | Anthropic | 80.8% | 2026-03 |
| 5 | Gemini 3.1 Pro | Google DeepMind | 80.6% | 2026-02 |
| 5 | DeepSeek-V4-Pro-Max | DeepSeek | 80.6% | 2026-02 |
| 7 | MiniMax M2.5 | MiniMax | 80.2% | 2026-02 |
| 7 | Kimi K2.6 | Moonshot AI | 80.2% | 2026-03 |
| 9 | GPT-5.2 | OpenAI | 80.0% | 2026-02 |
| 10 | Claude Sonnet 4.6 | Anthropic | 79.6% | 2026-03 |
| 11 | DeepSeek-V4-Flash-Max | DeepSeek | 79.0% | 2026-02 |
| 12 | Qwen3.6 Plus | Alibaba | 78.8% | 2026-04 |
| 13 | MiMo-V2-Pro | Xiaomi | 78.0% | 2026-03 |
| 13 | Gemini 3 Flash | Google DeepMind | 78.0% | 2026-02 |
| 15 | GLM-5 | Zhipu AI | 77.8% | 2026-04 |
Performance on Pro is substantially lower than on Verified, reflecting both the harder long-horizon tasks and the absence of contamination.
| Rank | Model / agent | Organization | % Resolved |
|---|---|---|---|
| 1 | Claude Opus 4.7 | Anthropic | 64.3% |
| 2 | GPT-5.4 (xHigh) | OpenAI | 59.1% |
| 3 | GPT-5.3 Codex | OpenAI | 56.8% |
| 4 | GPT-5.2 Codex | OpenAI | 56.4% |
| 5 | GPT-5.2 | OpenAI | 55.6% |
On SWE-bench Lite, the leading scores as of April 2026 were:
| Rank | Model / agent | Organization | % Resolved |
|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 62.7% |
| 2 | MiniMax M2.5 | MiniMax | 56.3% |
| 3 | Claude Sonnet 4.6 | Anthropic | 55.0% |
| 4 | GPT-5.3 Codex | OpenAI | 53.7% |
| 5 | Gemini 3.1 Pro | Google DeepMind | 52.0% |
The gap between Verified scores (around 80% to 88%) and Pro scores (around 56% to 64%) highlights that harder, less-contaminated benchmarks still present significant challenges. Lite numbers run below Verified because Lite retains noisy task descriptions that the Verified curation later filtered out, so even an oracle agent struggles to push much above the mid-60s without leniency from the harness.
Publishing top-of-leaderboard accuracy without cost has become controversial. A reasoning-heavy agent running Claude Opus 4.7 with 200K thinking tokens per task can spend $10 to $20 per attempt; a faster Sonnet-class scaffold may achieve 65% to 70% at well under $1 per task. Several papers in 2025 and 2026 have proposed Pareto-front reporting, plotting accuracy against dollar cost rather than ranking purely by accuracy. Scale AI's Pro leaderboard added cost columns in early 2026 to encourage more honest comparisons.
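Pareto-front reporting keeps every agent that is not beaten on both cost and accuracy by some other agent. The sketch below shows one straightforward way to compute such a front; the agent labels and figures are hypothetical and do not correspond to any leaderboard entry.

```python
def pareto_front(entries):
    """Return entries not dominated by any other entry, sorted by cost.

    An entry is dominated if another entry is at least as cheap and at least
    as accurate, and strictly better on one of the two.
    """
    front = []
    for name, cost, acc in entries:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for _, c, a in entries
        )
        if not dominated:
            front.append((name, cost, acc))
    return sorted(front, key=lambda entry: entry[1])


# Hypothetical agents: (label, dollars per task, % resolved).
entries = [
    ("heavy-reasoning", 15.00, 87.0),
    ("mid-scaffold", 0.80, 68.0),
    ("wasteful", 12.00, 66.0),  # costlier and less accurate than mid-scaffold
    ("tiny", 0.10, 40.0),
]
front = pareto_front(entries)
for name, cost, acc in front:
    print(f"{name}: ${cost:.2f} per task, {acc:.1f}% resolved")
```

A ranking by accuracy alone would place the "wasteful" entry third of four; the Pareto view drops it because a cheaper agent already does better.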
SWE-bench tasks need more than a language model generating code. They need an agent that can interact with a codebase, explore files, run tests, and iterate on solutions. The benchmark spawned a thriving ecosystem of open-source agent frameworks.
SWE-agent, developed by the same Princeton team behind SWE-bench, is the official open-source baseline. Published at NeurIPS 2024, it introduced the concept of an Agent-Computer Interface (ACI): a set of custom shell commands designed to make repository navigation, code viewing, and editing easier for language models.
SWE-agent supports multiple LLM backends including GPT-4, Claude, and open-source models. When paired with Claude Opus 4.5, the Live-SWE-agent scaffold achieves 79.2% on SWE-bench Verified. The paper's enduring contribution is the argument that agent performance depends as much on interface design (which commands, what feedback, how truncation works) as on the underlying model. A weaker model with a good ACI can outperform a stronger model with a clumsy one.
Aider is a popular open-source command-line coding assistant by Paul Gauthier that pairs a chat interface with whole-file edits. It uses tree-sitter to map repository structure and offers a benchmark mode that runs SWE-bench instances. While Aider is designed primarily as an interactive tool rather than a fully autonomous agent, its Polyglot benchmark and per-model leaderboard helped popularize cost-adjusted reporting and the practice of comparing the same scaffold across many models.
OpenHands, formerly OpenDevin, is a community-driven open-source agent framework that consolidated several research scaffolds into a shared platform. By integrating browser, shell, and editor tools and supporting multiple LLM backends, OpenHands has been used to reproduce and extend submissions from Anthropic, Mistral, and academic labs. It powers a number of mid-tier entries on the public leaderboard.
Devin, announced by Cognition Labs in March 2024, was the first commercial AI software engineering agent to gain mass attention. On its initial SWE-bench evaluation, Devin resolved 13.86% of tasks unassisted (79 of 570 tested), far exceeding the previous best of 1.96% (unassisted) and 4.80% (assisted with oracle retrieval). Notably, 72% of Devin's successful resolutions took over 10 minutes, indicating that its ability to iterate, run tests, and refine solutions contributed to its performance.
Devin's launch demo, which showed the agent shipping a Bun benchmark, debugging a YOLO model, and posting on Upwork, generated extraordinary press coverage and helped Cognition raise more than $175 million. Subsequent independent evaluations were more skeptical: an October 2024 review by Answer.AI found Devin completed 3 of 20 real-world tasks, with several runs ending in malformed PRs. Even so, the launch marked the start of public competition over SWE-bench scores as a marketing channel.
Anthropic has reported the strongest sustained results on SWE-bench Verified across the Claude 3.5, 4, 4.5, 4.6, and 4.7 generations. The lab's Claude Code terminal agent, launched in early 2025, was tuned partly with SWE-bench-style harnesses and is the reference scaffold for Anthropic's reported numbers. By late 2025, Claude Code's bash tool, file editor, and computer use capabilities allowed it to resolve a wide range of issues with relatively shallow scaffolding, with much of the heavy lifting performed by the model itself.
| Agent | Developer | Approach |
|---|---|---|
| Amazon Q Developer Agent | Amazon | Enterprise-integrated agent with AWS tooling |
| Atlassian Rovo Dev | Atlassian | Agentic coding within Jira/Bitbucket ecosystem |
| Cursor | Anysphere | IDE-based agent with human-in-the-loop editing |
| Codex | OpenAI | Cloud-based agent running in sandboxed environments |
| Augment Code (Auggie CLI) | Augment | Context-aware agent for large enterprise codebases |
| Moatless Tools | Independent | Lightweight scaffold popular for low-cost runs |
| Agentless | Princeton/UIUC | Pipeline-based; no agent loop, used for early Verified baselines |
| AutoCodeRover | NUS/Stanford | Spectrum-based fault localization plus targeted edits |
| CodeR | Independent | Multi-agent design with role specialization |
| Mini-SWE-agent | Princeton | Minimalist 100-line agent using only bash; >74% on Verified |
| iSWE-Agent | IBM | Specialized agent for Java issue resolution on Multi-SWE-bench |
A notable recent direction is the minimalist approach exemplified by Mini-SWE-agent, which achieves competitive scores using only bash commands and no custom tools. This suggests that, at the frontier, model quality matters more than scaffolding sophistication: a single model's scores under a sophisticated harness and under a bash-only loop can land within a few points of each other, while two different models in the same harness can differ by twenty points.
Beyond commercial agents, academic research has explored several innovative directions:
Over 94% of SWE-bench issues were filed before the knowledge cutoff dates of major pre-trained language models. This raises the risk that models may have encountered the issues, the discussions, or even the solution code during pre-training. SWE-bench Live and SWE-bench Pro attempt to address this by using newer issues, but the original benchmark and its Verified subset remain potentially contaminated, as confirmed by OpenAI's February 2026 audit.
In late 2025 OpenAI ran a careful audit and found concrete evidence of contamination. Every frontier model the lab tested (GPT-5.2, Claude Opus 4.5, Gemini 3 Flash) could reproduce verbatim portions of the gold patch or problem statement when prompted with only a Verified task ID. In the django__django-14725 case, GPT-5.2 demonstrated knowledge of an edit_only parameter that appears in the Django 4.1 release notes but was not derivable from the issue description. In other cases, models emitted exact class and method names, exact early-return conditions, and other implementation details that could only have come from training data.
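One simple way to quantify the kind of verbatim reproduction described above is to measure the longest contiguous run of text that a model's output shares with the gold patch. The sketch below is an illustration only and is not OpenAI's audit method; the function names `longest_shared_run` and `overlap_ratio` and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def longest_shared_run(model_output: str, gold_patch: str) -> str:
    """Return the longest contiguous substring present in both texts."""
    # autojunk=False stops difflib from ignoring very frequent characters,
    # which would otherwise distort matches in long, repetitive source code.
    matcher = SequenceMatcher(None, model_output, gold_patch, autojunk=False)
    match = matcher.find_longest_match(0, len(model_output), 0, len(gold_patch))
    return model_output[match.a : match.a + match.size]


def overlap_ratio(model_output: str, gold_patch: str) -> float:
    """Fraction of the gold patch covered by the longest shared run."""
    if not gold_patch:
        return 0.0
    return len(longest_shared_run(model_output, gold_patch)) / len(gold_patch)


if __name__ == "__main__":
    # Toy strings, not taken from any real task.
    output = "abc XYZ123 def"
    gold = "--XYZ123--"
    print(longest_shared_run(output, gold))  # XYZ123
    print(overlap_ratio(output, gold))       # 0.6
```

A long shared run is only suggestive: short patches and boilerplate code can overlap by chance, so a check like this would need a baseline from tasks the model could not have seen.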
OpenAI built an automated red-teaming setup that tasked GPT-5 with probing GPT-5.2-Chat, Claude Opus 4.5, and Gemini 3 Flash for contamination across each Verified question. The results showed GPT-5.2 could solve 31 tasks the lab classified as "almost impossible to solve" without contamination.
Research by the SWE-bench+ team (OpenLM.ai) found that 32.67% of successful patches on the original benchmark involved solution leakage, where the issue report or its comments contained the solution code or strong hints. In a broader analysis, roughly 60% of resolved instances showed some form of direct or indirect solution leakage. When these problematic instances were filtered out, SWE-agent + GPT-4's resolution rate dropped from 12.47% to 3.97%, roughly a three-fold reduction.
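The size of that drop can be checked directly from the two reported rates. The snippet below is a small arithmetic sketch; the helper names `fold_reduction` and `relative_drop` are hypothetical and exist only for this illustration.

```python
def fold_reduction(before_pct: float, after_pct: float) -> float:
    """How many times smaller the rate is after filtering."""
    if after_pct <= 0:
        raise ValueError("after_pct must be positive")
    return before_pct / after_pct


def relative_drop(before_pct: float, after_pct: float) -> float:
    """Fraction of the original rate that was lost."""
    if before_pct <= 0:
        raise ValueError("before_pct must be positive")
    return (before_pct - after_pct) / before_pct


if __name__ == "__main__":
    before, after = 12.47, 3.97  # rates quoted in the text above
    print(round(fold_reduction(before, after), 2))       # 3.14
    print(round(relative_drop(before, after) * 100, 1))  # 68.2
```

In other words, about two thirds of the reported resolution rate disappears once the leaked instances are removed.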
An empirical study ("Are 'Solved Issues' in SWE-bench Really Solved Correctly?", arXiv:2503.15223) found that over 15% of SWE-bench Verified instances have incomplete test patches that allow incorrect or partial solutions to pass. Specifically, 12.5% of passing patches were functionally or semantically incorrect, and 9.82% were incomplete, addressing only part of the issue or lacking necessary error handling. Frameworks like UTBoost and PatchDiff revealed that leaderboard scores may be inflated by 6 to 7 percentage points due to these test inadequacies.
OpenAI's later audit reported even higher rates among the hardest unsolved tasks. In a careful review of 138 problems repeatedly missed by frontier models, more than 60% were unsolvable as stated. In 49 cases the tests were too narrow, rejecting functionally correct submissions, while in 26 cases the tests were too wide, requiring features that were never mentioned in the issue. Roughly 59.4% of the hardest unsolved problems had flawed test cases, so further score gains no longer reliably measured model improvement.
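The counts and percentages above can be related with a few lines of arithmetic. This sketch assumes that the 59.4% figure refers to the same 138 audited problems, which the text implies but does not state outright; the constant and function names are hypothetical.

```python
AUDITED_PROBLEMS = 138    # problems reviewed in the audit
TOO_NARROW = 49           # tests rejecting functionally correct patches
TOO_WIDE = 26             # tests requiring features absent from the issue
FLAWED_SHARE_PCT = 59.4   # reported share with flawed test cases


def pct(count: int, total: int) -> float:
    """Percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


narrow_pct = pct(TOO_NARROW, AUDITED_PROBLEMS)
wide_pct = pct(TOO_WIDE, AUDITED_PROBLEMS)
combined_pct = pct(TOO_NARROW + TOO_WIDE, AUDITED_PROBLEMS)
# Number of problems implied by the 59.4% figure (assumption: same base of 138).
implied_flawed = round(FLAWED_SHARE_PCT / 100 * AUDITED_PROBLEMS)

if __name__ == "__main__":
    print(narrow_pct, wide_pct, combined_pct)  # 35.5 18.8 54.3
    print(implied_flawed)                      # 82
```

Under that assumption, the two named categories cover 75 problems, so the remaining handful implied by the 59.4% figure would have to come from flaw types not itemized here.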
The SWE-bench leaderboard relies heavily on self-reported results. As of early 2026, none of the 77 entries on the Verified leaderboard had been independently verified. This creates room for cherry-picking favorable configurations or evaluation conditions. Independent reproduction efforts (SWE-rebench, OpenHands trials) have at times found significant discrepancies between reported and reproduced scores.
Performance on SWE-bench depends heavily on the agent scaffolding wrapped around the underlying model, not just the model itself. The same model can achieve very different scores depending on how it explores the repository, how many attempts it gets, and how it uses tool calls. This makes fair comparison hard unless the scaffolding is standardized, which Pro's SEAL evaluation partially addresses.
The original SWE-bench, Verified, and Lite are Python-only. Performance does not necessarily generalize to Java, C++, TypeScript, or Go. The 12 repositories, while popular, are all open-source Python projects with strong test cultures, narrow compared with the full software ecosystem. Many real-world codebases have sparse test coverage, proprietary dependencies, or architectural patterns not represented in these repositories. With Django at 37% to 46% of tasks depending on the variant, scores are disproportionately influenced by one specific codebase. As a result, high SWE-bench scores may not predict performance on arbitrary production codebases. Multilingual, Multi-SWE-bench, and Live partially address the language gap but remain less widely adopted than Verified.
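Repository concentration of this kind can be measured from the task identifiers alone. The sketch below assumes every instance ID follows the `owner__repo-number` pattern seen in `django__django-14725`; the helper names and the sample IDs are hypothetical and do not correspond to real tasks.

```python
from collections import Counter


def repo_of(instance_id: str) -> str:
    """Map an ID such as 'django__django-14725' to 'django/django'."""
    # Split off the trailing issue number first, so hyphens inside the
    # repository name (e.g. 'scikit-learn') are left intact.
    slug, _, _number = instance_id.rpartition("-")
    owner, _, repo = slug.partition("__")
    return f"{owner}/{repo}"


def repo_shares(instance_ids: list[str]) -> dict[str, float]:
    """Return each repository's fraction of the task list, largest first."""
    counts = Counter(repo_of(i) for i in instance_ids)
    total = sum(counts.values())
    return {repo: n / total for repo, n in counts.most_common()}


if __name__ == "__main__":
    # Toy IDs for illustration only.
    ids = [
        "django__django-1",
        "django__django-2",
        "sympy__sympy-3",
        "scikit-learn__scikit-learn-4",
    ]
    for repo, share in repo_shares(ids).items():
        print(f"{repo}: {share:.0%}")
```

Reporting scores per repository alongside the aggregate makes it easier to see how much of a headline number comes from a single codebase.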
Epoch AI's analysis showed that the majority of SWE-bench Verified tasks are relatively simple: 91% can be completed by a human in under one hour, and the median gold patch changes only a handful of lines of code. This means that the benchmark primarily measures an agent's ability to fix straightforward bugs rather than tackle complex architectural challenges or large feature implementations.
Running a full SWE-bench evaluation is resource-intensive, requiring at minimum 120 GB of free disk space, 16 GB of RAM, and 8 CPU cores (with 32 GB RAM recommended for parallel execution). The Docker image build process and per-instance container setup add significant overhead. Cloud evaluation through Modal reduces the hardware burden but introduces monetary costs.
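A quick preflight check against these minimums can save a failed run. The sketch below is not part of the SWE-bench tooling; the function names are hypothetical, disk space is counted in decimal gigabytes, and RAM is passed in by the caller because the Python standard library has no portable way to query it.

```python
import os
import shutil

# Minimums quoted in the text above.
MIN_FREE_DISK_GB = 120
MIN_RAM_GB = 16
MIN_CPU_CORES = 8


def unmet_requirements(free_disk_gb: float, ram_gb: float, cpu_cores: int) -> list[str]:
    """Return a human-readable list of requirements the machine fails."""
    problems = []
    if free_disk_gb < MIN_FREE_DISK_GB:
        problems.append(f"free disk {free_disk_gb:.0f} GB < {MIN_FREE_DISK_GB} GB")
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM {ram_gb:.0f} GB < {MIN_RAM_GB} GB")
    if cpu_cores < MIN_CPU_CORES:
        problems.append(f"CPU cores {cpu_cores} < {MIN_CPU_CORES}")
    return problems


def probe_disk_and_cpu(path: str = ".") -> tuple[float, int]:
    """Measure free disk space (GB) at path and the logical CPU count."""
    free_disk_gb = shutil.disk_usage(path).free / 1e9
    cpu_cores = os.cpu_count() or 0  # cpu_count() may return None
    return free_disk_gb, cpu_cores


if __name__ == "__main__":
    free_gb, cores = probe_disk_and_cpu()
    # ram_gb=16 is a placeholder; supply the real value for your machine.
    for problem in unmet_requirements(free_gb, ram_gb=16, cpu_cores=cores):
        print(problem)
```

Note that the disk check should point at the filesystem where Docker stores its images, which is not necessarily the current directory.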
With multiple labs reporting Verified scores in the 80s and Anthropic posting an internal 93.9% with Claude Mythos Preview, the benchmark has clearly entered the saturation regime. Saturation does not mean coding is solved; it means the gap between the best models and the score ceiling has shrunk to within the noise of the harness. OpenAI's deprecation post acknowledged this directly, framing the move to Pro as a response to saturation rather than a critique of the benchmark's design.
On February 23, 2026, OpenAI's Mia Glaese and Olivia Watkins published a blog post announcing that the lab would no longer report SWE-bench Verified scores for frontier model releases and would instead recommend SWE-bench Pro as the new community standard. Three findings drove the decision.
First, the unsolved tasks contained a high proportion of broken specifications. In its audit of 138 problems repeatedly missed by frontier models, OpenAI found that more than 60% were unsolvable as stated. In 49 cases the tests were too narrow, rejecting functionally correct submissions, and in 26 cases they were too wide, requiring features that were never mentioned in the issue. Roughly 59.4% of the hardest unsolved problems had flawed test cases.
Second, every frontier model OpenAI tested could reproduce verbatim portions of the gold patch when prompted with a task ID, evidence of training-data leakage. Models implicated in the audit included GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash, all of which appeared to have memorized parts of the dataset. In one striking example, GPT-5.2's chain-of-thought traces revealed knowledge of unspecified test requirements, suggesting the test patches had appeared somewhere in training data.
Third, the benchmark was simply saturating. With public scores in the high 80s and internal scores in the 90s, headroom no longer justified continued reporting, especially when the upper bound appeared to be limited by harness errors rather than capability gaps.
OpenAI's recommendation was to migrate to SWE-bench Pro, which uses 1,865 longer tasks across more diverse codebases and showed substantially less contamination evidence in their tests. Several other labs (Google DeepMind, Anthropic, Meta) continued to publish Verified scores after the announcement but began including Pro numbers as well. The joint reporting pattern is likely to persist through 2026 as the community converges on a successor.
SWE-bench fundamentally changed how the AI community thinks about code generation evaluation. Before 2023, the field was dominated by function-level benchmarks like HumanEval and MBPP, which test a narrow and somewhat artificial skill. SWE-bench showed that real software engineering is qualitatively different from writing isolated functions, and it provided the first rigorous way to measure progress on the harder task.
Several specific impacts stand out:
Agentic evaluation. SWE-bench was one of the first benchmarks to require an agent (not just a model) to interact with an environment over multiple steps. This pushed the field toward building and evaluating AI agents rather than just language models. Most modern "agent" research traces its evaluation lineage back to SWE-bench.
Industry adoption as a marketing standard. SWE-bench scores became a standard metric in press releases and marketing materials for AI coding products. When Cognition launched Devin in March 2024 with a SWE-bench score of 13.86%, it generated enormous attention. When Claude 3.5 Sonnet hit 49% in October 2024, it was headline news. The benchmark gave companies a concrete, legible number to compete on. Every Anthropic, OpenAI, Google DeepMind, DeepSeek, MiniMax, Moonshot, and Alibaba release through 2025 and 2026 has reported a Verified number alongside its other coding benchmarks.
Research direction. SWE-bench influenced research priorities by highlighting that code generation was only part of the problem. Repository understanding, long-context reasoning, multi-file editing, and test-driven development became active research areas largely because SWE-bench made them measurable. The agent scaffolding ecosystem (SWE-agent, OpenHands, Aider, Auggie, Cursor's agent mode, Claude Code) grew up explicitly around SWE-bench-style tasks.
Investment and product development. Venture capital firms have referenced SWE-bench leaderboard positions when evaluating AI coding startups. Companies building AI coding assistants (Cursor, GitHub Copilot, Augment Code, Codeium, Replit Agent, OpenHands, Aider) use SWE-bench as a north-star metric for agent capability, driving investment in agentic features beyond simple code completion. Enterprise buyers assessing AI coding assistants for internal use often run SWE-bench evaluations as part of their procurement process.
Benchmark evolution. The trajectory from SWE-bench to Lite to Verified to Multimodal to Multilingual to Live to Pro illustrates a broader pattern in AI evaluation: benchmarks get saturated or contaminated, prompting harder and cleaner successors. Each iteration has pushed the frontier of what AI coding evaluation means, and the same pattern has played out across MMLU, GPQA, and other knowledge benchmarks.
Academic impact. The original paper was selected for oral presentation at ICLR 2024, one of the top machine learning conferences, reflecting its significance to the field. It has been cited in hundreds of research publications, and it spawned an entire family of related benchmarks (Verified, Lite, Multimodal, Multilingual, Live, Pro) plus third-party derivatives like SWE-bench+ and SWE-rebench.
SWE-bench occupies a particular slice of the coding evaluation landscape. The table below contrasts it with other widely cited benchmarks.
| Benchmark | Granularity | Repo context | Agent loop | Languages | Typical task length |
|---|---|---|---|---|---|
| HumanEval | Function | None | No | Python | < 1 minute |
| MBPP | Function | None | No | Python | < 1 minute |
| Codeforces | Algorithm | Problem only | No | Multiple | Minutes to hours |
| CodeContests | Algorithm | Problem only | No | Multiple | Minutes |
| LiveCodeBench | Algorithm (timed) | Problem only | No | Python | Minutes |
| BigCodeBench | Function with libraries | Single file | No | Python | Minutes |
| RepoBench | Line completion | Repository | No | Python, Java | Seconds |
| CrossCodeEval | Cross-file completion | Repository | No | Py, Java, C#, TS | Seconds |
| SWE-bench | Issue resolution | Full repository | Yes | Python | 5 to 60 minutes |
| SWE-bench Pro | Issue resolution | Full repository | Yes | Multiple | 1 to 4+ hours |
| SWE-Lancer | Freelance project | Full repository, real money | Yes | Multiple | Minutes to days |
| Aider Polyglot | Whole-file edits | Single problem file | Limited | 6 languages | Minutes |
| RealCode | Multi-step coding | Curated repo | Yes | Multiple | Minutes to hours |
The contrast with HumanEval is the cleanest illustration of how the field has matured. HumanEval asks: "Given this docstring, write the function body." SWE-bench asks: "Given this issue, find the relevant code in a 50,000-file repository, understand why a test is failing, write a multi-file patch, and verify it passes the test suite without breaking anything else." The first is solved by autocomplete-quality code generation. The second requires planning, navigation, and self-correction, which is why SWE-bench became the canonical agent benchmark while HumanEval is now mostly a smoke test.
Launched by OpenAI in February 2025, SWE-Lancer evaluates models on more than 1,400 real freelance software engineering tasks scraped from Upwork and verified by Expensify, with payouts totaling roughly $1 million in real dollars. Tasks range from $50 bug fixes to $32,000 feature implementations and are split between individual contributor (IC) work and managerial decisions over technical proposals. Initial scores were modest: Claude 3.5 Sonnet completed 26.2% of IC tasks and 44.9% of managerial tasks, while OpenAI's o1 reached 20.3% and 46.3% respectively. The benchmark introduced the unusual practice of mapping AI capability to dollars earned, a metric that resonates outside academic circles.
Subsequent variants, SWE-Lancer Diamond and Pro, focus on harder Upwork tasks that took the original human freelancers more than ten hours, with stricter test harnesses and human review of edge cases. By April 2026, top frontier scores on the Diamond split had climbed into the 30s but still trailed the headline numbers reported on Verified.
LiveSWEBench and SWE-rebench follow the same general philosophy as SWE-bench Live: refresh the dataset with newer issues to outpace training cutoffs. They differ in repository selection, harness design, and update cadence. Both have published their own leaderboards through 2025 and 2026.
Several companies have built proprietary internal versions of SWE-bench using their own codebases. Anthropic, Google, and Meta have all alluded to such sets in technical reports. Scale AI's SWE-bench Pro includes a private split that customers can run against frontier models to generate scores that are not subject to public leakage. The pattern points toward a future where public benchmarks function as broad capability checkpoints while private, contamination-resistant evaluations drive enterprise procurement.
LiveCodeBench, BigCodeBench, RealCode, and Codeforces-style competitive coding sit alongside SWE-bench in the typical model release scorecard. Most frontier announcements through 2025 and 2026 report at least three of: SWE-bench Verified, Aider Polyglot, LiveCodeBench, and a Codeforces Elo score. Together they paint a picture across function-level competence, multi-file edits, real-world issues, and competitive algorithmic skill.
| Benchmark | Focus | Size | Languages |
|---|---|---|---|
| HumanEval | Function-level code completion | 164 problems | Python |
| MBPP | Basic Python programming | 974 problems | Python |
| CodeContests | Competitive programming | 13,328 problems | Multiple |
| DS-1000 | Data science coding | 1,000 problems | Python |
| RepoEval | Repository-level code completion | 1,600 problems | Python |
| CrossCodeEval | Cross-file code completion | 9,928 problems | Python, Java, C#, TypeScript |
| LiveCodeBench | Contamination-resistant competitive coding | Continuously updated | Python |
| SWE-bench+ | SWE-bench with leakage removed | Filtered subset | Python |
| SWE-rebench | Dynamic, decontaminated evaluation | Continuously updated | Python |
| SWE-Lancer | Freelance software work, dollar-weighted | 1,400+ tasks | Multiple |
| GAIA | General AI assistants | 466 questions | Multiple |
| AgentBench | LLM agent capabilities | 8 environments | Multiple |
The SWE-bench team and independent researchers continue to extend the benchmark to additional programming languages. SWE-bench Multilingual already covers 9 languages, and Multi-SWE-bench adds more annotated instances for Java, TypeScript, and other ecosystems. Future work may include the languages underlying Python's ML stack (C and CUDA extensions), mobile development languages (Swift, Kotlin), and systems programming languages.
SWE-bench Pro and similar efforts aim to move beyond simple bug fixes toward tasks that require deeper reasoning: large refactoring operations, security vulnerability remediation, performance optimization across multiple modules, and feature implementation that touches dozens of files. These harder distributions provide a more meaningful signal as top agents approach saturation on the Verified set.
SWE-bench Live's monthly update cadence represents a shift toward continuous benchmarking that keeps pace with model training cutoffs. This approach may become the standard for preventing data contamination, with evaluation sets that always contain fresh, unseen issues.
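The core idea of a refreshed benchmark, keeping only issues created after a model's training cutoff, can be sketched in a few lines. This is an illustration rather than the SWE-bench Live pipeline; the `created_at` field name, its ISO date format, and the sample tasks are assumptions made for the example.

```python
from datetime import date


def fresh_tasks(tasks: list[dict], training_cutoff: date) -> list[dict]:
    """Keep only tasks whose issue was created after the training cutoff."""
    return [
        task
        for task in tasks
        # Assumes 'created_at' holds an ISO date string such as '2025-09-01'.
        if date.fromisoformat(task["created_at"]) > training_cutoff
    ]


if __name__ == "__main__":
    # Toy tasks for illustration only.
    tasks = [
        {"id": "task-a", "created_at": "2025-01-15"},
        {"id": "task-b", "created_at": "2025-09-01"},
    ]
    cutoff = date(2025, 6, 1)  # hypothetical model training cutoff
    print([t["id"] for t in fresh_tasks(tasks, cutoff)])  # ['task-b']
```

Because each model has its own cutoff, the evaluable subset differs from model to model, which is one reason continuously refreshed leaderboards report the time window alongside each score.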
Future benchmarks may extend beyond isolated issue resolution to evaluate agents across the full software development lifecycle: writing design documents, creating pull requests, responding to code review feedback, managing CI/CD pipelines, and triaging incoming bug reports. SWE-Lancer's freelance simulation is one early step in this direction.
Current benchmarks measure fully autonomous performance, but most practical deployments involve human-AI collaboration. Future evaluation frameworks may measure how effectively an agent assists a human developer rather than replacing them entirely, capturing metrics like time saved, suggestion acceptance rate, and code quality improvement.