SWE-bench

Updated Jul 14, 2026


Sources

38 citations

Revision

v10 · 10,043 words


SWE-bench
Overview
Full name	Software Engineering Benchmark
Abbreviation	SWE-bench
Description	A benchmark for evaluating large language models and AI agents on real-world software engineering tasks from GitHub
Release date	2023-10-10
Latest variant	SWE-bench Live (monthly refresh), SWE-bench Pro (Scale AI)
Authors	Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan
Organization	Princeton University, University of Chicago, Stanford University
Technical details
Type	Software Engineering, Code Generation, Bug Fixing
Modality	Text, Code
Task format	Issue resolution, Code editing
Number of tasks	2,294
Total examples	2,294 (Full), 500 (Verified), 300 (Lite), 619 (Multimodal), 1,565 (Live), 1,865 (Pro)
Evaluation metric	% Resolved, Test Pass Rate
Domains	Software Engineering, Python Programming, Open Source Development
Languages	Python (Original/Verified/Lite), JavaScript (Multimodal), 9+ languages (Multilingual/Live/Pro)
Performance
Baseline (Oct 2023)	1.96% (Claude 2 with BM25)
SOTA (Verified, public)	88.7% (GPT-5.5)
SOTA (Verified, restricted)	93.9% (Claude Mythos Preview, Project Glasswing only)
SOTA (Pro public)	64.3% (Claude Opus 4.7)
SOTA date	2026-04
Saturated	Yes (per OpenAI February 2026)
Resources
Website	Official website
Paper	arXiv:2310.06770
GitHub	Repository
Dataset	Hugging Face download
License	MIT License

SWE-bench (Software Engineering Benchmark) is an evaluation framework that tests whether AI systems can resolve real-world software engineering tasks drawn from actual GitHub issues and pull requests. It was introduced in October 2023 by Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan, with most of the team based at Princeton University and a collaborator at the University of Chicago. The original dataset contains 2,294 task instances collected from 12 popular open-source Python repositories, including Django, SymPy, scikit-learn, and Matplotlib. Given a codebase plus an issue description, a model must edit the repository so that the project's hidden test suite passes.^[1] The benchmark was deliberately hard at launch: the original paper reported that "the best-performing model, Claude 2, is able to solve a mere 1.96% of the issues," a single number that defined how far autonomous coding had to go.^[1]

Unlike earlier code generation benchmarks such as HumanEval and MBPP, which test whether a model can write a single function from a docstring, SWE-bench asks an AI agent to read a real bug report or feature request, navigate a codebase with thousands of files, identify the relevant locations, write a patch, and have that patch pass the project's hidden test suite. The benchmark is graded automatically by running the patch against unit tests, so the AI either resolves the issue or it does not. There is no partial credit for nice-looking code that fails the tests.

The SWE-bench paper, titled "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?", was published at the International Conference on Learning Representations (ICLR) 2024 with an oral presentation, one of the highest distinctions at that venue.^[1] The benchmark quickly became a fixed point on the AI capability map: by late 2024 every frontier coding model release reported a SWE-bench number, and by 2025 the SWE-bench Verified subset had become the de facto standard line in nearly every model release announcement from Anthropic, OpenAI, Google, DeepSeek, Moonshot, Alibaba, and Meta. With over 2 million downloads from Hugging Face and adoption by leading AI research organizations worldwide, the benchmark has shaped how the industry measures progress in autonomous coding.

Progress on SWE-bench has been one of the steepest capability climbs in AI benchmark history. The original 2023 paper reported a peak resolution rate of 1.96% for Claude 2 with BM25 retrieval, leading the authors to conclude that frontier models could only solve the simplest issues.^[1] By mid-April 2026, Claude Opus 4.7 reached 87.6% on SWE-bench Verified and GPT-5.5 overtook it a week later at 88.7%^[32], a roughly 45-fold improvement from baseline in two and a half years. Anthropic's restricted-access Claude Mythos Preview reached 93.9% but is not publicly available; it is offered only through Project Glasswing, an invitation-only partner program for around 50 critical-infrastructure security teams.^[33] The pace of gains, combined with growing evidence of training-data leakage and brittle test cases, prompted OpenAI to formally deprecate Verified for frontier evaluations in February 2026 and to recommend the harder SWE-bench Pro variant instead.^[20]

What is SWE-bench?

SWE-bench is an automatic, test-based benchmark that measures whether an AI agent can fix a real GitHub issue inside a full software repository. The official project describes it in one line: "SWE-bench tests AI systems' ability to solve GitHub issues."^[3] Each of the 2,294 task instances pairs a genuine issue report with the merged pull request that resolved it; the agent is scored only by whether its patch makes the project's FAIL_TO_PASS tests pass without breaking the PASS_TO_PASS regression tests.^[1] In contrast with function-completion benchmarks, a SWE-bench task can require coordinating changes across multiple functions, classes, and files, and operating inside a working tree of 5,000 to 50,000 source files. That end-to-end, repository-level structure is why SWE-bench, rather than HumanEval, became the canonical benchmark for autonomous coding agents.

Background

Before SWE-bench, AI code generation benchmarks were almost entirely function-level. HumanEval, introduced by OpenAI in 2021, asked models to complete 164 short standalone Python functions. MBPP (Mostly Basic Python Problems) followed a similar pattern with about 1,000 simple problems. Both were saturating fast: by 2023 frontier models were scoring above 80% on HumanEval, and the field needed a harder question.

Anyone who had used an AI coding assistant on a real software project in 2023 already knew the gap was huge. Writing a function from a clean docstring is one thing; opening a 50,000-file repository, reading a bug report that might or might not include a stack trace, hunting down the responsible code, and shipping a fix that passes a project's existing test suite is another. The skills do not transfer cleanly. A model that scores 90% on HumanEval might fail on the simplest real PR in a Django ticket queue.

The Princeton team's insight was that open-source repositories on GitHub already contain a complete, naturally occurring record of software engineering work in the form of issues paired with the merged pull requests that resolve them. Each such pair is essentially a software engineering problem with both a known good solution (the merged diff) and an automatic grading rubric (the test cases the merged PR added or modified). By collecting these pairs and replaying them in controlled environments, the team could build a benchmark that mirrors what professional developers actually do, with no manual labeling and no synthetic problems.

The paper posed the question directly in its title: can large language models resolve real-world GitHub issues? At the time of release, the answer was sobering. Claude 2, the strongest model the team tested, resolved 1.96% of tasks when given files retrieved through BM25, a standard information retrieval algorithm.^[1] With oracle retrieval, where the model received the exact files that needed editing, performance topped out at 4.80%.^[1] Fine-tuned variants of CodeLlama-7B and CodeLlama-13B (branded SWE-Llama) performed comparably or slightly worse despite training on the SWE-bench-train companion set. The picture in late 2023 was clear: frontier LLMs understood Python syntax but struggled to operate inside production repositories. As the authors put it, "both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues."^[1]

Methodology

Repository selection

The SWE-bench team chose 12 popular, actively maintained, well-tested Python repositories spanning different software domains. The repositories were picked for size, maintenance quality, and the existence of comprehensive test suites that could serve as automated graders.

Repository	Domain	Task instances	Share of dataset
django/django	Web framework	850	37.1%
sympy/sympy	Symbolic mathematics	386	16.8%
scikit-learn/scikit-learn	Machine learning	229	10.0%
sphinx-doc/sphinx	Documentation generator	187	8.2%
matplotlib/matplotlib	Data visualization	184	8.0%
pytest-dev/pytest	Testing framework	119	5.2%
pydata/xarray	Labeled array data	110	4.8%
astropy/astropy	Astronomy library	95	4.1%
pylint-dev/pylint	Code linter	57	2.5%
psf/requests	HTTP library	44	1.9%
mwaskom/seaborn	Statistical visualization	22	1.0%
pallets/flask	Web microframework	11	0.5%
Total		2,294	100%

Django alone accounts for 37% of all tasks, and the top three repositories (Django, SymPy, and scikit-learn) account for nearly 64%. This concentration matters for evaluation: a model's SWE-bench score is disproportionately shaped by its ability to work with Django's codebase, conventions, and testing patterns. The skew is even more pronounced in SWE-bench Verified, where Django climbs to 46.2% of tasks. A model that is excellent at Django but mediocre everywhere else can post a misleadingly high score. Flask contributes the fewest instances (11) due to its smaller codebase and less frequent issue activity. The 12 repositories together contain hundreds of thousands of files; even after retrieval, an agent typically operates inside a working tree of 5,000 to 50,000 source files.
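The concentration figures can be reproduced directly from the table. The sketch below recomputes the repository shares and then shows how a skewed per-repository profile feeds into the headline number. The agent's per-repository resolve rates are hypothetical values chosen only for illustration.

```python
# Task counts per repository, copied from the table above.
TASKS_PER_REPO = {
    "django/django": 850,
    "sympy/sympy": 386,
    "scikit-learn/scikit-learn": 229,
    "sphinx-doc/sphinx": 187,
    "matplotlib/matplotlib": 184,
    "pytest-dev/pytest": 119,
    "pydata/xarray": 110,
    "astropy/astropy": 95,
    "pylint-dev/pylint": 57,
    "psf/requests": 44,
    "mwaskom/seaborn": 22,
    "pallets/flask": 11,
}


def repo_shares(counts):
    """Return each repository's share of the dataset, in percent."""
    total = sum(counts.values())
    return {repo: 100.0 * n / total for repo, n in counts.items()}


def blended_score(per_repo_rate, counts):
    """Overall % resolved implied by per-repository resolve rates (0..1)."""
    total = sum(counts.values())
    resolved = sum(per_repo_rate[repo] * n for repo, n in counts.items())
    return 100.0 * resolved / total


shares = repo_shares(TASKS_PER_REPO)
top3 = sum(sorted(shares.values(), reverse=True)[:3])
print(f"Django share: {shares['django/django']:.1f}%")
print(f"Top-3 share:  {top3:.1f}%")

# Hypothetical agent: 60% on Django, 20% on every other repository.
rates = {
    repo: (0.60 if repo == "django/django" else 0.20)
    for repo in TASKS_PER_REPO
}
blended = blended_score(rates, TASKS_PER_REPO)
print(f"Blended score: {blended:.1f}%")
```

In this made-up profile the agent resolves only one in five issues outside Django, yet its overall score lands near 35%, which is the distortion the paragraph above describes.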

Task instance creation

Each task instance is derived from a pull request that resolves one or more GitHub issues. The construction pipeline filtered roughly 90,000 PRs scraped from these 12 repositories down to 2,294 high-quality instances using these stages:

Pull request collection. The team scraped merged PRs from each repository that explicitly referenced resolved issues (typically through phrases like "closes #1234" or "fixes #1234").
Test identification. For each PR, the pipeline identifies FAIL_TO_PASS tests, which fail on the pre-fix codebase but pass after the fix is applied. These tests act as the validators that the bug was actually fixed.
Regression test extraction. The pipeline also identifies PASS_TO_PASS tests that pass both before and after the fix. These ensure that a candidate solution does not break unrelated functionality.
Environment snapshotting. Each task records the exact repository commit state before the fix, plus the dependency versions and Python version needed to reproduce the environment.
Validation. Every task instance is validated by running both FAIL_TO_PASS and PASS_TO_PASS tests against the gold patch (the actual PR diff) to confirm that the tests behave as expected.
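The test identification and regression extraction stages amount to comparing test outcomes before and after the gold patch. The following is a minimal sketch of that comparison, not the project's actual pipeline code. The function name and test names are hypothetical, and it assumes that a test which does not exist before the fix counts as failing.

```python
def classify_tests(before, after):
    """Split tests into FAIL_TO_PASS and PASS_TO_PASS.

    before / after map test name -> True if the test passed on the
    pre-fix and post-fix codebase respectively. A test missing from
    `before` (for example, one added by the PR) is treated as failing
    before the fix; that is an assumption of this sketch.
    """
    fail_to_pass = sorted(
        name for name, ok in after.items()
        if ok and not before.get(name, False)
    )
    pass_to_pass = sorted(
        name for name, ok in after.items()
        if ok and before.get(name, False)
    )
    return fail_to_pass, pass_to_pass


# Hypothetical test outcomes for one pull request.
before = {
    "test_existing_a": True,
    "test_existing_b": True,
    "test_bug_repro": False,
}
after = {
    "test_existing_a": True,
    "test_existing_b": True,
    "test_bug_repro": True,
    "test_new_edge_case": True,  # added by the PR itself
}

f2p, p2p = classify_tests(before, after)
print("FAIL_TO_PASS:", f2p)
print("PASS_TO_PASS:", p2p)
```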

The team also released SWE-bench-train, a companion training set with about 19,000 non-testing instances drawn from 37 repositories, which gives researchers a larger pool for fine-tuning experiments without contaminating the evaluation set.

A key property of the construction is that every issue is paired with a real human solution. This gold patch defines what "correct" means at the file level (which files were changed, how many lines were added or removed) and at the behavioral level (which tests now pass). Because a known-good human solution exists, SWE-bench can score agents end-to-end without relying on stylistic similarity metrics like BLEU. The unit tests are the judge.
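The file-level view of a gold patch (which files changed, how many lines were added or removed) can be read straight off the diff. The sketch below assumes a git-style unified diff with `diff --git` headers. The helper name and the example patch are hypothetical and are not taken from the dataset.

```python
def patch_stats(diff_text):
    """Count changed files and added/removed lines in a git-style diff."""
    files, added, removed = set(), 0, 0
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            # A new file section starts; its ---/+++ headers are not hunks.
            in_hunk = False
            files.add(line.split(" b/", 1)[-1])
        elif line.startswith("@@"):
            in_hunk = True
        elif in_hunk and line.startswith("+"):
            added += 1
        elif in_hunk and line.startswith("-"):
            removed += 1
    return {"files": sorted(files), "added": added, "removed": removed}


# Hypothetical one-file patch used only to exercise the helper.
example_diff = "\n".join([
    "diff --git a/pkg/util.py b/pkg/util.py",
    "--- a/pkg/util.py",
    "+++ b/pkg/util.py",
    "@@ -1,2 +1,3 @@",
    " def clamp(x, lo, hi):",
    "-    return max(lo, x)",
    "+    # Respect the upper bound as well.",
    "+    return max(lo, min(x, hi))",
])

stats = patch_stats(example_diff)
print(stats)
```

These counts describe the shape of a change only. Correctness is still decided by the tests, as the paragraph above notes.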

Evaluation harness

SWE-bench uses Docker containers to ensure reproducible and isolated evaluation. Without containerization, comparing scores across labs would be effectively impossible because Python dependency resolution is famously fragile, and a single difference in NumPy or pandas versions can flip which tests pass.

The harness builds images in three layers: a base image with shared dependencies, about 60 environment images covering different repository version combinations, and per-task instance images with task-specific dependency pins. The standard evaluation flow then runs through these steps for each task:

A Docker container is started with the appropriate environment image.
The repository is checked out at the exact commit that precedes the gold fix.
The issue description text is provided to the AI agent or model.
The system produces a proposed code patch, typically in unified diff format.
The patch is applied with git apply or equivalent tooling.
Both FAIL_TO_PASS and PASS_TO_PASS test suites are run against the patched codebase.
A task is marked as resolved only if all FAIL_TO_PASS tests now pass and no PASS_TO_PASS tests are broken.

A patch that addresses the issue but breaks an unrelated test counts as a failure, which discourages over-aggressive refactors. A patch that does not apply cleanly because of formatting drift or whitespace also counts as a failure.
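The resolution rule described above is a strict conjunction and is easy to state in code. This is a minimal sketch of the rule as the article describes it, not the harness implementation. The function signature and test names are hypothetical.

```python
def is_resolved(patch_applied, results, fail_to_pass, pass_to_pass):
    """Apply the all-or-nothing resolution rule to one task.

    results maps test name -> True if the test passed on the patched
    codebase. A test missing from `results` is treated as failed.
    """
    if not patch_applied:
        # A patch that does not apply cleanly is an automatic failure.
        return False
    fixed = all(results.get(t, False) for t in fail_to_pass)
    unbroken = all(results.get(t, False) for t in pass_to_pass)
    return fixed and unbroken


f2p = ["test_bug_repro"]
p2p = ["test_existing_a", "test_existing_b"]

clean_fix = {
    "test_bug_repro": True,
    "test_existing_a": True,
    "test_existing_b": True,
}
fix_with_regression = {
    "test_bug_repro": True,
    "test_existing_a": True,
    "test_existing_b": False,
}

print(is_resolved(True, clean_fix, f2p, p2p))            # True
print(is_resolved(True, fix_with_regression, f2p, p2p))  # False
print(is_resolved(False, clean_fix, f2p, p2p))           # False
```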

Metrics

The primary metric is % Resolved, the percentage of task instances where the agent's patch causes all FAIL_TO_PASS tests to pass without breaking any PASS_TO_PASS tests. Additional metrics tracked by the community include:

Pass@k: the success rate when the agent is allowed k independent attempts per task.
Test pass rate: the fraction of individual test cases passed, offering a more granular view than binary resolve/not-resolve.
Regression rate: how often an agent's patch breaks previously passing tests.
Cost and efficiency: token consumption, API call count, wall-clock time, and financial cost per resolved instance.

Cost reporting has become an increasingly important secondary metric. Two agents at 70% resolution rate are not equivalent if one consumes $0.50 per task and the other $15. Recent agent papers commonly publish per-instance dollar cost alongside the headline accuracy figure.
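The comparison in the paragraph above becomes concrete once cost is normalized by the resolve rate. The helper below is a simple sketch with a hypothetical name. It assumes cost is charged on every attempted task, whether or not the attempt succeeds.

```python
def cost_per_resolved(cost_per_task_usd, resolve_rate):
    """Average spend needed to obtain one resolved instance."""
    if resolve_rate <= 0:
        raise ValueError("resolve_rate must be positive")
    return cost_per_task_usd / resolve_rate


# The two agents from the text: same 70% resolve rate, different prices.
cheap = cost_per_resolved(0.50, 0.70)
pricey = cost_per_resolved(15.00, 0.70)
print(f"${cheap:.2f} vs ${pricey:.2f} per resolved instance")
```

At a 70% resolve rate the two agents work out to roughly $0.71 and $21.43 per resolved instance, a thirtyfold difference that the headline accuracy hides.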

Cloud evaluation and sb-cli

In January 2025 the SWE-bench team added cloud-based evaluation through Modal, removing the need for local Docker infrastructure. Researchers can run evaluations entirely in the cloud by installing the swebench[modal] package and setting the --modal true flag.

For leaderboard submissions, the team released sb-cli, a command-line tool that standardizes the submission process. After authenticating with sb login, researchers submit predictions using sb submit --predictions <path>, and the evaluation runs on centralized infrastructure to ensure consistent and reproducible results.
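A predictions file is typically a list of records, one per task, each carrying the candidate patch. The sketch below writes and re-reads such a file in JSON Lines form. The key names (`instance_id`, `model_name_or_path`, `model_patch`) follow the format commonly documented for the SWE-bench harness, but they should be treated as an assumption here and checked against the current documentation. The instance ID, model name, and patch are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_predictions(path, patches, model_name):
    """Write one JSON record per task instance (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as fh:
        for instance_id, patch in patches.items():
            record = {
                "instance_id": instance_id,        # assumed key name
                "model_name_or_path": model_name,  # assumed key name
                "model_patch": patch,              # assumed key name
            }
            fh.write(json.dumps(record) + "\n")


def read_predictions(path):
    """Load the records back, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Hypothetical patch and instance ID, for illustration only.
example_patch = "\n".join([
    "diff --git a/pkg/util.py b/pkg/util.py",
    "--- a/pkg/util.py",
    "+++ b/pkg/util.py",
    "@@ -1,2 +1,2 @@",
    " def clamp(x, lo, hi):",
    "-    return max(lo, x)",
    "+    return max(lo, min(x, hi))",
    "",
])

out_path = Path(tempfile.mkdtemp()) / "preds.jsonl"
write_predictions(out_path, {"example__repo-1234": example_patch}, "my-agent-v0")
records = read_predictions(out_path)
print(records[0]["instance_id"], len(records[0]["model_patch"].splitlines()))
```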

Context retrieval

The original paper used two retrieval modes for providing code to the model. BM25 retrieval uses the Pyserini BM25 retriever to select relevant files from the repository based on the issue text, with a 27,000-token context budget (measured with OpenAI's cl100k_base tokenizer). In roughly 40% of instances BM25 retrieves a superset of the files that actually need editing, but in nearly half of cases it fails to retrieve any of the needed files. Oracle retrieval gives the model the exact files modified in the gold patch, providing an upper bound on performance when file localization is perfect.
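To make the BM25 mode concrete, here is a small self-contained Okapi BM25 ranker with a greedy token budget. It is a sketch of the general technique, not the Pyserini pipeline used in the paper. The tokenizer is a crude regex stand-in for cl100k_base, the parameters `k1` and `b` are conventional defaults, and the file contents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Crude lowercase tokenizer (a stand-in, not cl100k_base)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Rank document names by Okapi BM25 score against the query."""
    tokenized = {name: tokenize(body) for name, body in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized.values()) / n_docs
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(query))
    scores = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


def select_within_budget(ranked, docs, budget_tokens):
    """Take files in rank order until the token budget would be exceeded."""
    chosen, used = [], 0
    for name in ranked:
        cost = len(tokenize(docs[name]))
        if used + cost > budget_tokens:
            break
        chosen.append(name)
        used += cost
    return chosen


# Hypothetical three-file repository and issue text.
docs = {
    "pkg/util.py": "def clamp(x, lo, hi): return max(lo, x)  # clamp value to bound",
    "pkg/io.py": "def read_text(path): return open(path).read()",
    "docs/index.rst": "Welcome to the example package documentation",
}
ranked = bm25_rank("clamp ignores upper bound", docs)
context = select_within_budget(ranked, docs, budget_tokens=15)
print(ranked[0], context)
```

Because the ranking depends only on word overlap with the issue text, an issue that describes symptoms without naming the responsible module can easily rank the wrong files first, which is consistent with the retrieval misses reported above.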

Modern agent-based approaches have largely moved beyond static retrieval. Today's agents interactively browse the repository, search for definitions, run shell commands, and navigate the codebase. Tools like ripgrep, language-server queries, and AST-aware code search have become standard, and modern context windows of 200K to 1M tokens make it practical to load substantial slices of a repository at once.

Variants

The SWE-bench team and outside contributors have produced a family of related datasets that share the same evaluation harness but differ in size, language coverage, contamination resistance, and difficulty. The table below summarizes the major variants.

Variant	Released	Size	Languages	Notes
SWE-bench (Full)	Oct 2023	2,294	Python	Original benchmark; 12 repositories
SWE-bench Lite	Mar 2024	300	Python	Cheaper functional bug-fix subset
SWE-bench Verified	Aug 2024	500	Python	Human-curated by 93 reviewers; OpenAI collaboration
SWE-bench Multimodal	Oct 2024	619	JavaScript	Issues with images and screenshots
SWE-bench Multilingual	2025	300	9 languages	C, C++, Go, Java, JS, TS, PHP, Ruby, Rust
SWE-bench Live	May 2025	1,565+	8+ languages	Monthly refresh; anti-contamination
Multi-SWE-bench	2025	1,632	7 languages	ByteDance Seed; NeurIPS 2025 D&B
SWE-bench Pro	Sep 2025	1,865	Multiple	Scale AI; long-horizon, commercial codebases
SWE-bench+	Oct 2024	Filtered subset	Python	OpenLM.ai; leakage removed
SWE-rebench	2025	Continuously updated	Python	Decontamination focus
Aider Polyglot	Jul 2024	225	C++, Go, Java, JS, Python, Rust	Aider's instruction-following coding test

SWE-bench Lite

SWE-bench Lite is a 300-instance subset created in March 2024 to make evaluation faster and more accessible. The instances focus on self-contained, functional bug fixes that can be resolved with targeted code changes, making it well-suited for rapid prototyping and iterative agent development. A full evaluation run on Lite takes a fraction of the time of the full benchmark.

Despite its name, Lite is not trivial. As of April 2026, the leading score on SWE-bench Lite was 62.7% by Claude Opus 4.6, with MiniMax M2.5 in second at 56.3%, well short of the 80%+ scores recorded on the larger Verified set.^[36] The gap exists because Lite was constructed from the original 2,294-instance pool and includes some of the same noisy task descriptions that the Verified curation effort later filtered out.

SWE-bench Verified

Released on August 13, 2024, in collaboration with OpenAI Preparedness, SWE-bench Verified contains 500 instances individually reviewed by 93 experienced software developers.^[4] Each task was checked against four criteria: clear problem description, unambiguous solution, adequate test coverage, and reasonable difficulty. By filtering out noisy tasks, Verified gives a cleaner signal of agent capability and quickly became the most widely cited variant. OpenAI stated the motivation plainly: "the unit tests used to evaluate the correctness of a solution are often overly specific, and in some cases are even unrelated to the issue," and "many samples have an issue description that is underspecified, leading to ambiguity."^[4]

The curation process screened 1,699 candidate problems with three independent expert reviews per problem.^[4] About 38.3% of samples were flagged for underspecified problem statements and 61.1% for unit tests that might unfairly mark valid solutions as incorrect.^[4] In total, roughly 68.3% of the screened samples were removed, which leaves about 540 candidates for the curated 500-instance set.^[4] The first official baseline reported by OpenAI on Verified was 33.2% for GPT-4o paired with the Agentless scaffold, which roughly doubled GPT-4o's 16% on the original benchmark and showed that the noisy original set had been understating model capability.^[4]

According to an analysis by Epoch AI, 39% of SWE-bench Verified tasks are "trivial changes" requiring fewer than 15 minutes of human effort, 52% are "small changes" estimated at 15 minutes to one hour, only about 8% fall into the 1-to-4-hour range, and just three instances were estimated to require more than four hours.^[11] Quick fixes average around 5 changed lines of code, while the longer tasks average roughly 50 lines.^[11] Epoch's blunt conclusion was that "the benchmark really tests whether AI can make simple codebase edits."^[11] This task-difficulty profile matters when reading the leaderboard. A model resolving 80% of Verified is solving mostly small, well-specified bugs, not architectural overhauls.

Verified became the default reporting standard for nearly every frontier coding model release between late 2024 and early 2026. Anthropic's Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude Sonnet 4.5, Claude Opus 4.5, Claude Opus 4.6, and Claude Opus 4.7 all reported headline numbers on Verified, as did OpenAI's GPT-4o, o1, o3, GPT-5, GPT-5.4, and GPT-5.5; Google's Gemini 2.5 Pro and Gemini 3 family; DeepSeek-R1 and DeepSeek-V4; MiniMax M2 and M2.5; Moonshot's Kimi K2; Alibaba's Qwen series; and Meta's Llama variants. Reporting persisted even after OpenAI's February 2026 deprecation post: GPT-5.5's launch on April 23, 2026 led with an 88.7% Verified number despite the lab's own recommendation that the community migrate to Pro.^[32]

Why SWE-bench Verified?

The original 2,294-instance set had two well-known problems. First, the test suites for some issues did not actually verify the bug being reported, so a model could pass without truly fixing the issue. Second, some issue descriptions were so vague that even the original human author had needed back-and-forth in the comments to clarify the requirement. These problems combined to inject noise into model rankings and made small score differences hard to interpret.

The Verified curation, undertaken in collaboration with OpenAI Preparedness, addressed both. A pool of 93 software developers reviewed the candidate tasks, with each task labeled on four axes: problem clarity, solution unambiguity, test coverage, and difficulty plausibility.^[4] Tasks failing any axis were removed. The result was a 500-instance set where a passing patch could be confidently interpreted as a real fix, not a lucky overfit to thin tests.

This curation is what made Verified the standard reporting target. From late 2024 through 2025, every frontier model release at Anthropic, OpenAI, Google DeepMind, DeepSeek, Moonshot, Alibaba, Meta, and Microsoft Research reported a Verified number alongside its other coding benchmarks, and product launches for Cursor, Claude Code, Cognition Devin, Amazon Q, and others highlighted Verified scores in their first-day marketing.

SWE-bench Multimodal

Introduced in October 2024 (arXiv:2410.03859), SWE-bench Multimodal extends the benchmark to tasks where the issue report contains visual elements.^[6] The dataset comprises 619 task instances drawn from 17 user-facing JavaScript repositories covering web UI design, data visualization, digital art, and mapping.^[6] Across all task instances, there are 862 images embedded in problem statements, including code screenshots (194), diagrams (107), error messages (54), digital art (38), maps (35), and data visualizations (28). The variant tests whether AI systems can ground visual cues to specific codebase entities, for example recognizing a rendering bug from a screenshot and tracing it to the responsible CSS or JavaScript code.

Multimodal scores have lagged Verified by a wide margin because the task requires both vision and code reasoning, and because the JavaScript repositories use different testing frameworks (Jest, Mocha, Playwright) than the Python set, making patch validation harder.

SWE-bench Multilingual

SWE-bench Multilingual extends evaluation beyond Python to 9 programming languages: C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, and Rust.^[12] It contains 300 tasks across 42 repositories and follows the same collection strategy and evaluation protocol as the original benchmark. The deliberately smaller size keeps evaluations quick to run.

Multi-SWE-bench

Developed by ByteDance Seed and accepted to the NeurIPS 2025 Datasets and Benchmarks track, Multi-SWE-bench is a separate multilingual effort containing 1,632 high-quality instances across Java, TypeScript, JavaScript, Go, Rust, C, and C++.^[13] The instances were carefully annotated from 2,456 candidates by 68 expert annotators, giving it broader coverage than SWE-bench Multilingual at higher annotation quality.^[13]

SWE-bench Live

SWE-bench Live, developed by Microsoft Research and announced in May 2025, addresses data contamination by restricting the dataset to issues created after January 2024.^[7] Because these issues postdate the training cutoffs of most models in circulation when the benchmark launched, they provide a contamination-free signal. The platform now contains 1,565 task instances spanning 164 repositories across Python, C, C++, C#, Java, Go, JavaScript, TypeScript, and Rust, with both Linux and Windows runners.^[7] The dataset is updated monthly, adding 50 newly verified high-quality issues each cycle. A lite subset samples 50 instances per month from October 2024 to March 2025, yielding a compact 300-instance set that balances recency with evaluation efficiency. To guard against test flakiness, the validation process is repeated multiple times and only instances with consistent results across all runs are retained.
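Two mechanics in the paragraph above, the flakiness filter and the lite-subset sampling, are simple enough to sketch. The code below is illustrative only. The function names, the outcome representation (a tuple of passing tests per run), the instance IDs, and the fixed random seed are all assumptions of this sketch.

```python
import random

# The six months covered by the lite subset, per the text above.
MONTHS = ["2024-10", "2024-11", "2024-12", "2025-01", "2025-02", "2025-03"]


def consistent_instances(runs):
    """Keep instances whose validation outcome is identical in every run.

    runs is a list of dicts, one per repeated validation run, mapping
    instance ID -> outcome (any comparable value).
    """
    first = runs[0]
    return sorted(
        iid for iid, outcome in first.items()
        if all(run.get(iid) == outcome for run in runs[1:])
    )


def sample_lite(by_month, per_month=50, seed=0):
    """Sample a fixed number of instances from each month."""
    rng = random.Random(seed)
    subset = []
    for month in MONTHS:
        subset.extend(rng.sample(sorted(by_month[month]), per_month))
    return subset


# Hypothetical repeated runs: inst-2 is flaky and gets dropped.
runs = [
    {"inst-1": ("t1", "t2"), "inst-2": ("t1",)},
    {"inst-1": ("t1", "t2"), "inst-2": ("t1", "t3")},
    {"inst-1": ("t1", "t2"), "inst-2": ("t1",)},
]
stable = consistent_instances(runs)

# Hypothetical pool of 80 verified issues per month.
pool = {m: [f"{m}-issue-{i}" for i in range(80)] for m in MONTHS}
lite = sample_lite(pool)
print(stable, len(lite))
```

Six months at 50 instances each gives the 300-instance lite set described above.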

SWE-bench Pro

Introduced by Scale AI in September 2025 (arXiv:2509.16941), SWE-bench Pro is designed as a more rigorous successor that better reflects real-world software engineering difficulty.^[14] It expands to 1,865 long-horizon tasks across public, held-out, and commercial codebases. The benchmark is partitioned into three subsets:

Subset	Tasks	Repositories	Description
Public	731	11 open-source (GPL-licensed)	Freely accessible for research
Held-out	858	12 repositories	Reserved for leaderboard evaluation
Private (Commercial)	276	18 proprietary startup codebases	Partnership with private companies

The use of GPL-licensed code for the public set is a deliberate contamination-resistance strategy: the strong copyleft license acts as a legal deterrent against including the code in model training data. Tasks are explicitly chosen to require 1 to 4 hours of human effort or more, in contrast with Verified's bias toward sub-hour fixes. Pro also ships a standardized agent scaffold called SEAL (Standardized Evaluation for Agentic LLMs) so that scores reflect model capability rather than scaffolding tricks.

Top models scored around 23% on Pro's public set in early reports.^[14] By April 2026 the leaders had climbed into the mid-60s, well below their Verified scores in the high 80s.^[19] The gap is widely cited as evidence that Verified scores are no longer informative for frontier evaluation.

SWE-bench+ and SWE-rebench

SWE-bench+ is an OpenLM.ai effort that filters the original benchmark to remove instances showing signs of solution leakage.^[16] Its analyses, discussed in the criticism section below, helped seed the broader contamination conversation. SWE-rebench is a separate contamination-resistant evaluation platform that publishes its own continuously updated leaderboard and uses different sampling strategies than SWE-bench Live.^[38]

Aider Polyglot

Although not part of the SWE-bench family proper, the Aider Polyglot benchmark is the closest cousin in spirit. It uses 225 of the hardest Exercism problems across C++, Go, Java, JavaScript, Python, and Rust to test instruction-following whole-file edits. Polyglot scores are commonly reported alongside SWE-bench numbers in model release announcements, particularly by labs that want to highlight multi-language coding ability without committing to the heavier infrastructure required for SWE-bench Pro.

Leaderboard and score progression

How have SWE-bench scores changed over time?

The headline figures across SWE-bench variants from 2023 to 2026 trace one of the steepest capability climbs in any AI benchmark. The numbers below come from official model release announcements or the SWE-bench leaderboard, not press headlines.

Date	Best % Resolved	Leading model / agent	Variant	Significance
Oct 2023	1.96%	Claude 2 (BM25 retrieval)	Full	Benchmark release; Princeton baseline
Mar 2024	13.86%	Devin (Cognition Labs)	Lite	First commercial agent above 10%
Apr 2024	12.47%	SWE-agent + GPT-4	Full	Open-source agent baseline (NeurIPS 2024)
Aug 2024	33.2%	GPT-4o + Agentless	Verified	OpenAI launches Verified subset
Oct 2024	49.0%	Claude 3.5 Sonnet (new)	Verified	First model to reach 49%
Dec 2024	53.0%	OpenAI o1 + scaffold	Verified	Reasoning models gain ground
Jan 2025	49.2%	DeepSeek-R1	Verified	First open-weights model in the 49% range
Feb 2025	70.3%	Claude 3.7 Sonnet	Verified	Extended thinking helps
May 2025	72.5% / 72.7%	Claude Opus 4 / Sonnet 4	Verified	New SOTA from Anthropic
Aug 2025	74.5%	Claude Opus 4.1	Verified	Focused upgrade
Aug 2025	74.9%	GPT-5	Verified	OpenAI's first GPT-5 number
Sep 2025	77.2%	Claude Sonnet 4.5	Verified	Sonnet tier passes Opus 4
Nov 2025	80.9%	Claude Opus 4.5	Verified	First confirmed 80%+ score
Nov 2025	76.2%	Gemini 3 Pro	Verified	Google's first Gemini 3 number
Feb 2026	85.0%	GPT-5.3 Codex	Verified	OpenAI peaks before deprecation
Apr 16, 2026	87.6%	Claude Opus 4.7	Verified	Anthropic briefly retakes top spot
Apr 23, 2026	88.7%	GPT-5.5	Verified	OpenAI returns to #1 one week later
Apr 2026	93.9%	Claude Mythos Preview	Verified	Restricted release via Project Glasswing

The trajectory is striking. In about thirty months the best public score rose from 1.96% to 88.7%, roughly 45-fold. Three concurrent advances drove the climb: stronger base models, better agent scaffolding, and richer tool use. Each leap in the table can usually be attributed to one of those three factors. The Devin number in March 2024 came mostly from agent scaffolding (the underlying model was GPT-4 or Claude class). The Claude 3.5 Sonnet jump from 33% to 49% in October 2024 came largely from the model itself. The 70% to 80% climb across 2025 came from a mix of all three plus the rise of explicit reasoning chains in OpenAI o1, OpenAI o3, and Claude's extended-thinking modes. The final climb from 80% to 88.7%, between late 2025 and April 2026, came mostly from base-model gains, as the leading scaffolds (Claude Code, SWE-agent, Mini-SWE-agent, Codex) had largely converged and the headline differences between top-five entries shrank to within scaffolding noise.

Verified leaderboard top entries (May 2026)

The public SWE-bench Verified leaderboard is in flux because many labs stopped submitting after OpenAI's February 2026 deprecation announcement. The May 2026 snapshot below combines submissions to swebench.com, llm-stats.com, and lab-reported scores.^[18] Anthropic's Claude Mythos Preview posted 93.9% but is excluded from the ranked table because it is not generally available; access is gated through Project Glasswing, an invitation-only partner program for roughly 50 critical-infrastructure security teams announced April 7, 2026.^[33]

Rank	Model / agent	Organization	% Resolved	Date
1	GPT-5.5	OpenAI	88.7%	2026-04
2	Claude Opus 4.7	Anthropic	87.6%	2026-04
3	GPT-5.3 Codex	OpenAI	85.0%	2026-03
4	Claude Opus 4.5	Anthropic	80.9%	2026-03
5	Claude Opus 4.6	Anthropic	80.8%	2026-03
6	Gemini 3.1 Pro	Google DeepMind	80.6%	2026-02
6	DeepSeek-V4-Pro-Max	DeepSeek	80.6%	2026-02
8	MiniMax M2.5	MiniMax	80.2%	2026-02
8	Kimi K2.6	Moonshot AI	80.2%	2026-03
10	GPT-5.2	OpenAI	80.0%	2026-02
11	Claude Sonnet 4.6	Anthropic	79.6%	2026-03
12	DeepSeek-V4-Flash-Max	DeepSeek	79.0%	2026-02
13	Qwen3.6 Plus	Alibaba	78.8%	2026-04
14	MiMo-V2-Pro	Xiaomi	78.0%	2026-03
14	Gemini 3 Flash	Google DeepMind	78.0%	2026-02
16	GLM-5	Zhipu AI	77.8%	2026-04

OpenAI's GPT-5.5, released April 23, 2026, is the first publicly available model to break the 88% Verified ceiling and currently leads the public board by 1.1 points over Claude Opus 4.7.^[32] The release notes attribute the gain to a fully retrained agentic backbone and report parallel improvements on Terminal-Bench 2.0 (82.7%), GDPval (84.9%), and a roughly 60% drop in hallucination rate. The fact that OpenAI continued to report a Verified number even after publicly deprecating the benchmark in February 2026 underscores how entrenched Verified remains as a marketing line.

SWE-bench Pro public top 5 (May 2026)

Performance on Pro is substantially lower than on Verified, reflecting both the harder long-horizon tasks and the much stronger contamination resistance of the GPL-licensed and private commercial repositories.^[14]

Rank	Model / agent	Organization	% Resolved
1	Claude Opus 4.7	Anthropic	64.3%
2	GPT-5.4 (xHigh)	OpenAI	59.1%
3	GPT-5.5	OpenAI	58.6%
4	GPT-5.3 Codex	OpenAI	56.8%
5	GPT-5.2 Codex	OpenAI	56.4%

Claude Mythos Preview leads the Pro public split at 77.8% but is again excluded from public ranking under the Project Glasswing restriction.^[33] The Pro leaderboard inverts the Verified ranking at the top: Claude Opus 4.7 sits 5.7 points ahead of GPT-5.5 on Pro despite trailing by 1.1 points on Verified, which several commentators have read as evidence that Pro and Verified are now measuring meaningfully different things.
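The two gaps quoted above follow directly from the Verified and Pro tables in this section. As a minimal sketch (the dictionary and function names are illustrative and not part of any leaderboard API), they can be recomputed like this:

```python
# Scores copied from the Verified and Pro tables in this section.
verified = {"GPT-5.5": 88.7, "Claude Opus 4.7": 87.6}
pro = {"GPT-5.5": 58.6, "Claude Opus 4.7": 64.3}


def lead(scores, a, b):
    """Return how many points model `a` leads model `b` by (negative if it trails)."""
    return round(scores[a] - scores[b], 1)


verified_gap = lead(verified, "GPT-5.5", "Claude Opus 4.7")
pro_gap = lead(pro, "Claude Opus 4.7", "GPT-5.5")

# The ordering is "inverted" when a different model leads on each benchmark.
inverted = verified_gap > 0 and pro_gap > 0
print(verified_gap, pro_gap, inverted)
```

Rounding to one decimal place matches the precision of the published scores and avoids floating-point noise in the subtraction.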

SWE-bench Lite top 5 (April 2026)

Rank	Model / agent	Organization	% Resolved
1	Claude Opus 4.6	Anthropic	62.7%
2	MiniMax M2.5	MiniMax	56.3%
3	Claude Sonnet 4.6	Anthropic	55.0%
4	GPT-5.3 Codex	OpenAI	53.7%
5	Gemini 3.1 Pro	Google DeepMind	52.0%

The gap between Verified scores (around 80% to 88%) and Pro scores (around 56% to 64%) highlights that harder, less-contaminated benchmarks still present significant challenges. Lite numbers run below Verified because Lite retains noisy task descriptions that the Verified curation later filtered out, so even an oracle agent struggles to push much above the mid-60s without leniency from the harness.

Cost-adjusted performance

Publishing top-of-leaderboard accuracy without cost has become controversial. A reasoning-heavy agent running Claude Opus 4.7 with 200K thinking tokens per task can spend $10 to $20 per attempt; a faster Sonnet-class scaffold may achieve 65% to 70% at well under $1 per task. Several papers in 2025 and 2026 have proposed Pareto-front reporting, plotting accuracy against dollar cost rather than ranking purely by accuracy. Scale AI's Pro leaderboard added cost columns in early 2026 to encourage more honest comparisons.
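Pareto-front reporting can be sketched in a few lines. The example below is an illustration only: the `Run` type, the run names, and every cost and accuracy value are hypothetical, chosen to resemble the ranges mentioned above rather than taken from any leaderboard.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    name: str
    cost_usd: float      # average dollars per task
    resolved_pct: float  # % Resolved


def pareto_front(runs: List[Run]) -> List[Run]:
    """Keep the runs that no other run beats on both cost and accuracy."""
    front = []
    best_so_far = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, most accurate first.
    for run in sorted(runs, key=lambda r: (r.cost_usd, -r.resolved_pct)):
        # A run earns its place only if it is more accurate than every cheaper run.
        if run.resolved_pct > best_so_far:
            front.append(run)
            best_so_far = run.resolved_pct
    return front


# Hypothetical numbers for illustration only.
runs = [
    Run("heavy-reasoning", 15.00, 87.0),
    Run("mid-scaffold", 3.00, 80.0),
    Run("fast-scaffold", 0.80, 68.0),
    Run("dominated", 4.00, 75.0),
]
front = pareto_front(runs)
print([r.name for r in front])
```

Here the "dominated" run drops out because a cheaper run is also more accurate, while the other three each represent a different cost and accuracy trade-off that a pure accuracy ranking would hide.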

Open-source agent frameworks

SWE-bench tasks need more than a language model generating code. They need an agent that can interact with a codebase, explore files, run tests, and iterate on solutions. The benchmark spawned a thriving ecosystem of open-source agent frameworks.

SWE-agent

SWE-agent, developed by the same Princeton team behind SWE-bench, is the official open-source baseline. Published at NeurIPS 2024, it introduced the concept of an Agent-Computer Interface (ACI): a set of custom shell commands designed to make repository navigation, code viewing, and editing easier for language models.^[8]

The architecture works as follows:

Environment setup. SWE-agent initializes a Docker container (or remote compute via Modal or AWS) through its SWE-ReX deployment layer.
Shell session. Inside the container, a persistent shell session is created. The ACI tools are installed as custom commands accessible from this session.
Agent loop. The language model receives the issue text and iteratively issues commands (search files, open files, edit code, run tests) through the shell session. After each command, the output is fed back to the model.
History compression. Because conversations grow long, a HistoryProcessor compresses the interaction history to fit within the model's context window.
Patch submission. Once the agent believes it has found and fixed the issue, it generates and submits a diff patch.
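The steps above can be sketched as a small loop. This is not SWE-agent's actual code or API: the function names, the command strings, and the stubbed model and shell are all hypothetical, and the history compression shown is a simple stand-in for whatever the real HistoryProcessor does.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (command, observation)


def compress_history(history: List[Turn], keep_last: int) -> List[Turn]:
    """Keep the most recent turns in full and elide older observations."""
    cutoff = max(len(history) - keep_last, 0)
    return [
        (cmd, "<output elided>") if i < cutoff else (cmd, obs)
        for i, (cmd, obs) in enumerate(history)
    ]


def run_agent(
    issue: str,
    model: Callable[[str, List[Turn]], str],
    execute: Callable[[str], str],
    max_steps: int = 20,
    keep_last: int = 5,
) -> Tuple[str, List[Turn]]:
    """Ask the model for a command, run it, and feed the output back."""
    history: List[Turn] = []
    for _ in range(max_steps):
        command = model(issue, compress_history(history, keep_last))
        if command == "submit":
            # Submission: return the working-tree diff as the patch.
            return execute("git diff"), history
        history.append((command, execute(command)))
    return "", history  # step budget exhausted without a submission


# Stubs standing in for a real LLM and a real container shell (hypothetical).
script = iter(["search 'parse_config'", "open config.py", "edit config.py", "run_tests", "submit"])


def scripted_model(issue: str, history: List[Turn]) -> str:
    return next(script)


def fake_shell(command: str) -> str:
    if command == "git diff":
        return "diff --git a/config.py b/config.py"
    return f"ran: {command}"


patch, history = run_agent("Example issue text", scripted_model, fake_shell)
print(patch)
print(len(history), "commands before submission")
```

Passing the model and the shell in as callables keeps the loop itself independent of any particular LLM backend or container runtime, which mirrors the separation between agent logic and environment described above.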

SWE-agent supports multiple LLM backends including GPT-4, Claude, and open-source models. When paired with Claude Opus 4.5, the Live-SWE-agent scaffold achieves 79.2% on SWE-bench Verified. The paper's enduring contribution is the argument that agent performance depends as much on interface design (which commands, what feedback, how truncation works) as on the underlying model. A weaker model with a good ACI can outperform a stronger model with a clumsy one.

Aider

Aider is a popular open-source command-line coding assistant by Paul Gauthier that pairs a chat interface with whole-file edits. It uses tree-sitter to map repository structure and offers a benchmark mode that runs SWE-bench instances. While Aider is designed primarily as an interactive tool rather than a fully autonomous agent, its Polyglot benchmark and per-model leaderboard helped popularize cost-adjusted reporting and the practice of comparing the same scaffold across many models.

OpenHands

OpenHands, formerly OpenDevin, is a community-driven open-source agent framework that consolidated several research scaffolds into a shared platform. By integrating browser, shell, and editor tools and supporting multiple LLM backends, OpenHands has been used to reproduce and extend submissions from Anthropic, Mistral, and academic labs. It powers a number of mid-tier entries on the public leaderboard.

Devin

Devin, announced by Cognition Labs in March 2024, was the first commercial AI software engineering agent to gain mass attention. On its initial SWE-bench evaluation, Devin resolved 13.86% of tasks unassisted (79 of 570 tested), far exceeding the previous best of 1.96% (unassisted) and 4.80% (assisted with oracle retrieval).^[15] Notably, 72% of Devin's successful resolutions took over 10 minutes, indicating that its ability to iterate, run tests, and refine solutions contributed to its performance.^[15]

Devin's launch demo, which showed the agent shipping a Bun benchmark, debugging a YOLO model, and posting on Upwork, generated extraordinary press coverage and helped Cognition raise more than $175 million. Subsequent independent evaluations were more skeptical: a January 2025 review by Answer.AI found Devin completed 3 of 20 real-world tasks, with several runs ending in malformed PRs. Even so, the launch marked the start of public competition over SWE-bench scores as a marketing channel.

Anthropic and Claude Code

Anthropic has reported the strongest sustained results on SWE-bench Verified across the Claude 3.5, 4, 4.5, 4.6, and 4.7 generations. The lab's Claude Code terminal agent, launched in early 2025, was tuned partly with SWE-bench-style harnesses and is the reference scaffold for Anthropic's reported numbers. By late 2025, Claude Code's bash tool, file editor, and computer use capabilities allowed it to resolve a wide range of issues with relatively shallow scaffolding, with much of the heavy lifting performed by the model itself.

Other notable agents

Agent	Developer	Approach
Amazon Q Developer Agent	Amazon	Enterprise-integrated agent with AWS tooling
Atlassian Rovo Dev	Atlassian	Agentic coding within Jira/Bitbucket ecosystem
Cursor	Anysphere	IDE-based agent with human-in-the-loop editing
Codex	OpenAI	Cloud-based agent running in sandboxed environments
Augment Code (Auggie CLI)	Augment	Context-aware agent for large enterprise codebases
Moatless Tools	Independent	Lightweight scaffold popular for low-cost runs
Agentless	UIUC	Pipeline-based; no agent loop, used for early Verified baselines
AutoCodeRover	NUS	Spectrum-based fault localization plus targeted edits
CodeR	Huawei	Multi-agent design with role specialization
Mini-SWE-agent	Princeton	Minimalist 100-line agent using only bash; >74% on Verified
iSWE-Agent	IBM	Specialized agent for Java issue resolution on Multi-SWE-bench

A notable recent direction is the minimalist approach exemplified by Mini-SWE-agent, which achieves competitive scores using only bash commands and no custom tools. This suggests that, at the frontier, model quality matters more than scaffolding sophistication: one model run through a sophisticated harness and through a bash-only loop can score within a few points either way, while two different models in the same harness can differ by twenty points.

Research approaches

Beyond commercial agents, academic research has explored several innovative directions:

Multi-agent systems. Coordinating multiple specialized agents (one for bug localization, one for patch generation, one for test validation) to divide the software engineering workflow.
Retrieval-augmented generation. Enhancing agent context with relevant code examples from the broader repository or external documentation.
Self-debugging loops. Iterative cycles where the agent generates a patch, runs tests, analyzes failures, and refines the patch until tests pass.
Tool-augmented reasoning. Integrating static analysis tools, type checkers, and linters into the agent's toolkit to catch errors before test execution.
Fine-tuning on SWE-bench-train. Using the 19,000 training instances to fine-tune smaller models for software engineering tasks, as demonstrated in the original paper with CodeLlama.
Best-of-N sampling. Generating many candidate patches and using a verifier (often an LLM-as-judge with regression testing) to pick the best one. This trades cost for accuracy and powers some of the highest leaderboard entries.
Sub-agent decomposition. Spawning sub-agents that each tackle one step (file localization, test reading, patch synthesis) and report back to a parent planner. Anthropic's Claude Code and the Cognition Devin pipeline both employ a version of this design.
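Of the directions listed above, best-of-N sampling is the simplest to sketch. The function below is a hypothetical illustration: the regression check and the judge are passed in as callables, and the stub candidates and scores are invented for the example.

```python
from typing import Callable, List, Optional


def best_of_n(
    candidates: List[str],
    passes_regression: Callable[[str], bool],
    judge_score: Callable[[str], float],
) -> Optional[str]:
    """Pick the highest-scoring candidate among those that pass regression tests."""
    survivors = [patch for patch in candidates if passes_regression(patch)]
    if not survivors:
        return None  # every candidate broke existing behavior
    return max(survivors, key=judge_score)


# Hypothetical candidates and judge scores, for illustration only.
candidates = ["patch-a", "patch-b-breaks-tests", "patch-c"]
scores = {"patch-a": 0.4, "patch-b-breaks-tests": 0.9, "patch-c": 0.7}

chosen = best_of_n(
    candidates,
    passes_regression=lambda patch: "breaks" not in patch,
    judge_score=lambda patch: scores[patch],
)
print(chosen)
```

Filtering on regression tests before consulting the judge means a patch the judge likes but that breaks existing tests can never be selected. The cost of the approach grows with the number of candidates, which is the cost-for-accuracy trade described above.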

Critique and contamination concerns

Is SWE-bench contaminated by training data?

Over 94% of SWE-bench issues were filed before the knowledge cutoff dates of major pre-trained language models. This raises the risk that models may have encountered the issues, the discussions, or even the solution code during pre-training. SWE-bench Live and SWE-bench Pro attempt to address this by using newer issues, but the original benchmark and its Verified subset remain exposed, a concern borne out by the audit OpenAI published in February 2026.^[20]

In late 2025 OpenAI ran a careful audit and found concrete evidence of contamination.^[20] Every frontier model the lab tested (GPT-5.2, Claude Opus 4.5, Gemini 3 Flash) could reproduce verbatim portions of the gold patch or problem statement when prompted with only a Verified task ID. In the django__django-14725 case, GPT-5.2 demonstrated knowledge of an edit_only parameter that was introduced in the Django 4.1 release notes and was not derivable from the issue description. In other cases, the model emitted exact class and method names, exact early-return conditions, and other implementation details that could only have come from training data.

OpenAI built an automated red-teaming setup that tasked GPT-5 with probing GPT-5.2-Chat, Claude Opus 4.5, and Gemini 3 Flash for contamination across each Verified question.^[20] The results showed GPT-5.2 could solve 31 tasks the lab classified as "almost impossible to solve" without contamination.^[20]

Solution leakage

Research by the SWE-bench+ team (OpenLM.ai) found that approximately 32.67% of successful patches on the original benchmark involved solution leakage, where the issue report or its comments contained the solution code or strong hints.^[16] In a broader analysis, roughly 60% of resolved instances showed some form of direct or indirect solution leakage.^[16] When these problematic instances were filtered out, SWE-agent + GPT-4's resolution rate dropped from 12.47% to 3.97%, roughly a three-fold reduction.^[16]

Test suite weaknesses

An empirical study ("Are 'Solved Issues' in SWE-bench Really Solved Correctly?", arXiv:2503.15223) found that over 15% of SWE-bench Verified instances have incomplete test patches that allow incorrect or partial solutions to pass.^[17] Specifically, 12.5% of passing patches were functionally or semantically incorrect, and 9.82% were incomplete, addressing only part of the issue or lacking necessary error handling.^[17] Frameworks like UTBoost and PatchDiff revealed that leaderboard scores may be inflated by 6 to 7 percentage points due to these test inadequacies.^[17]

OpenAI's later audit reported even higher rates among the hardest unsolved tasks. In a review of 138 problems repeatedly missed by frontier models, roughly 59.4% had flawed test cases that made them unsolvable as stated.^[20] Forty-nine problems had tests that were too narrow, rejecting functionally correct submissions, while twenty-six had tests that were too wide, requiring features never mentioned in the issue.^[20] With so many of the remaining failures traceable to the tests themselves, further accuracy gains were no longer reliably measuring model improvements.^[20]
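The category counts quoted above can be converted into shares of the 138 audited problems. The sketch below uses only those figures; the variable names are illustrative, and the reading of the leftover gap is an inference from the arithmetic, not a statement from the audit.

```python
AUDITED = 138      # problems reviewed in the audit
TOO_NARROW = 49    # tests rejecting functionally correct submissions
TOO_WIDE = 26      # tests requiring features absent from the issue


def share(count: int, total: int = AUDITED) -> float:
    """Return `count` as a percentage of `total`, rounded to one decimal place."""
    return round(100 * count / total, 1)


narrow_pct = share(TOO_NARROW)
wide_pct = share(TOO_WIDE)
combined_pct = share(TOO_NARROW + TOO_WIDE)
print(narrow_pct, wide_pct, combined_pct)
```

The two named categories together cover a little over half of the audited problems, so the 59.4% flawed-test figure presumably includes a small remainder of other flaw types.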

Self-reported scores

The SWE-bench leaderboard relies heavily on self-reported results. As of early 2026, none of the 77 entries on the Verified leaderboard had been independently verified.^[18] This creates room for cherry-picking favorable configurations or evaluation conditions. Independent reproduction efforts (SWE-rebench, OpenHands trials) have at times found significant discrepancies between reported and reproduced scores.

Scaffolding variance

Performance on SWE-bench depends heavily on the agent scaffolding wrapped around the underlying model, not just the model itself. The same model can achieve very different scores depending on how it explores the repository, how many attempts it gets, and how it uses tool calls. This makes fair comparison hard unless the scaffolding is standardized, which Pro's SEAL evaluation partially addresses.

Python-centric focus and repository selection bias

The original SWE-bench, Verified, and Lite are Python-only, so performance does not necessarily generalize to Java, C++, TypeScript, or Go. The 12 repositories, while popular, are all open-source Python projects with strong test cultures, which makes them a narrow slice of the full software ecosystem. Many real-world codebases have sparse test coverage, proprietary dependencies, or architectural patterns not represented in these repositories. With Django accounting for 37% to 46% of tasks depending on the variant, scores are disproportionately influenced by one codebase. As a result, high SWE-bench scores may not predict performance on arbitrary production codebases. SWE-bench Multilingual, Multi-SWE-bench, and SWE-bench Live partially address the language gap but remain less widely adopted than Verified.

How hard are SWE-bench Verified tasks?

Epoch AI's analysis showed that the majority of SWE-bench Verified tasks are relatively simple: 91% can be completed by a human in under one hour, and the median gold patch changes only a handful of lines of code.^[11] This means that the benchmark primarily measures an agent's ability to fix straightforward bugs rather than tackle complex architectural challenges or large feature implementations. Epoch summarized the point directly: "the benchmark really tests whether AI can make simple codebase edits."^[11]

Computational cost

Running a full SWE-bench evaluation is resource-intensive, requiring at minimum 120 GB of free disk space, 16 GB of RAM, and 8 CPU cores (with 32 GB RAM recommended for parallel execution).^[10] The Docker image build process and per-instance container setup add significant overhead. Cloud evaluation through Modal reduces the hardware burden but introduces monetary costs.
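A host can be checked against these minimums before starting a run. The sketch below is a hypothetical pre-flight helper, not part of the SWE-bench tooling; it uses only the Python standard library, and the RAM probe relies on `os.sysconf`, which is available on some Unix systems only.

```python
import os
import shutil

# Minimums quoted in the text above (decimal gigabytes).
MIN_DISK_GB, MIN_RAM_GB, MIN_CPUS = 120, 16, 8


def shortfalls(free_disk_gb, ram_gb, cpus):
    """Return a list of human-readable problems; empty means the host qualifies."""
    problems = []
    if free_disk_gb < MIN_DISK_GB:
        problems.append(f"disk: {free_disk_gb:.0f} GB free, need {MIN_DISK_GB}")
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM: {ram_gb:.0f} GB, need {MIN_RAM_GB}")
    if cpus < MIN_CPUS:
        problems.append(f"CPU: {cpus} cores, need {MIN_CPUS}")
    return problems


def measure_host(path="."):
    """Measure free disk space, RAM, and CPU count for the current machine."""
    free_disk_gb = shutil.disk_usage(path).free / 1e9
    cpus = os.cpu_count() or 1
    try:
        ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
    except (AttributeError, ValueError, OSError):
        # RAM is unknown on this platform; NaN compares False, so it is not flagged.
        ram_gb = float("nan")
    return free_disk_gb, ram_gb, cpus


print(shortfalls(200, 32, 16))       # a host that meets every minimum
print(shortfalls(100, 8, 4))         # a host that misses all three
print(shortfalls(*measure_host()))   # the current machine
```

Separating the comparison from the measurement keeps the threshold logic testable with fixed inputs, while the measurement step can be swapped for whatever the deployment platform provides.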

Is SWE-bench saturated?

With multiple labs reporting Verified scores in the 80s, GPT-5.5 publicly posting 88.7%, and Anthropic's restricted Claude Mythos Preview reaching 93.9%, the benchmark has clearly entered the saturation regime.^[32]^[33] Saturation does not mean coding is solved; it means the gap between the best models and the score ceiling has shrunk to within the noise of the harness. OpenAI's deprecation post acknowledged this directly, framing the move to Pro as a response to saturation rather than a critique of the benchmark's design.^[20] OpenAI's own audit estimated that around 59.4% of the hardest unsolved Verified problems have flawed test cases, which puts an effective ceiling on what a fair Verified score can mean somewhere in the low 90s even before contamination effects are considered.^[20]

Why did OpenAI deprecate SWE-bench Verified?

On February 23, 2026, OpenAI's Mia Glaese and Olivia Watkins published a blog post announcing that the lab would no longer report SWE-bench Verified scores for frontier model releases and would instead recommend SWE-bench Pro as the new community standard.^[20] The post stated the conclusion bluntly: "Improvements on SWE-bench Verified no longer reflect meaningful improvements in models' real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time."^[20] Three findings drove the decision.

First, the unsolved tasks contained a high proportion of broken specifications. As detailed under test suite weaknesses, OpenAI's audit of 138 problems repeatedly missed by frontier models found that roughly 59.4% had flawed test cases: forty-nine had tests too narrow to accept functionally correct submissions, and twenty-six had tests so wide that they required features never mentioned in the issue.^[20]

Second, every frontier model OpenAI tested could reproduce verbatim portions of the gold patch when prompted with a task ID, evidence of training-data leakage.^[20] Models implicated in the audit included GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash, all of which appeared to have memorized parts of the dataset. In one striking example, GPT-5.2's chain-of-thought traces revealed knowledge of unspecified test requirements, suggesting the test patches had appeared somewhere in training data.

Third, the benchmark was simply saturating. With public scores in the high 80s and internal scores in the 90s, headroom no longer justified continued reporting, especially when the upper bound appeared to be limited by harness errors rather than capability gaps.

OpenAI's recommendation was to migrate to SWE-bench Pro, which uses 1,865 longer tasks across more diverse codebases and showed substantially less contamination evidence in their tests.^[14] Several other labs (Google DeepMind, Anthropic, Meta) continued to publish Verified scores after the announcement but began including Pro numbers as well. OpenAI itself led the launch of GPT-5.5 on April 23, 2026 with an 88.7% Verified headline figure, only two months after the deprecation post.^[32] The joint reporting pattern, with Verified retained as the legibility metric and Pro increasingly relied on as the contamination-resistant signal, is likely to persist through 2026 as the community converges on a successor.

Industry impact

SWE-bench fundamentally changed how the AI community thinks about code generation evaluation. Before 2023, the field was dominated by function-level benchmarks like HumanEval and MBPP, which test a narrow and somewhat artificial skill. SWE-bench showed that real software engineering is qualitatively different from writing isolated functions, and it provided the first rigorous way to measure progress on the harder task.

Several specific impacts stand out:

Agentic evaluation. SWE-bench was one of the first benchmarks to require an agent (not just a model) to interact with an environment over multiple steps. This pushed the field toward building and evaluating AI agents rather than just language models. Most modern "agent" research traces its evaluation lineage back to SWE-bench.

Industry adoption as a marketing standard. SWE-bench scores became a standard metric in press releases and marketing materials for AI coding products. When Cognition launched Devin in March 2024 with a SWE-bench score of 13.86%, it generated enormous attention.^[15] When Claude 3.5 Sonnet hit 49% in October 2024, it was headline news. The benchmark gave companies a concrete, legible number to compete on. Every Anthropic, OpenAI, Google DeepMind, DeepSeek, MiniMax, Moonshot, and Alibaba release through 2025 and 2026 has reported a Verified number alongside its other coding benchmarks.

Research direction. SWE-bench influenced research priorities by highlighting that code generation was only part of the problem. Repository understanding, long-context reasoning, multi-file editing, and test-driven development became active research areas largely because SWE-bench made them measurable. The agent scaffolding ecosystem (SWE-agent, OpenHands, Aider, Auggie, Cursor's agent mode, Claude Code) grew up explicitly around SWE-bench-style tasks.

Investment and product development. Venture capital firms have referenced SWE-bench leaderboard positions when evaluating AI coding startups. Companies building AI coding assistants (Cursor, GitHub Copilot, Augment Code, Codeium, Replit Agent, OpenHands, Aider) use SWE-bench as a north-star metric for agent capability, driving investment in agentic features beyond simple code completion. Enterprise buyers assessing AI coding assistants for internal use often run SWE-bench evaluations as part of their procurement process.

Benchmark evolution. The trajectory from SWE-bench to Lite to Verified to Multimodal to Multilingual to Live to Pro illustrates a broader pattern in AI evaluation: benchmarks get saturated or contaminated, prompting harder and cleaner successors. Each iteration has pushed the frontier of what AI coding evaluation means, and the same pattern has played out across MMLU, GPQA, and other knowledge benchmarks.

Academic impact. The original paper was selected for oral presentation at ICLR 2024, one of the top machine learning conferences, reflecting its significance to the field.^[1] It has been cited in hundreds of research publications, and it spawned an entire family of related benchmarks (Verified, Lite, Multimodal, Multilingual, Live, Pro) plus third-party derivatives like SWE-bench+ and SWE-rebench.

Companion benchmarks

SWE-bench occupies a particular slice of the coding evaluation landscape. The table below contrasts it with other widely cited benchmarks.

Benchmark	Granularity	Repo context	Agent loop	Languages	Typical task length
HumanEval	Function	None	No	Python	< 1 minute
MBPP	Function	None	No	Python	< 1 minute
Codeforces	Algorithm	Problem only	No	Multiple	Minutes to hours
CodeContests	Algorithm	Problem only	No	Multiple	Minutes
LiveCodeBench	Algorithm (timed)	Problem only	No	Python	Minutes
BigCodeBench	Function with libraries	Single file	No	Python	Minutes
RepoBench	Line completion	Repository	No	Python	Seconds
CrossCodeEval	Cross-file completion	Repository	No	Py, Java, C#, TS	Seconds
SWE-bench	Issue resolution	Full repository	Yes	Python	5 to 60 minutes
SWE-bench Pro	Issue resolution	Full repository	Yes	Multiple	1 to 4+ hours
SWE-Lancer	Freelance project	Full repository, real money	Yes	Multiple	Minutes to days
Aider Polyglot	Whole-file edits	Single problem file	Limited	6 languages	Minutes
RealCode	Multi-step coding	Curated repo	Yes	Multiple	Minutes to hours

The contrast with HumanEval is the cleanest illustration of how the field has matured. HumanEval asks: "Given this docstring, write the function body." SWE-bench asks: "Given this issue, find the relevant code in a 50,000-file repository, understand why a test is failing, write a multi-file patch, and verify it passes the test suite without breaking anything else." The first is solved by autocomplete-quality code generation. The second requires planning, navigation, and self-correction, which is why SWE-bench became the canonical agent benchmark while HumanEval is now mostly a smoke test.

SWE-Lancer

Launched by OpenAI in February 2025, SWE-Lancer evaluates models on more than 1,400 real freelance software engineering tasks scraped from Upwork and verified by Expensify, with payouts totaling roughly $1 million in real dollars.^[35] Tasks range from $50 bug fixes to $32,000 feature implementations and are split between individual contributor (IC) work and managerial decisions over technical proposals. Initial scores were modest: Claude 3.5 Sonnet completed 26.2% of IC tasks and 44.9% of managerial tasks, while OpenAI's o1 reached 20.3% and 46.3% respectively.^[35] The benchmark introduced the unusual practice of mapping AI capability to dollars earned, a metric that resonates outside academic circles.
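Dollar-weighted scoring differs from a plain pass rate in a way that a short sketch makes concrete. The task list below is hypothetical; only the $50 and $32,000 endpoints echo the payout range quoted above, and the field names are assumptions made for the example.

```python
def earnings(tasks, passed_ids):
    """Sum payouts for tasks whose ID is in `passed_ids`; return (earned, total)."""
    total = sum(task["payout_usd"] for task in tasks)
    earned = sum(task["payout_usd"] for task in tasks if task["id"] in passed_ids)
    return earned, total


# Hypothetical tasks, for illustration only.
tasks = [
    {"id": "t1", "payout_usd": 50},
    {"id": "t2", "payout_usd": 500},
    {"id": "t3", "payout_usd": 32_000},
]
passed = {"t1", "t2"}

earned, total = earnings(tasks, passed)
pass_rate = len(passed) / len(tasks)
dollar_rate = earned / total
print(f"pass rate {pass_rate:.1%}, dollars earned {dollar_rate:.1%} (${earned} of ${total})")
```

In this toy case the agent resolves two of three tasks yet earns under 2% of the available payout, because the single expensive task dominates the total. That asymmetry is what a dollar-weighted metric is meant to surface.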

Subsequent variants, SWE-Lancer Diamond and Pro, focus on harder Upwork tasks that took the original human freelancers more than ten hours, with stricter test harnesses and human review of edge cases. By April 2026, top frontier scores on the Diamond split had climbed into the 30s but still trailed the headline numbers reported on Verified.

LiveSWEBench and SWE-rebench

LiveSWEBench and SWE-rebench follow the same general philosophy as SWE-bench Live: refresh the dataset with newer issues to outpace training cutoffs.^[38] They differ in repository selection, harness design, and update cadence. Both have published their own leaderboards through 2025 and 2026.

Industry-specific successors

Several companies have built proprietary internal versions of SWE-bench using their own codebases. Anthropic, Google, and Meta have all alluded to such sets in technical reports. Scale AI's SWE-bench Pro includes a private split that customers can run against frontier models to generate scores that are not subject to public leakage. The pattern points toward a future where public benchmarks function as broad capability checkpoints while private, contamination-resistant evaluations drive enterprise procurement.

Other companions

LiveCodeBench, BigCodeBench, RealCode, and Codeforces-style competitive coding sit alongside SWE-bench in the typical model release scorecard. Most frontier announcements through 2025 and 2026 report at least three of: SWE-bench Verified, Aider Polyglot, LiveCodeBench, and a Codeforces Elo score. Together they paint a picture across function-level competence, multi-file edits, real-world issues, and competitive algorithmic skill.

Benchmark	Focus	Size	Languages
HumanEval	Function-level code completion	164 problems	Python
MBPP	Basic Python programming	974 problems	Python
CodeContests	Competitive programming	13,328 problems	Multiple
DS-1000	Data science coding	1,000 problems	Python
RepoEval	Repository-level code completion	1,600 problems	Python
CrossCodeEval	Cross-file code completion	9,928 problems	Python, Java, C#, TypeScript
LiveCodeBench	Contamination-resistant competitive coding	Continuously updated	Python
SWE-bench+	SWE-bench with leakage removed	Filtered subset	Python
SWE-rebench	Dynamic, decontaminated evaluation	Continuously updated	Python
SWE-Lancer	Freelance software work, dollar-weighted	1,400+ tasks	Multiple
GAIA	General AI assistants	466 questions	Multiple
AgentBench	LLM agent capabilities	8 environments	Multiple

Future directions

Expanded language coverage

The SWE-bench team and independent researchers continue to extend the benchmark to additional programming languages. SWE-bench Multilingual already covers 9 languages, and Multi-SWE-bench adds more annotated instances for Java, TypeScript, and other ecosystems. Future work may cover the C/CUDA extensions behind Python's ML stack, mobile development languages (Swift, Kotlin), and systems programming languages.

Harder task distributions

SWE-bench Pro and similar efforts aim to move beyond simple bug fixes toward tasks that require deeper reasoning: large refactoring operations, security vulnerability remediation, performance optimization across multiple modules, and feature implementation that touches dozens of files. These harder distributions provide a more meaningful signal as top agents approach saturation on the Verified set.

Continuous and live evaluation

SWE-bench Live's monthly update cadence represents a shift toward continuous benchmarking that keeps pace with model training cutoffs. This approach may become the standard for preventing data contamination, with evaluation sets that always contain fresh, unseen issues.
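The core idea of a live benchmark, evaluating only on issues newer than a model's training cutoff, can be sketched as a date filter. The instance records, field names, and cutoff date below are hypothetical and do not reflect the actual SWE-bench Live schema.

```python
from datetime import date


def fresh_instances(instances, training_cutoff):
    """Keep only issues created strictly after the model's training cutoff."""
    return [inst for inst in instances if inst["created_at"] > training_cutoff]


# Hypothetical instances and cutoff, for illustration only.
instances = [
    {"instance_id": "repo-a-101", "created_at": date(2025, 3, 14)},
    {"instance_id": "repo-b-202", "created_at": date(2026, 1, 9)},
    {"instance_id": "repo-c-303", "created_at": date(2026, 4, 2)},
]

fresh = fresh_instances(instances, training_cutoff=date(2025, 12, 31))
print([inst["instance_id"] for inst in fresh])
```

Because each model has its own cutoff, the same pool of instances yields a different evaluation set per model, which is one reason a regularly refreshed pool is needed to keep enough fresh tasks available for the newest models.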

Integration with real development workflows

Future benchmarks may extend beyond isolated issue resolution to evaluate agents across the full software development lifecycle: writing design documents, creating pull requests, responding to code review feedback, managing CI/CD pipelines, and triaging incoming bug reports. SWE-Lancer's freelance simulation is one early step in this direction.

Human-AI collaboration metrics

Current benchmarks measure fully autonomous performance, but most practical deployments involve human-AI collaboration. Future evaluation frameworks may measure how effectively an agent assists a human developer rather than replacing them entirely, capturing metrics like time saved, suggestion acceptance rate, and code quality improvement.

References

1. Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2024). "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" *International Conference on Learning Representations (ICLR 2024)*. https://arxiv.org/abs/2310.06770
2. ICLR 2024 Conference Program. SWE-bench oral presentation listing.
3. SWE-bench Evaluation Documentation. https://www.swebench.com/SWE-bench/guides/evaluation/
4. OpenAI. (2024). "Introducing SWE-bench Verified." https://openai.com/index/introducing-swe-bench-verified/
5. SWE-bench Lite. https://www.swebench.com/lite.html
6. Yang, J. et al. (2024). "SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?" arXiv:2410.03859. https://arxiv.org/abs/2410.03859
7. Microsoft Research. (2025). "SWE-bench Goes Live!" arXiv:2505.23419. https://arxiv.org/abs/2505.23419
8. Yang, J. et al. (2024). "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering." *NeurIPS 2024*. https://swe-agent.com/
9. Princeton Language and Intelligence. (2023). "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" https://pli.princeton.edu/blog/2023/swe-bench-can-language-models-resolve-real-world-github-issues
10. SWE-bench Docker Setup and Cloud Evaluation. https://www.swebench.com/SWE-bench/guides/docker_setup/
11. Epoch AI. (2024). "What skills does SWE-bench Verified evaluate?" https://epoch.ai/blog/what-skills-does-swe-bench-verified-evaluate
12. SWE-bench Multilingual. https://www.swebench.com/multilingual.html
13. ByteDance Seed. (2025). "Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving." *NeurIPS 2025 Datasets and Benchmarks*. arXiv:2504.02605. https://arxiv.org/abs/2504.02605
14. Scale AI. (2025). "SWE-bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?" arXiv:2509.16941. https://arxiv.org/abs/2509.16941
15. Cognition Labs. (2024). "SWE-bench Technical Report." https://cognition.ai/blog/swe-bench-technical-report
16. OpenLM.ai. (2024). "SWE-Bench+: Enhanced Coding Benchmark for LLMs." arXiv:2410.06992. https://arxiv.org/abs/2410.06992
17. "Are 'Solved Issues' in SWE-bench Really Solved Correctly? An Empirical Study." (2025). arXiv:2503.15223. https://arxiv.org/abs/2503.15223
18. SWE-bench Verified leaderboard, llm-stats.com (April 2026). https://llm-stats.com/benchmarks/swe-bench-verified
19. Scale Labs. SWE-bench Pro Leaderboard (Public). https://labs.scale.com/leaderboard/swe_bench_pro_public
20. OpenAI. (2026). "Why SWE-bench Verified no longer measures frontier coding capabilities." https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/
21. Latent Space. (2026). "The End of SWE-Bench Verified, with Mia Glaese & Olivia Watkins." https://www.latent.space/p/swe-bench-dead
22. SWE-bench Live project page. https://swe-bench-live.github.io/
23. SWE-bench GitHub Repository. https://github.com/SWE-bench/SWE-bench
24. Anthropic. (2024). "Claude 3.5 Sonnet (new) and computer use." Model release notes. https://www.anthropic.com/news/3-5-models-and-computer-use
25. Anthropic. (2025). "Claude 3.7 Sonnet and Claude Code." Model release notes. https://www.anthropic.com/news/claude-3-7-sonnet
26. Anthropic. (2025). "Claude Opus 4 and Sonnet 4." Model release notes. https://www.anthropic.com/news/claude-4
27. Anthropic. (2025). "Claude Sonnet 4.5." Model release notes. https://www.anthropic.com/news/claude-sonnet-4-5
28. Anthropic. (2025). "Claude Opus 4.5." Model release notes. https://www.anthropic.com/news/claude-opus-4-5
29. Anthropic. (2026). "Claude Opus 4.7." Model release notes. https://www.anthropic.com/news/claude-opus-4-7
30. OpenAI. (2025). "Introducing GPT-5." Model release notes. https://openai.com/index/introducing-gpt-5/
31. Google DeepMind. (2025). "Gemini 3." Model announcement. https://deepmind.google/technologies/gemini/
32. OpenAI. (2026). "Introducing GPT-5.5." Model release notes, April 23, 2026. https://openai.com/index/introducing-gpt-5-5/
33. Anthropic. (2026). "Claude Mythos and Project Glasswing." Announcement, April 7, 2026. https://www.anthropic.com/news/claude-mythos
34. DeepSeek-AI. (2025). "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948. https://arxiv.org/abs/2501.12948
35. OpenAI. (2025). "Introducing the SWE-Lancer Benchmark." https://openai.com/index/swe-lancer/
36. PricePerToken. (2026). "SWE-bench Lite Leaderboard 2026." https://pricepertoken.com/leaderboards/benchmark/swe-bench-lite
37. SWE-bench dataset on Hugging Face. https://huggingface.co/datasets/princeton-nlp/SWE-bench
38. SWE-rebench Leaderboard. https://swe-rebench.com
