SWE-Bench Pro

Updated Jul 23, 2026

SWE-Bench Pro (stylized SWE-BENCH PRO) is a contamination-resistant benchmark, released by Scale AI in September 2025, that measures whether an AI coding agent can resolve long-horizon, multi-file software engineering tasks drawn from real professional codebases. It contains 1,865 problems sourced from 41 actively maintained repositories spanning consumer applications, B2B services, and developer tools, partitioned into a public set, a held-out set, and a commercial set of proprietary codebases.^[1]^[2] At launch, the best frontier model (OpenAI GPT-5) resolved just 23.3 percent of public-set tasks, compared with scores above 70 percent on the older SWE-bench Verified, making SWE-Bench Pro one of the hardest widely reported coding benchmarks of the 2025-2026 period.^[1]^[11] Tasks are described in the source paper as long-horizon problems that "may require hours to days for a professional software engineer to complete," with reference solutions touching an average of 4.1 files and modifying around 107 lines of code.^[1]^[3]

SWE-Bench Pro was built as a successor to the influential SWE-bench and SWE-bench Verified benchmarks, explicitly to address two problems that had eroded the signal of earlier coding benchmarks by late 2025: training-data contamination of public test suites, and saturation by frontier models. Following an OpenAI audit of SWE-bench Verified in early 2026 that identified widespread test flaws and evidence of contamination, OpenAI publicly stopped reporting Verified scores and recommended that frontier-model developers report SWE-Bench Pro instead.^[4]^[5] By mid-2026, SWE-Bench Pro had become a de facto industry standard for measuring frontier coding-agent capability, with the Scale Labs leaderboard tracking submissions from major laboratories including Anthropic, OpenAI, and Google DeepMind across both the public and commercial dataset splits.^[6]^[7]

What is SWE-Bench Pro?

SWE-Bench Pro is a test-based evaluation in the SWE-bench lineage: for each task an agent receives a real issue's problem statement and the pre-fix snapshot of a repository, and must produce a patch that makes a hidden test suite pass without breaking existing functionality. What distinguishes it from earlier benchmarks is the difficulty and provenance of the tasks. The source paper introduces it as "a substantially more challenging benchmark that builds upon the best practices of SWE-Bench, but is explicitly designed to capture realistic, complex, enterprise-level problems beyond the scope of SWE-Bench," and frames the result as "a contamination-resistant testbed that more faithfully captures the complexity and diversity of real-world software development."^[2]^[3]

Three properties define the benchmark: tasks are long-horizon and multi-file (averaging 4.1 files and 107.4 lines of code changed); the underlying repositories are chosen to be unlikely to appear in model pretraining data; and a large share of the dataset (the held-out and commercial splits) is never released publicly, so it cannot be scraped into training corpora.^[1]^[3]

Background

SWE-bench and SWE-bench Verified

The original SWE-bench was released in October 2023 by Princeton University researchers and introduced 2,294 issue-resolution tasks drawn from twelve popular Python open-source repositories on GitHub. Each instance pairs a real GitHub issue and its associated codebase with one or more test cases that verify that a candidate patch resolves the issue without breaking existing functionality.^[8] SWE-bench rapidly became a flagship evaluation for measuring the practical coding capabilities of large language models, because unlike contemporary code benchmarks focused on isolated function synthesis, it required agents to navigate a multi-file repository, identify the relevant code, and produce a working patch.

In August 2024, OpenAI released SWE-bench Verified, a 500-instance subset of SWE-bench filtered through human annotation to remove problems with under-specified issues or overly restrictive tests. SWE-bench Verified became the most widely cited coding benchmark across vendor announcements through 2024 and 2025, with reported scores climbing from below 30 percent in early 2024 to above 80 percent for top frontier models by mid-2025.^[4]

Why was SWE-Bench Pro created?

By 2025, several issues with the SWE-bench family were becoming acute. The first was saturation: top scores on SWE-bench Verified rose from roughly 75 percent to over 80 percent over a six-month window, and the spread between frontier laboratories' systems compressed to within a few percentage points, limiting the benchmark's ability to differentiate truly capable systems from incrementally improved ones.^[4]^[9] Vendor announcements increasingly relied on small differences in reported Verified scores, while independent reproductions varied by 10 percentage points or more depending on the agent scaffolding used to drive the underlying model.

The second issue was contamination. Because SWE-bench Verified is a fixed, fully public set of 500 problems drawn from open-source repositories that have been hosted on GitHub for years, the benchmark's problem statements, repository contents, and gold-standard patches are all likely to be present in the data used to pretrain modern language models. An audit performed by OpenAI's evaluation team and published in February 2026 reported that contemporary frontier models, including GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash Preview, could reproduce original gold-patch solutions from training memory when given only a SWE-bench Verified task identifier as a prompt.^[4]^[5] To gauge test quality, OpenAI audited the 138 hardest Verified problems, those its o3 model could not solve consistently across 64 independent runs, and found that 59.4 percent of them contained flawed test cases: roughly 35.5 percent of the audited problems had tests that were too strict and rejected functionally correct submissions, while about 18.8 percent had tests that checked for behavior the issue never specified.^[4]^[14]
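The audit percentages can be turned into approximate problem counts to see how they relate. The sketch below only rounds the shares quoted above against the 138 audited problems; the resulting counts are derived here and are not taken from the audit itself. The two named categories cover about 54.3 of the 59.4 percentage points, and the text above does not break down the remainder.

```python
# Figures quoted in the paragraph above (OpenAI audit of SWE-bench Verified).
AUDITED_PROBLEMS = 138
FLAWED_SHARE = 0.594       # audited problems with flawed test cases
TOO_STRICT_SHARE = 0.355   # tests that reject functionally correct patches
UNSPECIFIED_SHARE = 0.188  # tests for behavior the issue never specified


def approx_count(share: float, total: int = AUDITED_PROBLEMS) -> int:
    """Convert a reported share into the nearest whole number of problems."""
    return round(share * total)


flawed = approx_count(FLAWED_SHARE)
too_strict = approx_count(TOO_STRICT_SHARE)
unspecified = approx_count(UNSPECIFIED_SHARE)
# Flawed problems not covered by the two named categories (derived, not reported).
other = flawed - too_strict - unspecified

print(f"flawed={flawed} too_strict={too_strict} "
      f"unspecified={unspecified} other={other}")
```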

A third concern, sometimes called the scaffolding gap, was that reported SWE-bench Verified scores increasingly reflected the quality of an agent's surrounding scaffolding rather than the model's underlying capability. The same model could vary by 12 percentage points or more between a minimal harness and a heavily tuned agent loop on the same problems, undermining cross-vendor comparison.^[9]

When was SWE-Bench Pro released?

SWE-Bench Pro was announced by Scale AI on September 19, 2025, accompanied by the technical paper "SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?" posted to arXiv as paper 2509.16941 on September 21, 2025 and revised on November 14, 2025. The paper lists Xiang Deng, Jeff Da, and Edwin Pan as lead authors, alongside additional co-authors from Scale AI's research team for a total of 22 listed authors.^[1]^[2]

The announcement framed SWE-Bench Pro as "raising the bar for agentic coding" and explicitly positioned the benchmark as the successor to SWE-bench Verified. Scale AI released the dataset on Hugging Face under the identifier ScaleAI/SWE-bench_Pro and published the evaluation harness as an open-source repository at scaleapi/SWE-bench_Pro-os under the MIT License, including Docker-based execution environments and integrations with both the SWE-agent framework and the lightweight mini-swe-agent scaffold.^[2]^[10]

How is the dataset constructed?

Repository sourcing

A central design goal of SWE-Bench Pro was to construct a benchmark whose problems are unlikely to appear in any major pretraining corpus. Scale AI achieved this through two complementary strategies, each applied to a different portion of the dataset.^[1]^[3]

For the public and held-out portions, Scale AI selected repositories distributed under strong copyleft licenses, principally GPL variants. The rationale, stated in the paper, is that copyleft-licensed code is more likely to have been deliberately excluded from large-scale pretraining corpora because of the obligations that copyleft imposes on derivative works. This is a legal protection: even when the source code is publicly available, model developers face increased downstream risk from training on it.

For the commercial portion, Scale AI entered into formal partnerships with eighteen early-stage startups to license access to proprietary codebases. These commercial repositories are not publicly available at any point, and the benchmark problems derived from them are released only as task identifiers and evaluation results on Scale's leaderboard, with the underlying problem statements, code, and tests held privately by Scale AI. This is a data-access protection: even a determined attempt to contaminate training data could not retrieve the commercial split through web scraping.

The full benchmark draws from 41 repositories, partitioned into 11 public repositories, 12 held-out repositories, and 18 commercial repositories. Each repository contributes between 50 and 100 problem instances, with a strict cap of 100 to prevent any single project from dominating the benchmark. The repositories span "consumer applications, B2B services, and developer tooling platforms," and all are described as "actively maintained professional projects with substantial user bases."^[3]

Task curation pipeline

The task curation pipeline applied to each repository proceeds in three stages, as described in the paper.^[3]

In the sourcing stage, Scale AI identifies consecutive commit pairs that capture an issue's resolution, taking the pre-fix state as the base environment and the diff between the two commits as the reference solution.

In the task description stage, human experts augment the raw issue and patch with a clear problem statement, a list of requirements that specify the expected behavior without prescribing implementation, and interface definitions where relevant. The intent is to remove ambiguity without simplifying the underlying engineering challenge. Every problem must involve at least 10 lines of code change.

In the environments stage, professional engineers construct Docker-based test environments for each problem, define fail-to-pass tests that verify resolution of the original issue, and define pass-to-pass tests that verify that previously working functionality continues to work after the patch is applied. Each test is executed three times during curation to filter out flaky tests, and human reviewers exclude tests deemed "too broad or not relevant" to the issue.
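The repeated-execution filter in the environments stage can be sketched as follows. This is a minimal illustration, assuming each test is a callable that returns True on pass; the function name `stable_tests` and the example test names are hypothetical and do not come from Scale AI's harness, and the rule "keep a test only if all runs agree" is one plausible reading of the flakiness check.

```python
import itertools
from typing import Callable, Dict, List


def stable_tests(tests: Dict[str, Callable[[], bool]], runs: int = 3) -> List[str]:
    """Return the names of tests whose outcome is identical across `runs` executions."""
    stable = []
    for name, run_test in tests.items():
        # Collect the distinct outcomes seen over the repeated executions.
        outcomes = {run_test() for _ in range(runs)}
        if len(outcomes) == 1:
            stable.append(name)
    return stable


# Hypothetical tests: one deterministic, one that alternates pass/fail.
_alternating = itertools.cycle([True, False])
candidate_tests = {
    "test_login": lambda: True,
    "test_cache_race": lambda: next(_alternating),
}

kept = stable_tests(candidate_tests)
print(kept)
```

In this sketch `test_cache_race` is dropped because its three runs disagree; the human review step that removes tests judged "too broad or not relevant" is not modeled.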

Long-horizon tasks

A key contrast between SWE-Bench Pro and earlier benchmarks is the scale of the tasks. The paper reports that reference solutions span an average of 107.4 lines of code across 4.1 files, with more than 100 tasks requiring more than 100 lines of modification. By comparison, the original SWE-bench, while drawn from real repositories, contains many tasks that can be resolved by editing a single function in a single file. SWE-Bench Pro tasks are explicitly described as "long-horizon" problems that "may require hours to days for a professional software engineer to complete."^[1]^[3]

Problem count and language coverage

SWE-Bench Pro contains 1,865 problem instances in total, partitioned as follows:^[1]^[2]^[3]

Public set: 731 instances drawn from 11 GPL-licensed public repositories. This is the only portion of the benchmark for which problem statements, repository code, and tests are openly distributed.
Held-out set: 858 instances drawn from 12 GPL-licensed public repositories whose problems are kept private but whose source code is publicly accessible.
Commercial set: 276 instances drawn from 18 proprietary repositories under formal startup partnerships, kept fully private.
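As a quick consistency check, the split sizes above can be summed to confirm the headline totals and to see how much of the benchmark is withheld. The `Split` record below is a hypothetical structure used only for this illustration; the numbers are those listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    name: str
    instances: int
    repositories: int
    problems_public: bool  # whether problem statements and tests are distributed


SPLITS = [
    Split("public", 731, 11, True),
    Split("held-out", 858, 12, False),
    Split("commercial", 276, 18, False),
]

total_instances = sum(s.instances for s in SPLITS)
total_repos = sum(s.repositories for s in SPLITS)
# Share of instances whose problems are never released publicly.
private_share = sum(s.instances for s in SPLITS if not s.problems_public) / total_instances

print(total_instances, total_repos, f"{private_share:.1%}")
```

Under these figures, roughly three in five instances sit in the held-out and commercial splits.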

The benchmark spans four programming languages: Python, JavaScript, TypeScript, and Go. This is a significant expansion from SWE-bench and SWE-bench Verified, both of which are Python-only. The paper does not publish an exact per-language instance count, but it reports that performance varies substantially by language: models tend to achieve higher resolve rates on Go and Python tasks (sometimes exceeding 30 percent on the public split) and lower, more variable resolve rates on JavaScript and TypeScript tasks, which for some models fall to near zero.^[1]^[3]

Task types include bug fixes, feature requests, security patches, performance optimizations, and user-interface changes, reflecting the breadth of work performed against production codebases.^[11]

How are models evaluated on SWE-Bench Pro?

SWE-Bench Pro is evaluated using a test-based protocol inherited from the SWE-bench lineage. For each task instance, an agent is given the issue's problem statement and access to the pre-fix snapshot of the repository, and is expected to produce a unified-diff patch. The patch is applied to the repository, the test suite is executed inside the instance-specific Docker container, and the task is scored as resolved only if all fail-to-pass tests pass and no pass-to-pass tests regress.^[2]^[10]
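The scoring rule described above reduces to a simple predicate over test outcomes. The sketch below is an illustration of that rule, not the code of the official harness; the function names, the result-mapping shape, and the test names are assumptions, and a test missing from the results is treated as a failure.

```python
from typing import Iterable, Mapping


def is_resolved(
    results: Mapping[str, bool],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """True only if every fail-to-pass test passes and no pass-to-pass test regresses.

    `results` maps test name -> passed, measured after applying the candidate patch.
    """
    fixed = all(results.get(name, False) for name in fail_to_pass)
    intact = all(results.get(name, False) for name in pass_to_pass)
    return fixed and intact


def resolve_rate(outcomes: Iterable[bool]) -> float:
    """Fraction of task instances scored as resolved."""
    outcomes = list(outcomes)
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Hypothetical run: the patch fixes the issue but breaks an existing test.
results = {"test_export_csv": True, "test_export_json": True, "test_login": False}
regressed = is_resolved(results, ["test_export_csv", "test_export_json"], ["test_login"])
clean = is_resolved(results, ["test_export_csv"], ["test_export_json"])
print(regressed, clean, resolve_rate([regressed, clean]))
```

The first call returns False even though both fail-to-pass tests pass, because a pass-to-pass test regressed; partial credit is not awarded.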

The Scale Labs leaderboard distinguishes between two evaluation regimes. The first uses the mini-swe-agent scaffold, a deliberately lightweight harness developed alongside the benchmark to minimize the influence of agent engineering on the score. The second is an uncapped regime, often labeled "Uncapped (turn limit 250)," in which submitters may use their own scaffolding subject only to an upper limit on the number of agent turns per task. Submissions on the public-set leaderboard are reported with 95 percent confidence intervals on the resolve rate, computed across the 731 problem instances.^[6]^[7]
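The text above does not say how the leaderboard computes its intervals. As a sketch, assuming a normal-approximation (Wald) interval for a binomial proportion, the half-width for a resolve rate measured on the 731 public instances can be estimated as below; the function name is hypothetical. For a rate of 59.10 percent this approximation gives a margin close to the 3.56 points listed for the top public-set entry.

```python
import math


def ci_half_width(rate: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a proportion.

    rate: observed resolve rate as a fraction (0..1)
    n:    number of task instances evaluated
    z:    critical value; 1.96 corresponds to a 95 percent interval
    """
    return z * math.sqrt(rate * (1.0 - rate) / n)


margin = ci_half_width(0.591, 731)
print(f"{100 * margin:.2f} percentage points")
```

Because the half-width shrinks with the square root of the instance count, leaderboards evaluated on fewer instances report wider intervals at similar resolve rates.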

In published comparisons, OpenAI argued that the more standardized scaffolding offered by SWE-Bench Pro, in particular the mini-swe-agent regime, helps eliminate "scaffolding-driven" score inflation that had become endemic to SWE-bench Verified reporting.^[4]^[9]

How well do AI models perform on SWE-Bench Pro?

Scale AI publishes two distinct leaderboards: one for the public dataset (731 instances) and one for the private dataset, which combines the held-out and commercial splits. Submissions on the private leaderboard are evaluated by Scale AI on behalf of submitters, because the underlying problems are not distributed publicly.^[6]^[7]

Public dataset leaderboard

The original Scale AI launch reported the following top scores on the 731-instance public set, evaluated with Scale's standard agent scaffolding:^[1]^[11]

Model	Resolve rate
OpenAI GPT-5	23.3%
Claude Opus 4.1	22.7%
Claude Sonnet 4	17.6%
Gemini 2.5 Pro Preview	13.5%
SWE-Smith-32B	6.8%
OpenAI GPT-4o	4.9%
Qwen-3 32B	3.4%

As of mid-2026, the Scale Labs leaderboard shows substantially higher scores reflecting subsequent frontier-model releases, with multiple systems clustered between 40 and 60 percent on the public split. Reported scores include the following, with confidence intervals as published by Scale Labs:^[6]

Model	Resolve rate	Scaffolding
GPT-5.4 (xHigh)	59.10 ± 3.56%	mini-swe-agent
Muse Spark	55.00 ± 3.60%	mini-swe-agent
Claude Opus 4.6 (thinking)	51.90 ± 3.61%	mini-swe-agent
Gemini 3 Pro (thinking)	46.10 ± 3.60%	mini-swe-agent
Claude Opus 4.5	45.89 ± 3.60%	Uncapped (250 turns)
Claude Sonnet 4.5	43.60 ± 3.60%	Uncapped (250 turns)

Private dataset leaderboard

The Scale Labs private-dataset leaderboard, which evaluates models on the 858-instance held-out set combined with the 276-instance commercial set, generally shows lower scores than the public leaderboard, consistent with the dataset's design as a stronger generalization test. Selected published results include:^[7]

Model	Resolve rate	Scaffolding
Claude Opus 4.6 (thinking)	47.10 ± 6.07%	mini-swe-agent
Muse Spark	44.70 ± 6.05%	mini-swe-agent
GPT-5.4 (xHigh)	43.40 ± 6.03%	mini-swe-agent
Gemini 3 Pro (thinking)	32.20 ± 5.69%	mini-swe-agent
GPT-5.2 Codex	27.74 ± 5.09%	Not specified
GPT-5.2	23.81 ± 5.09%	Not specified
Claude Opus 4.5	23.44 ± 5.07%	Not specified
Gemini 3 Pro	17.95 ± 4.78%	Not specified
Claude Opus 4.1	17.75 ± 4.51%	Not specified
OpenAI GPT-5	14.86 ± 4.20%	Not specified
Gemini 2.5 Pro Preview	10.14 ± 3.56%	Not specified
Claude Sonnet 4	9.06 ± 3.39%	Not specified
OpenAI GPT-4o	3.62 ± 2.20%	Not specified

Cross-checking the same model between the public and private splits illustrates the generalization gap: Claude Opus 4.1, for example, declined from 22.7 percent on the public set to 17.75 percent on the private set, consistent with Scale AI's claim that performance on truly unseen, proprietary codebases provides a more conservative estimate of underlying capability.^[7]
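The same comparison can be made for every model that appears in both the launch table and the private leaderboard. The sketch below copies those figures and computes the public-minus-private gap in percentage points. It is only indicative: the private entries for these models list their scaffolding as "Not specified," so the pairs may not be strictly like-for-like.

```python
# (public launch score, private leaderboard score), in percent, from the tables above.
SCORES = {
    "OpenAI GPT-5": (23.3, 14.86),
    "Claude Opus 4.1": (22.7, 17.75),
    "Claude Sonnet 4": (17.6, 9.06),
    "Gemini 2.5 Pro Preview": (13.5, 10.14),
    "OpenAI GPT-4o": (4.9, 3.62),
}

# Gap in percentage points; positive means the model scored lower on the private set.
gaps = {model: round(public - private, 2) for model, (public, private) in SCORES.items()}

for model, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: {gap:.2f} points lower on the private set")
```

Every model in this subset scores lower on the private set, with gaps ranging from just over 1 point to more than 8 points.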

Third-party agent submissions

Beyond raw model evaluations, multiple AI-coding tool vendors have published SWE-Bench Pro results using their proprietary scaffolds running on top of frontier models. In one such submission, the Augment Code team reported that its Auggie agent, running on Claude Opus 4.5, reached 51.80 percent on the public set, compared to 50.21 percent for Cursor using the same model and 49.75 percent for Claude Code. Augment attributed the gap to its proprietary code-retrieval index rather than to differences in the base model.^[11]

How does SWE-Bench Pro differ from SWE-bench and SWE-bench Verified?

Several structural differences distinguish SWE-Bench Pro from earlier benchmarks in the SWE-bench family.^[1]^[3]^[9]

Property	SWE-bench	SWE-bench Verified	SWE-Bench Pro
Release	October 2023	August 2024	September 2025
Instances	2,294	500	1,865
Languages	Python	Python	Python, JavaScript, TypeScript, Go
Repositories	12 (open source)	Subset of SWE-bench	41 (11 public, 12 held-out, 18 commercial)
Avg files changed	1 to 2	1 to 2	4.1
Avg lines changed	Tens	Tens	107
Contamination defense	None explicit	None explicit	Copyleft licensing, private commercial split
Curator	Princeton	OpenAI	Scale AI

The performance gap between SWE-Bench Pro and its predecessors is also substantial. The same frontier models that achieved over 70 percent on SWE-bench Verified at the time of SWE-Bench Pro's launch scored around 23 percent on SWE-Bench Pro's public set, a drop of more than 45 percentage points.^[1]^[9] The gap persisted even as both leaderboards advanced: Claude Opus 4.5, for instance, scored about 80.9 percent on SWE-bench Verified but only 45.89 percent on the SWE-Bench Pro public set under the uncapped 250-turn regime, a difference of roughly 35 points.^[6]^[13] This drop is consistent with the benchmark's stated design intent of restoring measurement headroom, and with the broader observation that SWE-bench Verified had become both saturated and contaminated.

SWE-Bench Pro is distinct from SWE-bench Multimodal, a separate variant introduced by the original SWE-bench authors that incorporates image-based issues, and from SWE-rebench, an independently maintained benchmark of continuously refreshed GitHub issues designed to mitigate contamination through temporal cycling. SWE-Bench Pro's contamination strategy relies on copyleft licensing and proprietary data, rather than on continuous refresh.^[4]^[12]

Reception and criticism

SWE-Bench Pro received substantial attention from both the research community and frontier-model developers in the months following its release. By February 2026, OpenAI announced that it would no longer report SWE-bench Verified scores in connection with new model releases, citing both the contamination and test-quality findings of its audit, and recommended SWE-Bench Pro as the new standard for frontier coding evaluation.^[4]^[5] OpenAI's developer-relations account framed the shift as a response to "model maturity," stating that "the standard for frontier coding evals is changing with model maturity" and that the company would "now recommend reporting SWE-bench Pro" while it works "with the industry to establish stronger coding eval standards."^[5] In the same analysis OpenAI characterized SWE-Bench Pro as imperfect but more robust, noting that its contamination pipeline found cases of contamination that were "significantly rarer and less egregious than SWE-bench Verified, and no model was able to produce a complete verbatim gold patch."^[4]

Anthropic and Google DeepMind also began including SWE-Bench Pro numbers in technical reports for models released in late 2025 and 2026, although several vendors continued to report SWE-bench Verified scores in parallel during a transition period.^[6]^[7]

Critics of SWE-Bench Pro have raised several concerns. The first is that the benchmark, despite its protections, is not immune to contamination. OpenAI's own analysis found cases of contamination in SWE-Bench Pro, though it described them as "significantly rarer and less egregious than SWE-bench Verified," and the long-term effectiveness of the copyleft-license heuristic depends on model developers continuing to exclude such code from their training data.^[4] A second concern is methodological: SWE-Bench Pro, like its predecessors, evaluates models in isolation from the code-review workflows, security constraints, and organizational standards that define real-world software engineering. The benchmark measures whether a single agent can produce a patch that passes a hand-curated test suite, not whether the patch would pass code review at the originating organization.^[9]^[13]

A third concern relates to the scaffolding gap. While SWE-Bench Pro's mini-swe-agent regime is designed to neutralize scaffolding differences, the leaderboard also accepts submissions under the uncapped regime, where vendors can deploy proprietary agent loops. The gap between the two regimes on the same base model can exceed 5 percentage points, suggesting that scaffolding still has measurable influence on reported scores.^[6]^[11]

Finally, the commercial portion of the benchmark, while a powerful contamination defense, introduces a tradeoff. Because the commercial-split problems are not publicly auditable, external researchers cannot inspect them for bias, ambiguity, or test quality, and must trust Scale AI's internal verification process. This is in contrast to SWE-bench Verified, whose problems were filtered by humans but remained publicly inspectable.^[9]

Despite these concerns, by mid-2026 SWE-Bench Pro had emerged as the most widely cited coding-agent benchmark in vendor announcements and in independent reporting, taking on a role analogous to that previously held by SWE-bench Verified and earlier by SWE-bench itself.^[4]^[6] Its existence has also influenced the design of subsequent benchmarks, including SWE-Bench++ and SWE-rebench, both of which adopt some combination of contamination resistance and broader language coverage.^[12]

SWE-Bench Pro sits alongside several other contemporary code and reasoning benchmarks, including LiveCodeBench, Aider Polyglot, BigCodeBench, and broader-purpose evaluations such as MMLU and GPQA, as part of a portfolio of measurements used to characterize frontier model capabilities. Within that portfolio, SWE-Bench Pro is generally cited as the strongest available test of long-horizon, multi-file software engineering against codebases that frontier models are unlikely to have seen during training.

References

[1] Scale AI Research. "SWE-Bench Pro: Raising the Bar for Agentic Coding." Scale AI Blog, September 19, 2025. <scale.com/...swe-bench-pro> (Accessed 2026-06-23).
[2] Deng, Xiang, Jeff Da, Edwin Pan, et al. "SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?" arXiv preprint 2509.16941, submitted September 21, 2025, revised November 14, 2025. <arxiv.org/...2509.16941> (Accessed 2026-06-23).
[3] Deng, Xiang, et al. "SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?" arXiv HTML version, 2509.16941. <arxiv.org/...2509.16941v2> (Accessed 2026-06-23).
[4] OpenAI. "Why SWE-bench Verified no longer measures frontier coding capabilities." OpenAI, February 2026. <openai.com/...o-longer-evaluate-swe-bench-verified> (Accessed 2026-06-23).
[5] OpenAI Developers. "The standard for frontier coding evals is changing with model maturity..." Post on X (Twitter), February 23, 2026. <x.com/...2026002219909427270> (Accessed 2026-06-23).
[6] Scale Labs. "SWE-Bench Pro Leaderboard (Public Dataset)." <labs.scale.com/...swe_bench_pro_public> (Accessed 2026-06-23).
[7] Scale Labs. "SWE-Bench Pro Leaderboard (Private Dataset)." <labs.scale.com/...swe_bench_pro_private> (Accessed 2026-06-23).
[8] Jimenez, Carlos E., John Yang, Alexander Wettig, et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" arXiv preprint 2310.06770, October 2023. <arxiv.org/...2310.06770> (Accessed 2026-06-23).
[9] Tessl. "OpenAI moves beyond SWE-bench Verified as coding benchmarks saturate." Tessl Blog, 2026. <tessl.io/...verified-as-coding-benchmarks-saturate> (Accessed 2026-06-23).
[10] Scale AI. "SWE-bench_Pro-os" (open-source repository). GitHub. <github.com/...SWE-bench_Pro-os> (Accessed 2026-06-23).
[11] Augment Code. "Auggie tops SWE-Bench Pro." Augment Code Blog, 2026. <augmentcode.com/...auggie-tops-swe-bench-pro> (Accessed 2026-06-23).
[12] CodeSOTA. "Is SWE-bench Verified Contaminated? OpenAI Shifts to SWE-bench Pro." CodeSOTA News, 2026. <codesota.com/...swe-bench-contamination-debate> (Accessed 2026-06-23).
[13] Morph Labs. "SWE-Bench Pro Leaderboard (2026): Why 46% Beats 81%." Morph LLM, 2026. <morphllm.com/swe-bench-pro> (Accessed 2026-06-23).
[14] WebProNews. "SWE-Bench Verified's Sudden Fall: How OpenAI Exposed Flaws in AI Coding's Top Metric." WebProNews, 2026. <webpronews.com/...d-flaws-in-ai-codings-top-metric> (Accessed 2026-06-23).
