Multi-SWE-bench

Updated Jun 2, 2026

Multi-SWE-bench is a multilingual benchmark for evaluating how well coding systems built on large language models resolve real-world software issues across seven programming languages. Built by ByteDance and introduced in the April 2025 paper Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving (arXiv:2504.02605), it extends the Python-only SWE-bench to Java, TypeScript, JavaScript, Go, Rust, C, and C++.^[1]^[2] The benchmark contains 1,632 human-validated issue-resolving instances drawn from 39 GitHub repositories, each packaged with an executable test environment, and it reports Python results under the same task formulation alongside the new languages.^[1]^[3] Multi-SWE-bench was accepted to the NeurIPS 2025 Datasets and Benchmarks track.^[3]

Overview

The issue-resolving task asks a system to produce a patch that modifies a codebase to address a natural-language issue report; the patch is then checked against a hidden test suite.^[1] SWE-bench established this format but drew all of its tasks from Python repositories, which left open whether the strong performance of coding agents on Python would carry over to other software ecosystems.^[1] Multi-SWE-bench was created to answer that question directly by holding the task contract fixed while varying the programming language.

Each instance supplies a repository at a specific commit and an issue description, and a candidate solution is judged on whether its generated patch makes the repository pass a set of failing tests without breaking tests that previously passed.^[1] Because the languages span dynamic web scripting, systems programming, and low-level high-performance computing, the benchmark probes capabilities that Python-only evaluation cannot reach, including handling of static type systems, manual memory management, and complex build pipelines.^[1] A central empirical finding of the paper is that leading models resolve Python issues at far higher rates than issues in any other language, which the authors read as evidence that headline SWE-bench scores overstate general software-engineering ability.^[1]

The project ships as an open dataset, an evaluation harness, and a public leaderboard, and it is paired with a companion reinforcement-learning data effort called Multi-SWE-RL.^[1]^[3] The code is released under the Apache License 2.0 and the dataset under a CC0 dedication, subject to the upstream licenses of the adapted open-source projects.^[3]^[4]

Relationship to SWE-bench

Multi-SWE-bench belongs to the broader family of benchmarks descended from SWE-bench. The original SWE-bench contains 2,294 issue-and-pull-request pairs from 12 Python libraries, and its human-filtered subset SWE-bench Verified reduces that to 500 instances that annotators judged solvable.^[1] Multi-SWE-bench reuses those 500 verified Python instances as its Python column so that cross-language comparisons share a common reference point, and it adds 1,632 newly curated instances across the other seven languages.^[1] It is distinct from sibling datasets that vary other dimensions of the task, such as the visual SWE-bench Multimodal and the enterprise-oriented SWE-Bench Pro.

The construction methodology deliberately tracks the standards used for SWE-bench Verified, including dual annotation and cross-review, so that the ground truth meets a comparable bar.^[1] Where SWE-bench labels difficulty informally, Multi-SWE-bench introduces a time-based difficulty scheme described below, and it reports that its task mix is harder: 77.1% of Multi-SWE-bench instances are medium or hard, against 61.2% for SWE-bench Verified by the same measure.^[1]

Languages and dataset

The benchmark draws its 1,632 non-Python instances from 39 repositories that each have more than 500 GitHub stars and at least six months of active maintenance.^[1] The table below gives the composition reported in the paper, with the 500 inherited Python instances shown for context.^[1]

Language	Instances	Repositories	Representative repositories
Python (inherited)	500	12	from SWE-bench Verified ^[1]
Java	128	9	alibaba/fastjson2, elastic/logstash, mockito/mockito ^[1]
TypeScript	224	3	mui/material-ui, vuejs/core, darkreader/darkreader ^[1]
JavaScript	356	6	sveltejs/svelte, iamkun/dayjs, axios/axios ^[1]
Go	428	3	cli/cli, grpc/grpc-go, zeromicro/go-zero ^[1]
Rust	239	10	clap-rs/clap, tokio-rs/tokio, BurntSushi/ripgrep ^[1]
C	128	3	facebook/zstd, ponylang/ponyc, jqlang/jq ^[1]
C++	129	5	nlohmann/json, fmtlib/fmt, simdjson/simdjson ^[1]
Total (seven new languages)	1,632	39	^[1]

The repositories vary widely in scale, from 24 files and 6,700 lines of code to 27,632 files and roughly 698,600 lines.^[1] Issue descriptions and patches also differ by language. Java and Rust issues tend to be long and context-dependent, while JavaScript, Go, and C issues are usually short and localized, and Rust and C++ patches more often require large edits spanning hundreds of lines and several files.^[1]

The authors stratify every instance by difficulty using the human-estimated time to resolve the issue. They record four time buckets (at most 15 minutes, 15 minutes to 1 hour, 1 to 4 hours, and at least 4 hours), then collapse them into three tiers: easy (at most 15 minutes), medium (15 minutes to 1 hour), and hard (at least 1 hour).^[1] The difficulty distribution per language is summarized below.^[1]

Language	Easy	Medium	Hard	Total
Python	194	261	45	500
Java	27	65	36	128
TypeScript	72	88	64	224
JavaScript	10	105	241	356
Go	141	153	134	428
Rust	66	126	47	239
C	30	54	44	128
C++	30	55	44	129
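
The tier collapse and the per-language difficulty mix can be reproduced with a short script. The sketch below is illustrative only: the bucket labels and function names are hypothetical, and the counts are copied from the table above.

```python
# Hypothetical labels for the four annotated time buckets.
TIER_BY_BUCKET = {
    "<=15min": "easy",
    "15min-1h": "medium",
    "1h-4h": "hard",
    ">=4h": "hard",
}


def difficulty_tier(bucket: str) -> str:
    """Collapse one of the four time buckets into easy, medium, or hard."""
    return TIER_BY_BUCKET[bucket]


# (easy, medium, hard) counts per language, copied from the table above.
DISTRIBUTION = {
    "Python": (194, 261, 45),
    "Java": (27, 65, 36),
    "TypeScript": (72, 88, 64),
    "JavaScript": (10, 105, 241),
    "Go": (141, 153, 134),
    "Rust": (66, 126, 47),
    "C": (30, 54, 44),
    "C++": (30, 55, 44),
}


def medium_or_hard_share(counts: tuple) -> float:
    """Percentage of instances rated medium or hard."""
    easy, medium, hard = counts
    return 100 * (medium + hard) / (easy + medium + hard)


for language, counts in DISTRIBUTION.items():
    print(f"{language:<11}{medium_or_hard_share(counts):5.1f}%")
```

For the Python row this gives 306 of 500 instances, which matches the 61.2% figure quoted for SWE-bench Verified.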

Construction pipeline

The dataset was built with a five-phase pipeline.^[1] In the first phase, repository selection, the authors curated high-quality GitHub repositories for each target language using thresholds on stars and maintenance activity and a requirement that the project include continuous-integration configuration, so that tasks would be buildable and testable.^[1] In the second phase, pull-request crawling, they collected every pull request from the selected repositories and kept only those that linked a GitHub issue, modified test files, and were merged into the main branch.^[1]
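
The three crawling criteria amount to a simple predicate over pull-request metadata. The following is a minimal sketch, not the authors' crawler: the `PullRequest` record, its field names, and the path heuristic for spotting test files are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    """Hypothetical record of the metadata the filter needs."""
    number: int
    linked_issues: list = field(default_factory=list)
    changed_files: list = field(default_factory=list)
    merged_into_main: bool = False


def touches_tests(pr: PullRequest) -> bool:
    # Assumption: a changed path containing "test" counts as a test file.
    return any("test" in path.lower() for path in pr.changed_files)


def keep_pull_request(pr: PullRequest) -> bool:
    """Keep a PR only if it links an issue, modifies tests, and was merged."""
    return bool(pr.linked_issues) and touches_tests(pr) and pr.merged_into_main


candidates = [
    PullRequest(1, [10], ["src/parser.rs", "tests/parser_test.rs"], True),
    PullRequest(2, [], ["src/lib.rs", "tests/lib_test.rs"], True),   # no linked issue
    PullRequest(3, [11], ["src/lib.rs"], True),                      # no test change
    PullRequest(4, [12], ["tests/io_test.rs"], False),               # not merged
]
kept = [pr.number for pr in candidates if keep_pull_request(pr)]
print(kept)
```

Only the first hypothetical pull request satisfies all three conditions.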

The third phase, environment determination, built a Dockerized execution environment for each candidate by extracting repository-common and pull-request-specific dependencies from CI configuration and documentation, and discarded any candidate whose environment could not be made to launch.^[1] The fourth phase, pull-request filtering, ran each repository's test suite under three configurations, the base code, the code with only the test patch applied, and the code with both the fix and test patches, then retained instances showing at least one test that transitioned from failing to passing while rejecting instances with regressions or abnormal transitions.^[1] After this automated funnel, 2,456 candidate instances across 39 repositories remained.^[1]
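
The filtering rule in the fourth phase can be sketched as a comparison of per-test statuses across the three runs. This is an illustrative reading rather than the paper's implementation: the status strings and function names are hypothetical, and treating a test that is missing from the final run as a regression is an assumption standing in for the "abnormal transitions" the source does not spell out.

```python
PASS, FAIL = "pass", "fail"


def fail_to_pass(test_run: dict, fix_run: dict) -> set:
    """Tests that fail with only the test patch and pass once the fix is applied."""
    return {name for name, status in test_run.items()
            if status == FAIL and fix_run.get(name) == PASS}


def regressions(base_run: dict, test_run: dict, fix_run: dict) -> set:
    """Tests that passed in an earlier run but do not pass in the final run."""
    previously_passing = {name
                          for run in (base_run, test_run)
                          for name, status in run.items()
                          if status == PASS}
    # Assumption: a test absent from fix_run is treated as not passing.
    return {name for name in previously_passing if fix_run.get(name) != PASS}


def keep_instance(base_run: dict, test_run: dict, fix_run: dict) -> bool:
    """Retain an instance with at least one fail-to-pass test and no regressions."""
    return bool(fail_to_pass(test_run, fix_run)) and not regressions(base_run, test_run, fix_run)


base = {"test_parse": PASS}
with_tests = {"test_parse": PASS, "test_issue": FAIL}
with_fix = {"test_parse": PASS, "test_issue": PASS}
print(keep_instance(base, with_tests, with_fix))
```

An instance whose fix run breaks `test_parse` would be rejected even though `test_issue` transitions from failing to passing.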

The fifth phase, manual verification, was carried out by 68 recruited annotators screened for at least two years of experience in the target language and a relevant degree.^[1] Each instance was labeled independently by two annotators and then cross-reviewed into a single agreed label, and a separate quality team of 14 engineers produced reference answers and confirmed that annotations for each language reached at least 80% accuracy.^[1] Applying the verification questionnaire criteria reduced the 2,456 candidates to the final 1,632 high-quality instances.^[1] The annotation results were released publicly to support transparency.^[1]
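
The 80% quality bar can be pictured as an agreement check between agreed annotator labels and the quality team's reference answers. The source does not describe how accuracy was computed, so the sketch below is an assumption: it uses plain label agreement over the instances that have a reference answer, and all names and sample labels are hypothetical.

```python
def annotation_accuracy(labels: dict, reference: dict) -> float:
    """Fraction of reference-checked instances whose label matches the reference."""
    shared = [instance for instance in reference if instance in labels]
    if not shared:
        raise ValueError("no instances have both a label and a reference answer")
    matches = sum(labels[instance] == reference[instance] for instance in shared)
    return matches / len(shared)


def meets_quality_bar(labels: dict, reference: dict, threshold: float = 0.80) -> bool:
    """True when label accuracy reaches the threshold (80% in the article)."""
    return annotation_accuracy(labels, reference) >= threshold


# Hypothetical sample: five spot-checked instances for one language.
labels = {"i1": "keep", "i2": "keep", "i3": "drop", "i4": "keep", "i5": "drop"}
reference = {"i1": "keep", "i2": "keep", "i3": "drop", "i4": "drop", "i5": "drop"}
print(annotation_accuracy(labels, reference), meets_quality_bar(labels, reference))
```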

Evaluation methodology

Multi-SWE-bench uses execution-based grading rather than text similarity. A patch resolves an instance only if it makes the designated failing tests pass while leaving previously passing tests green, and the primary metric is the percentage of instances resolved.^[1] Evaluation runs through a Docker-based harness invoked from a configuration file, which produces a final report listing resolved and unresolved instances.^[3]
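
The grading rule and the primary metric reduce to a few lines. The sketch below assumes a hypothetical result format, a mapping from test name to `"pass"` or `"fail"`, and is not the harness's actual interface.

```python
def is_resolved(fail_to_pass: list, pass_to_pass: list, results: dict) -> bool:
    """A patch resolves an instance only if every designated failing test now
    passes and every previously passing test still passes."""
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test missing from the results is treated as not passing (assumption).
    return all(results.get(test) == "pass" for test in required)


def resolved_rate(outcomes: list) -> float:
    """Primary metric: percentage of instances resolved."""
    return 100 * sum(outcomes) / len(outcomes)


# Hypothetical instance with one target test and two guard tests.
target = ["test_issue_1234"]
guards = ["test_parse", "test_render"]

good_patch = {"test_issue_1234": "pass", "test_parse": "pass", "test_render": "pass"}
breaking_patch = {"test_issue_1234": "pass", "test_parse": "fail", "test_render": "pass"}

outcomes = [
    is_resolved(target, guards, good_patch),
    is_resolved(target, guards, breaking_patch),
]
print(outcomes, resolved_rate(outcomes))
```

The second patch fixes the target test but breaks a guard test, so it does not count as resolved.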

The paper evaluates three issue-resolving methods, each originally designed for Python and re-engineered for the multilingual setting.^[1] Agentless, a fixed multi-stage workflow of fault localization and code repair, was adapted as MagentLess, with all prompts revised for the new languages, file skeletons replaced by full file content, function and class extraction reimplemented with Tree-sitter, and the candidate-selection stage removed.^[1] SWE-agent, an agent that acts through a predefined agent-computer interface, was adapted as MSWE-agent with revised prompts, truncated observations, and fixes for language-specific commands and compiled-artifact handling.^[1] OpenHands with CodeAct v2.1 was adapted as MopenHands, which revised prompts and patched diff-rendering bugs that had made patches inapplicable in languages such as Go.^[1]

The nine models tested were GPT-4o (gpt-4o-2024-11-20), OpenAI o1 (o1-2024-12-17), OpenAI o3-mini-high, Claude 3.5 Sonnet, Claude 3.7 Sonnet, DeepSeek-V3, DeepSeek-R1, Qwen2.5-72B-Instruct, and Doubao-1.5-pro.^[1]

Notable results

The paper reports resolved rates for every model-and-method combination on each language in its Table 4. The headline pattern is a steep drop from Python to all other languages, with Java a distant second.^[1] The paper characterizes the web languages TypeScript and JavaScript as the weakest group, although by best single result JavaScript ranks last and Go and C also fall below TypeScript. The best result observed for each language, together with the model and method that produced it, is shown below.^[1]

Language	Best resolved rate	Model	Method
Python	52.20%	Claude 3.7 Sonnet	MopenHands ^[1]
Java	23.44%	Claude 3.7 Sonnet	MSWE-agent ^[1]
Rust	15.90%	Claude 3.7 Sonnet	MopenHands ^[1]
C++	14.73%	Claude 3.7 Sonnet	MopenHands ^[1]
TypeScript	11.61%	Claude 3.5 Sonnet	MopenHands ^[1]
C	8.59%	Claude 3.7 Sonnet	MSWE-agent / MopenHands ^[1]
Go	7.48%	Claude 3.7 Sonnet	MopenHands ^[1]
JavaScript	5.06%	OpenAI o1 / Claude 3.7 Sonnet	MagentLess / MopenHands ^[1]
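
Because the resolved rate is a simple ratio, each percentage can be translated back into an approximate number of solved instances. The counts produced below are inferred from the rates and the per-language instance totals given in this article; they are not figures stated by the source, and the function name is hypothetical.

```python
# Instance totals per language, from the dataset composition table.
INSTANCES = {
    "Python": 500, "Java": 128, "Rust": 239, "C++": 129,
    "TypeScript": 224, "C": 128, "Go": 428, "JavaScript": 356,
}

# Best resolved rate per language (percent), from the table above.
BEST_RATE = {
    "Python": 52.20, "Java": 23.44, "Rust": 15.90, "C++": 14.73,
    "TypeScript": 11.61, "C": 8.59, "Go": 7.48, "JavaScript": 5.06,
}


def implied_resolved(rate_percent: float, total: int) -> int:
    """Nearest whole number of instances consistent with a reported rate."""
    return round(rate_percent / 100 * total)


for language, rate in BEST_RATE.items():
    total = INSTANCES[language]
    print(f"{language:<11}{implied_resolved(rate, total):>4} of {total}")
```

Read this way, the best Java result corresponds to about 30 of 128 instances and the best JavaScript result to about 18 of 356.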

To illustrate how the methods compare on a single model, the table below lists Claude 3.7 Sonnet's resolved rate across all eight languages under each of the three adapted pipelines.^[1]

Method	Python	Java	TS	JS	Go	Rust	C	C++
MagentLess	44.60%	14.06%	3.57%	1.97%	5.84%	5.44%	2.34%	3.10%
MSWE-agent	45.80%	23.44%	11.16%	4.78%	5.37%	6.69%	8.59%	11.63%
MopenHands	52.20%	21.88%	2.23%	5.06%	7.48%	15.90%	8.59%	14.73%
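
A small script makes the method comparison easier to read by picking the top pipeline for each language, ties included. The rates are copied from the table above; the helper name is hypothetical.

```python
LANGUAGES = ["Python", "Java", "TS", "JS", "Go", "Rust", "C", "C++"]

# Claude 3.7 Sonnet resolved rates (percent), one list per adapted method.
RATES = {
    "MagentLess": [44.60, 14.06, 3.57, 1.97, 5.84, 5.44, 2.34, 3.10],
    "MSWE-agent": [45.80, 23.44, 11.16, 4.78, 5.37, 6.69, 8.59, 11.63],
    "MopenHands": [52.20, 21.88, 2.23, 5.06, 7.48, 15.90, 8.59, 14.73],
}


def best_methods(language: str) -> tuple:
    """Return the top rate for a language and every method that reaches it."""
    index = LANGUAGES.index(language)
    top = max(rates[index] for rates in RATES.values())
    winners = [method for method, rates in RATES.items() if rates[index] == top]
    return top, winners


for language in LANGUAGES:
    top, winners = best_methods(language)
    print(f"{language:<7}{top:6.2f}%  {' / '.join(winners)}")
```

On these figures MopenHands leads on five languages, MSWE-agent leads on Java and TypeScript, the two agent-based methods tie on C, and MagentLess does not lead on any language.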

Performance falls sharply with difficulty. On hard instances, those a human would need an hour or more to fix, resolved rates approach zero for most models and languages, which the authors summarize as agents being effective mainly on issues solvable in under 15 minutes.^[1] OpenAI o1, OpenAI o3-mini-high, Claude 3.5 Sonnet, and Claude 3.7 Sonnet are the strongest models overall, while Qwen2.5-72B-Instruct and Doubao-1.5-pro lag, particularly on hard tasks and on C and C++.^[1] The paper also notes that resolved rates tend to rise with longer issue descriptions but drop steeply once fix patches exceed about 600 tokens or touch multiple files.^[1]^[5]

Multi-SWE-RL and the community effort

Alongside the benchmark, the authors launched Multi-SWE-RL, an open-source community aimed at building large-scale reinforcement learning environments for software-engineering tasks.^[1] The initial release contains 4,723 issue-resolving instances spanning 76 open-source repositories and the same seven languages, each wrapped in a fully containerized execution environment for reproducible, plug-and-play training of AI agents.^[1] These instances were produced with the same pipeline as Multi-SWE-bench but without the manual verification stage, trading per-instance vetting for scale.^[1]

The authors frame the creation of realistic, interactive environments as a major bottleneck for scaling reinforcement learning in real-world software, and they position Multi-SWE-RL as a first step toward removing it.^[1] They committed to a rolling update schedule with new versions roughly every three months, covering additional benchmark languages, new RL data, reported RL trial results, and open-sourced models, and they published contribution guidelines and an incentive plan that credits community contributors as authors.^[1]

Versions

Several variants of the benchmark have been released to support different use cases.^[3] Multi-SWE-bench mini, released in April 2025, provides 400 instances covering eight languages for lighter-weight evaluation.^[3] Multi-SWE-bench flash, released in July 2025, offers 300 carefully selected multilingual instances.^[3] The full 1,632-instance benchmark, the mini and flash subsets, and the Multi-SWE-RL dataset are distributed through the project's GitHub repository and Hugging Face dataset pages.^[3]^[4]

Significance

Multi-SWE-bench broadened automated issue-resolving evaluation from a single language to a representative cross-section of modern software, and in doing so it provided concrete evidence that coding agents do not generalize cleanly beyond Python.^[1] By grouping languages into high-level general-purpose programming, web development, systems programming, and low-level high-performance computing, the paper documented a consistent performance hierarchy in which web languages are hardest and Python easiest, giving model builders a clearer map of where capability gaps lie.^[1] The time-based difficulty annotation also offers a more interpretable axis than raw resolution rate, tying machine performance to the human effort an issue would require.^[1]

The benchmark has been used as a reference point in subsequent multilingual coding evaluations and has contributed to a wider shift toward language-agnostic agent design and toward reinforcement learning on executable software environments, the gap that Multi-SWE-RL targets.^[1]^[3] Its acceptance to the NeurIPS 2025 Datasets and Benchmarks track reflects peer recognition of the benchmark as a resource for measuring cross-language code repair.^[3]

Limitations

Multi-SWE-bench inherits the structural limitations of execution-based issue resolving. Test-based grading can credit a patch that satisfies the hidden tests without matching developer intent, and it can reject a functionally valid fix the tests do not anticipate, so resolved rate is an imperfect proxy for correctness.^[1] Coverage is bounded by the 39 selected repositories, which were chosen for popularity, maintenance, and CI support and therefore underrepresent niche projects, proprietary codebases, and languages outside the seven covered.^[1]

The authors also caution that the three evaluation methods were originally optimized for Python, which introduces a bias that likely understates achievable performance in the other languages until method-level adaptation improves.^[1] Reproducing results requires Docker environments per repository, and because the benchmark is an actively maintained leaderboard, comparisons across entries can be affected by differences in agent scaffolding, model versions, and submission dates.^[1]^[3] Finally, the much larger Multi-SWE-RL dataset omits the manual verification that underpins the benchmark, so its instances carry weaker quality guarantees and are intended for training rather than evaluation.^[1]

References
