SWE-bench Multilingual
Last reviewed
Sources
7 citations
Review status
Source-backed
Revision
v2 · 1,690 words
SWE-bench Multilingual is an AI benchmark of 300 real-world software bug-fixing tasks drawn from 42 open-source repositories across nine programming languages, built by members of the SWE-bench team to measure whether AI agents can resolve GitHub issues in languages other than Python. Released on May 6, 2025, it is the official multilingual extension of SWE-bench, the widely used software-engineering agent benchmark, and it spans C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, and Rust [1][2][3]. In its own reference evaluation, Claude 3.7 Sonnet resolved 43 percent of SWE-bench Multilingual tasks versus 63 percent on SWE-bench Verified, quantifying how much harder non-Python issue resolution remains for a leading 2025 coding model [2][3].
Each task mirrors the structure of the original SWE-bench: an AI code generation agent is given a real GitHub issue and a snapshot of the repository before the fix, and must produce a code patch that resolves the issue. Success is judged not by string matching but by execution: the patched repository must pass a set of hidden unit tests. SWE-bench Multilingual exists primarily to expose the language-generalization gap that the largely Python-only SWE-bench leaves unmeasured [2][3].
SWE-bench, introduced in 2023 by researchers from Princeton University and the University of Chicago, evaluates agents on 2,294 real GitHub issues drawn from 12 popular Python repositories. SWE-bench Verified, a 500-task human-validated subset released in 2024 by OpenAI in collaboration with the SWE-bench authors, became the de facto industry standard for reporting autonomous coding ability [1][2].
A structural limitation runs through this lineage: nearly all of the tasks come from Python repositories. As a result, a high SWE-bench Verified score demonstrates that an agent can navigate Python codebases, install Python dependencies, and run Python test suites, but it says little about whether the same agent can work in a statically typed language like Java or Go, a systems language like C or Rust, or a web language like JavaScript, TypeScript, or PHP. These ecosystems differ substantially in build tooling, package management, type systems, and testing conventions. Because real-world software engineering spans dozens of languages, a benchmark confined to one language risks overstating how broadly a coding agent generalizes. The SWE-bench team summarized the motivation plainly: "LLMs are more proficient in Python than other languages." SWE-bench Multilingual was created specifically to close this measurement gap [2][3].
SWE-bench Multilingual was developed by Kabir Khandpur, Kilian Lieret, Carlos E. Jimenez, Ofir Press, and John Yang, with Khandpur leading the effort in collaboration with the broader SWE-bench team. Several of these contributors are core authors and maintainers of the original SWE-bench, which is why the project is considered an official member of the SWE-bench family rather than a third-party fork [2][3].
The dataset was assembled with the same execution-based methodology as SWE-bench, adapted to a multilingual setting through a four-stage pipeline [2][3].
Scoring uses two test categories inherited from SWE-bench: fail-to-pass (F2P) tests, which confirm that the specific issue is actually fixed, and pass-to-pass (P2P) tests, which confirm that existing functionality is not broken by the change. The dataset was deliberately kept small, at 300 tasks, so that the full benchmark is cheap and fast to run while remaining high quality. The tasks are not trivial: the human gold patch modifies a median of 10 lines of code and up to 110 lines at the 95th percentile. The underlying issues are dated from 2021 to 2025, with the largest concentration from 2024. The dataset is published by the SWE-bench organization on Hugging Face as SWE-bench/SWE-bench_Multilingual, with sample repositories including Apache Druid and Lucene (Java) and Ruff (Rust) [1][2][4].
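The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the official harness: the `TaskTests` container, the `is_resolved` and `resolution_rate` helpers, and the test names are hypothetical, and treating a test with no recorded result as failed is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskTests:
    """Test names for one task (hypothetical container, not the dataset schema)."""
    fail_to_pass: List[str]  # tests that fail before the fix and must pass after it
    pass_to_pass: List[str]  # tests that pass before the fix and must keep passing


def is_resolved(tests: TaskTests, results: Dict[str, bool]) -> bool:
    """Return True only if every F2P and every P2P test passed on the patched repo.

    `results` maps test name -> passed. A test missing from `results`
    counts as failed (an assumption of this sketch).
    """
    required = tests.fail_to_pass + tests.pass_to_pass
    return all(results.get(name, False) for name in required)


def resolution_rate(outcomes: List[bool]) -> float:
    """Percentage of tasks whose patch was judged resolved."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)
```

Under this rule, a patch that fixes the reported issue but breaks an existing test is not counted as resolved, which is why both test categories are needed.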
The table below summarizes the dataset.
| Property | SWE-bench Multilingual |
|---|---|
| Number of tasks | 300 |
| Repositories | 42 |
| Languages | 9 (C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, Rust) |
| Task type | Real GitHub issue resolution (patch generation) |
| Verification | Execution-based, F2P and P2P unit tests in Docker |
| Released | May 6, 2025 |
| Authors | Khandpur, Lieret, Jimenez, Press, Yang (SWE-bench team) |
| Distribution | Hugging Face (SWE-bench/SWE-bench_Multilingual), SWE-bench repo |
SWE-bench Multilingual has become one of the standard coding benchmarks that model developers and analysts cite when assessing broad, cross-language software-engineering ability, alongside SWE-bench Verified. It is distributed through the official SWE-bench repository and Hugging Face, and its results sit within the SWE-bench leaderboard family [1][2].
The benchmark's headline reference numbers come from the SWE-bench team's own evaluation rather than from a model vendor. Running Claude 3.7 Sonnet (released by Anthropic in February 2025) inside the SWE-agent scaffold with a cost limit of about USD 2.50 per task, the SWE-bench team measured a 43 percent resolution rate on SWE-bench Multilingual against 63 percent on SWE-bench Verified for a comparable setup; this is the figure most widely repeated for the benchmark [2][3]. Anthropic's own Claude 3.7 Sonnet announcement reported a SWE-bench Verified score but not a SWE-bench Multilingual one, so the Multilingual numbers should be attributed to the SWE-bench reference run, not to Anthropic [5]. Model comparisons through 2025 and into 2026 continued to use the benchmark to differentiate models on cross-language consistency. Vendors and independent analysts reported per-language breakdowns rather than a single aggregate, because language-by-language variation is the point of the benchmark [6].
The headline finding from the benchmark's own reference evaluation is a clear language-generalization gap. As the SWE-bench team put it, "Claude 3.7 Sonnet achieves a 43% resolution rate on SWE-bench Multilingual, compared to 63% on SWE-bench Verified, highlighting room for improvement in languages other than Python." Measured with the SWE-agent scaffold and a cost limit of about USD 2.50 per task, the roughly 20-point drop quantifies how much harder non-Python issue resolution was for a leading 2025 coding model [2][3].
Performance also varied widely by language. In the reference run, Rust had the highest resolution rate at 58.14 percent and Java was also strong at 53.49 percent, followed by PHP at 48.84 percent and Ruby at 43.18 percent. The hardest categories were JavaScript/TypeScript at 34.88 percent, Go at 30.95 percent, and C/C++ at 28.57 percent. This spread shows that "multilingual" coding ability is not uniform: an agent can be far more effective in some ecosystems than others, which a single aggregate number would obscure [2][3].
| Language | Claude 3.7 Sonnet resolution rate (SWE-agent, ~USD 2.50 limit) |
|---|---|
| Rust | 58.14% |
| Java | 53.49% |
| PHP | 48.84% |
| Ruby | 43.18% |
| JavaScript / TypeScript | 34.88% |
| Go | 30.95% |
| C / C++ | 28.57% |
| Overall | 43% |
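As a consistency check, the per-language rates in the table above can be recombined into the overall figure. The sketch below is illustrative only: the per-language task counts in `TASKS` are not stated in this article and are an assumption, chosen because they are whole numbers that sum to 300 and reproduce each listed percentage. The helper names are hypothetical.

```python
# Per-language resolution rates from the table above, in percent.
RATES = {
    "Rust": 58.14,
    "Java": 53.49,
    "PHP": 48.84,
    "Ruby": 43.18,
    "JavaScript / TypeScript": 34.88,
    "Go": 30.95,
    "C / C++": 28.57,
}

# ASSUMPTION: task counts per language are inferred, not taken from the article.
# They were picked so that each rate corresponds to a whole number of resolved
# tasks and the counts total 300.
TASKS = {
    "Rust": 43,
    "Java": 43,
    "PHP": 43,
    "Ruby": 44,
    "JavaScript / TypeScript": 43,
    "Go": 42,
    "C / C++": 42,
}


def resolved_counts(rates, tasks):
    """Convert percentage rates back into whole numbers of resolved tasks."""
    return {lang: round(rates[lang] * tasks[lang] / 100) for lang in rates}


def micro_average(rates, tasks):
    """Overall rate: total resolved tasks divided by total tasks, in percent."""
    resolved = resolved_counts(rates, tasks)
    return 100 * sum(resolved.values()) / sum(tasks.values())


def macro_average(rates):
    """Unweighted mean of the per-language rates, in percent."""
    return sum(rates.values()) / len(rates)


def spread(rates):
    """Gap in percentage points between the best and worst language."""
    return max(rates.values()) - min(rates.values())


if __name__ == "__main__":
    print(f"micro average: {micro_average(RATES, TASKS):.2f}%")
    print(f"macro average: {macro_average(RATES):.2f}%")
    print(f"best-to-worst spread: {spread(RATES):.2f} points")
```

Under these assumed counts, both averages round to the 43 percent overall figure, while the best-to-worst spread is close to 30 percentage points. That spread is the information a single aggregate hides.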
As frontier models advanced through 2025 and 2026, reported scores rose and the gap relative to Python narrowed, but cross-language variation persisted as a discriminating signal between top models. Public comparisons in this period highlighted, for example, that some leading models led on most but not all of the languages tested, underscoring that no model had fully closed the per-language gap [6]. Because reported figures depend heavily on the agent scaffold, cost or step budget, and model version, scores should always be attributed to a specific configuration rather than treated as a single canonical number for a model.
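One way to follow this advice in practice is to store each score together with the configuration that produced it and to refuse comparisons across mismatched setups. The record type below is a hypothetical sketch, not a format used by the SWE-bench project. The field names are assumptions, as is giving the Verified run the same cost limit, since the source describes it only as a comparable setup.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReportedScore:
    """A benchmark score plus the configuration it belongs to (hypothetical schema)."""
    benchmark: str
    model: str
    scaffold: str
    cost_limit_usd: Optional[float]  # per-task budget; None if not reported
    resolved_pct: float
    source: str

    def same_setup(self, other: "ReportedScore") -> bool:
        """True when model, scaffold, and budget all match."""
        return (self.model, self.scaffold, self.cost_limit_usd) == (
            other.model, other.scaffold, other.cost_limit_usd)


def gap_points(a: ReportedScore, b: ReportedScore) -> float:
    """Percentage-point gap between two scores, only for matching setups."""
    if not a.same_setup(b):
        raise ValueError("scores come from different configurations")
    return a.resolved_pct - b.resolved_pct


# Reference-run figures quoted in this article.
verified = ReportedScore(
    benchmark="SWE-bench Verified",
    model="Claude 3.7 Sonnet",
    scaffold="SWE-agent",
    cost_limit_usd=2.50,  # ASSUMPTION: same limit as the Multilingual run
    resolved_pct=63.0,
    source="SWE-bench team reference run",
)
multilingual = ReportedScore(
    benchmark="SWE-bench Multilingual",
    model="Claude 3.7 Sonnet",
    scaffold="SWE-agent",
    cost_limit_usd=2.50,
    resolved_pct=43.0,
    source="SWE-bench team reference run",
)

if __name__ == "__main__":
    print(gap_points(verified, multilingual))
```

Keeping the `source` field separate from the model name also makes it harder to misattribute a third-party reference run to the model's vendor.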
SWE-bench Multilingual is one of several extensions of the original SWE-bench, each targeting a different axis of software-engineering capability.
A separate project that is frequently confused with it is Multi-SWE-bench, released by ByteDance's Seed (Doubao) team on April 14, 2025. Multi-SWE-bench is an independent multilingual issue-resolution benchmark, not an official part of the SWE-bench project. It contains 1,632 instances across seven languages (Java, TypeScript, JavaScript, Go, Rust, C, and C++), grades each instance as Easy, Medium, or Hard, and ships alongside Multi-SWE-RL, a larger pool of reproducible environments for reinforcement-learning training [7]. The two efforts share a motivation, namely that issue resolution outside Python is underexplored, but they differ in origin, size, language set, and design. SWE-bench Multilingual is the smaller, 300-task official dataset built by the SWE-bench team for fast, high-quality cross-language evaluation, while Multi-SWE-bench is ByteDance's larger, independently constructed corpus. Scores from the two benchmarks should not be compared as if they measured the same thing.