SWE-bench Multilingual
Last reviewed
Jun 8, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,428 words
SWE-bench Multilingual is an AI benchmark that measures the ability of AI agents to resolve real-world software bugs and feature requests across programming languages other than Python. It is an official extension of SWE-bench, the widely used software-engineering agent benchmark, and was built by members of the SWE-bench team. The dataset consists of 300 curated issue-resolution tasks drawn from 42 open-source repositories spanning nine programming languages: C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, and Rust [1][2].
Each task mirrors the structure of the original SWE-bench: an AI code generation agent is given a real GitHub issue and a snapshot of the repository before the fix, and must produce a code patch that resolves the issue. Success is judged not by string matching but by execution: the patched repository must pass a set of hidden unit tests. SWE-bench Multilingual was released on May 6, 2025, and exists primarily to expose the language-generalization gap that the largely Python-only SWE-bench leaves unmeasured [2][3].
SWE-bench, introduced in 2023 by researchers from Princeton University and the University of Chicago, evaluates agents on 2,294 real GitHub issues drawn from 12 popular Python repositories. SWE-bench Verified, a 500-task human-validated subset released in 2024 by OpenAI in collaboration with the SWE-bench authors, became the de facto industry standard for reporting autonomous coding ability [1][2].
A structural limitation runs through this lineage: the tasks are almost entirely written in Python. As a result, a high SWE-bench Verified score demonstrates that an agent can navigate Python codebases, install Python dependencies, and run Python test suites, but it says little about whether the same agent can work in a statically typed language like Java or Go, a systems language like C or Rust, or a web language like JavaScript, TypeScript, or PHP. These ecosystems differ substantially in build tooling, package management, type systems, and testing conventions. Because real-world software engineering spans dozens of languages, a benchmark confined to one language risks overstating how broadly a coding agent generalizes. SWE-bench Multilingual was created specifically to close this measurement gap [2][3].
SWE-bench Multilingual was developed by Kabir Khandpur, Kilian Lieret, Carlos E. Jimenez, Ofir Press, and John Yang, with Khandpur leading the effort in collaboration with the broader SWE-bench team. Several of these contributors are core authors and maintainers of the original SWE-bench, which is why the project is considered an official member of the SWE-bench family rather than a third-party fork [2][3].
The dataset was assembled with the same execution-based methodology as SWE-bench, adapted to a multilingual setting through a four-stage pipeline [2][3].
Scoring uses two test categories inherited from SWE-bench: fail-to-pass (F2P) tests, which confirm that the specific issue is actually fixed, and pass-to-pass (P2P) tests, which confirm that the change does not break existing functionality. The dataset was deliberately kept small, at 300 tasks, so that the full benchmark is cheap and fast to run while remaining high quality. The tasks are not trivial: the human-written gold patch modifies a median of 10 lines of code, rising to 110 lines at the 95th percentile. The underlying issues date from 2021 to 2025, with the largest concentration in 2024 [1][2].
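The pass criterion built on F2P and P2P tests can be sketched in a few lines of Python. This is a minimal illustration, not the SWE-bench harness itself: the `TaskResult` class, its field names, and the test identifiers are hypothetical, and the rule that a task needs at least one F2P test is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """Hypothetical record of test outcomes after applying a candidate patch."""
    # Each dict maps a test name to True if it passed on the patched repository.
    fail_to_pass: dict[str, bool] = field(default_factory=dict)
    pass_to_pass: dict[str, bool] = field(default_factory=dict)


def is_resolved(result: TaskResult) -> bool:
    """A task counts as resolved only if every F2P and every P2P test passes."""
    # Assumption of this sketch: with no F2P tests there is no evidence of a fix.
    if not result.fail_to_pass:
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolution_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved, in the range [0, 1]."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


# One patch fixes the issue cleanly; the other fixes it but breaks an existing test.
fixed = TaskResult({"test_issue": True}, {"test_existing": True})
regressed = TaskResult({"test_issue": True}, {"test_existing": False})
print(resolution_rate([fixed, regressed]))  # 0.5
```

The second result shows why P2P tests matter: a patch that fixes the reported issue but causes a regression does not count as a resolution.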
The table below summarizes the dataset.
| Property | SWE-bench Multilingual |
|---|---|
| Number of tasks | 300 |
| Repositories | 42 |
| Languages | 9 (C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, Rust) |
| Task type | Real GitHub issue resolution (patch generation) |
| Verification | Execution-based, F2P and P2P unit tests in Docker |
| Released | May 6, 2025 |
| Authors | Khandpur, Lieret, Jimenez, Press, Yang (SWE-bench team) |
SWE-bench Multilingual has become one of the standard coding benchmarks that model developers cite when claiming broad, cross-language software-engineering ability, alongside SWE-bench Verified. It is distributed through the official SWE-bench repository and Hugging Face, and its results sit within the SWE-bench leaderboard family [1][2].
The benchmark has featured in frontier model evaluations since its release in May 2025. Anthropic reported SWE-bench Multilingual scores in connection with Claude 3.7 Sonnet, a model released in February 2025, where the results illustrated that strong Python performance did not fully transfer to other languages [4]. Subsequent model comparisons through 2025 and into 2026 continued to use the benchmark to differentiate models on cross-language consistency. Vendors and independent analysts reported per-language breakdowns rather than a single aggregate, because language-by-language variation is the point of the benchmark [5].
The headline finding from the benchmark's own reference evaluation is a clear language-generalization gap. Using the SWE-agent scaffold with a cost limit of about USD 2.50 per task, Claude 3.7 Sonnet resolved 43 percent of SWE-bench Multilingual tasks, compared with 63 percent on SWE-bench Verified under a comparable setup. The roughly 20-point drop quantifies how much harder non-Python issue resolution was for a leading 2025 coding model [1][2].
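The arithmetic behind these figures is easy to check. The short sketch below uses only the rounded percentages quoted above, so the task count it derives is approximate; the variable names are illustrative.

```python
TOTAL_TASKS = 300            # size of SWE-bench Multilingual
MULTILINGUAL_PERCENT = 43    # reported resolution rate, Claude 3.7 Sonnet
VERIFIED_PERCENT = 63        # reported rate on SWE-bench Verified

# Approximate number of resolved tasks implied by the rounded 43 percent figure.
resolved_tasks = TOTAL_TASKS * MULTILINGUAL_PERCENT // 100

# Absolute gap in percentage points, and the drop relative to the Verified score.
gap_points = VERIFIED_PERCENT - MULTILINGUAL_PERCENT
relative_drop = gap_points / VERIFIED_PERCENT

print(resolved_tasks)          # 129
print(gap_points)              # 20
print(f"{relative_drop:.0%}")  # 32%
```

In other words, the 20-point absolute gap corresponds to the model resolving roughly a third fewer tasks, proportionally, than it did on the Python benchmark.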
Performance also varied widely by language. In the reference run, Rust had the highest resolution rate at about 58 percent and Java was also relatively strong at roughly 53 percent, while C and C++ were the hardest at about 29 percent and Go was near 31 percent. This spread shows that "multilingual" coding ability is not uniform: an agent can be far more effective in some ecosystems than others, which a single aggregate number would obscure [1][2].
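A small sketch makes the spread concrete. It uses only the approximate per-language figures quoted above, which cover four of the reported groups (the text gives a single figure for C and C++), together with the 43 percent aggregate from the same reference run.

```python
# Approximate per-language resolution rates (percent) from the reference run.
# C and C++ share one entry because the text reports a single figure for them.
rates = {"Rust": 58, "Java": 53, "Go": 31, "C/C++": 29}
AGGREGATE = 43  # overall resolution rate (percent) in the same run

ranked = sorted(rates.items(), key=lambda item: item[1], reverse=True)
best, worst = ranked[0], ranked[-1]

# Distance in percentage points between the easiest and hardest language.
spread = best[1] - worst[1]

# Languages where the agent did worse than the headline number suggests.
below_aggregate = [lang for lang, rate in rates.items() if rate < AGGREGATE]

print(best, worst, spread)  # ('Rust', 58) ('C/C++', 29) 29
print(below_aggregate)      # ['Go', 'C/C++']
```

The 29-point spread between the best and worst of these languages is larger than the 20-point gap between the Multilingual and Verified aggregates.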
As frontier models advanced through 2025 and 2026, reported scores rose and the gap relative to Python narrowed, but cross-language variation persisted as a discriminating signal between top models. Public comparisons in this period highlighted, for example, that some leading models led on most but not all of the languages tested, underscoring that no model had fully closed the per-language gap [5]. Because reported figures depend heavily on the agent scaffold, cost or step budget, and model version, scores should always be attributed to a specific configuration rather than treated as a single canonical number for a model.
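One way to keep scores tied to their configuration is to store the configuration alongside the number and refuse to compare records that differ. The sketch below is hypothetical: the `ReportedScore` class and its fields are illustrative, and the scaffold recorded for the Verified entry is an assumption, since the reference evaluation is described only as using "a comparable setup".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReportedScore:
    """Hypothetical record pairing a score with the setup that produced it."""
    model: str
    benchmark: str
    scaffold: str
    budget_usd_per_task: Optional[float]  # None when the source does not state it
    resolved_percent: float

    def configuration(self) -> tuple:
        """Everything except the model and the score itself."""
        return (self.benchmark, self.scaffold, self.budget_usd_per_task)


def same_configuration(a: ReportedScore, b: ReportedScore) -> bool:
    """Scores are directly comparable only when their configurations match."""
    return a.configuration() == b.configuration()


multilingual = ReportedScore(
    "Claude 3.7 Sonnet", "SWE-bench Multilingual", "SWE-agent", 2.50, 43.0
)
# Scaffold assumed to be SWE-agent; the budget for this run is not stated.
verified = ReportedScore(
    "Claude 3.7 Sonnet", "SWE-bench Verified", "SWE-agent", None, 63.0
)

print(same_configuration(multilingual, verified))  # False
```

Keeping the benchmark name in the configuration also guards against mixing results from differently constructed datasets.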
SWE-bench Multilingual is one of several extensions of the original SWE-bench, each targeting a different axis of software-engineering capability.
A separate and frequently confused project is Multi-SWE-bench, released by ByteDance's Seed (Doubao) team on April 14, 2025. Multi-SWE-bench is an independent multilingual issue-resolution benchmark, not an official part of the SWE-bench project. It contains 1,632 instances across seven languages (Java, TypeScript, JavaScript, Go, Rust, C, and C++), grades each instance as Easy, Medium, or Hard, and ships alongside Multi-SWE-RL, a larger pool of reproducible environments for reinforcement-learning training [6]. The two efforts share a motivation, namely that issue resolution outside Python is underexplored, but they differ in origin, size, language set, and design. SWE-bench Multilingual is the smaller, 300-task official dataset built by the SWE-bench team for fast, high-quality cross-language evaluation, while Multi-SWE-bench is ByteDance's larger, independently constructed corpus. Scores from the two benchmarks should not be compared as if they measured the same thing.