SWE-bench Multilingual

Updated Jun 28, 2026

What is SWE-bench Multilingual?

SWE-bench Multilingual is an AI benchmark of 300 real-world software bug-fixing tasks drawn from 42 open-source repositories across nine programming languages, built by members of the SWE-bench team to measure whether AI agents can resolve GitHub issues in languages other than Python. Released on May 6, 2025, it is the official multilingual extension of SWE-bench, the widely used software-engineering agent benchmark, and it spans C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, and Rust ^[1]^[2]^[3]. In its own reference evaluation, Claude 3.7 Sonnet resolved 43 percent of SWE-bench Multilingual tasks versus 63 percent on SWE-bench Verified, quantifying how much harder non-Python issue resolution remains for a leading 2025 coding model ^[2]^[3].

Each task mirrors the structure of the original SWE-bench: an AI code generation agent is given a real GitHub issue and a snapshot of the repository before the fix, and must produce a code patch that resolves the issue. Success is judged not by string matching but by execution: the patched repository must pass a set of hidden unit tests. SWE-bench Multilingual exists primarily to expose the language-generalization gap that the largely Python-only SWE-bench leaves unmeasured ^[2]^[3].

Why was SWE-bench Multilingual created?

SWE-bench, introduced in 2023 by researchers from Princeton University and the University of Chicago, evaluates agents on 2,294 real GitHub issues drawn from 12 popular Python repositories. SWE-bench Verified, a 500-task human-validated subset released in 2024 by OpenAI in collaboration with the SWE-bench authors, became the de facto industry standard for reporting autonomous coding ability ^[1]^[2].

A structural limitation runs through this lineage: the tasks are almost entirely written in Python. As a result, a high SWE-bench Verified score demonstrates that an agent can navigate Python codebases, install Python dependencies, and run Python test suites, but it says little about whether the same agent can work in a statically typed language like Java or Go, a systems language like C or Rust, or a web language like JavaScript, TypeScript, or PHP. These ecosystems differ substantially in build tooling, package management, type systems, and testing conventions. Because real-world software engineering spans dozens of languages, a benchmark confined to one language risks overstating how broadly a coding agent generalizes. The SWE-bench team summarized the motivation plainly: "LLMs are more proficient in Python than other languages." SWE-bench Multilingual was created specifically to close this measurement gap ^[2]^[3].

What does SWE-bench Multilingual cover?

SWE-bench Multilingual was developed by Kabir Khandpur, Kilian Lieret, Carlos E. Jimenez, Ofir Press, and John Yang, with Khandpur leading the effort in collaboration with the broader SWE-bench team. Several of these contributors are core authors and maintainers of the original SWE-bench, which is why the project is considered an official member of the SWE-bench family rather than a third-party fork ^[2]^[3].

The dataset was assembled with the same execution-based methodology as SWE-bench, adapted to a multilingual setting through a four-stage pipeline ^[2]^[3]:

1. Repository selection. Top-starred repositories were chosen across the nine target languages, yielding 42 source repositories.
2. Issue collection. Real GitHub issue-and-pull-request pairs were extracted using the original SWE-bench data pipeline, so every task corresponds to a genuine bug fix or feature that human maintainers actually merged.
3. Environment configuration. Base and per-instance Docker images were built to provide reproducible, executable environments. Because the 42 repositories share little dependency overlap, the intermediate "environment" image layer used in Python-only SWE-bench was skipped.
4. Task validation. Every task was manually verified to confirm that the relevant tests fail on the unpatched code and pass once the human "gold" patch is applied.
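
The validation rule in the final stage can be sketched as a small check. This is a minimal illustration, not the SWE-bench team's tooling: the function name `is_valid_task` and the representation of a test run as a name-to-outcome dictionary are assumptions made for the example.

```python
def is_valid_task(
    before: dict[str, bool],
    after: dict[str, bool],
    relevant_tests: list[str],
) -> bool:
    """Return True if the relevant tests fail before the gold patch and pass after it.

    `before` and `after` map a test name to True when that test passed.
    Both the name and the data shape are hypothetical, chosen for illustration.
    """
    if not relevant_tests:
        # A task with no relevant tests cannot demonstrate that anything was fixed.
        return False
    # A test missing from the "before" run counts as failing.
    fails_before = all(not before.get(name, False) for name in relevant_tests)
    # A test missing from the "after" run counts as not passing.
    passes_after = all(after.get(name, False) for name in relevant_tests)
    return fails_before and passes_after


if __name__ == "__main__":
    unpatched = {"test_parse_empty_input": False}
    gold_patched = {"test_parse_empty_input": True}
    print(is_valid_task(unpatched, gold_patched, ["test_parse_empty_input"]))
```

A candidate task whose tests already pass on the unpatched code would be rejected by this check, since it would not show that the gold patch changes anything.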

Scoring uses two test categories inherited from SWE-bench: fail-to-pass (F2P) tests, which confirm that the specific issue is actually fixed, and pass-to-pass (P2P) tests, which confirm that existing functionality is not broken by the change. The dataset was deliberately kept small, at 300 tasks, so that the full benchmark is cheap and fast to run while remaining high quality. The tasks are not trivial: the human gold patch modifies a median of 10 lines of code, and 110 lines or fewer in 95 percent of tasks. The underlying issues date from 2021 to 2025, with the largest concentration in 2024. The dataset is published by the SWE-bench organization on Hugging Face as SWE-bench/SWE-bench_Multilingual, with sample repositories including Apache Druid and Lucene (Java) and Ruff (Rust) ^[1]^[2]^[4].
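
The two test categories combine into a single pass/fail verdict per task. The sketch below assumes the all-or-nothing rule used across the SWE-bench family, in which a task counts as resolved only when every F2P and every P2P test passes. The function names and the dictionary shape are hypothetical and are not taken from the SWE-bench harness.

```python
def is_resolved(
    results: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> bool:
    """Decide whether a candidate patch resolves a task.

    `results` maps a test name to True when the test passed after the patch.
    ASSUMPTION: every F2P and every P2P test must pass (all-or-nothing).
    """
    def all_pass(names: list[str]) -> bool:
        # A test that did not run is treated as a failure.
        return all(results.get(name, False) for name in names)

    # With no F2P tests there is no evidence that the issue was fixed.
    return bool(fail_to_pass) and all_pass(fail_to_pass) and all_pass(pass_to_pass)


def resolution_rate(outcomes: list[bool]) -> float:
    """Percentage of tasks resolved, given one boolean verdict per task."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    after_patch = {"test_fix": True, "test_existing": False}
    # The fix works but breaks an existing test, so the task is not resolved.
    print(is_resolved(after_patch, ["test_fix"], ["test_existing"]))
```

Under this rule a patch that fixes the issue but breaks one previously passing test scores the same as a patch that does nothing.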

The table below summarizes the dataset.

Property	SWE-bench Multilingual
Number of tasks	300
Repositories	42
Languages	9 (C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, Rust)
Task type	Real GitHub issue resolution (patch generation)
Verification	Execution-based, F2P and P2P unit tests in Docker
Released	May 6, 2025
Authors	Khandpur, Lieret, Jimenez, Press, Yang (SWE-bench team)
Distribution	Hugging Face (SWE-bench/SWE-bench_Multilingual), SWE-bench repo
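
For readers who want to check the counts in the table themselves, the following is a minimal sketch of loading the dataset and tallying tasks per repository. It assumes the Hugging Face `datasets` package is installed, that the dataset exposes a `test` split, and that each row carries a `repo` field as in the original SWE-bench schema. None of these details are confirmed by the sources cited here.

```python
from collections import Counter
from typing import Iterable, Mapping

# Dataset identifier as given in this article.
DATASET_ID = "SWE-bench/SWE-bench_Multilingual"


def tasks_per_repo(rows: Iterable[Mapping[str, str]]) -> Counter:
    """Count tasks per repository. ASSUMPTION: each row has a 'repo' field."""
    return Counter(row["repo"] for row in rows)


def load_rows(split: str = "test"):
    """Download the dataset. Requires network access and the `datasets` package.

    ASSUMPTION: the split is named 'test', as in other SWE-bench datasets.
    """
    # Imported lazily so tasks_per_repo() stays usable without the dependency.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split)


if __name__ == "__main__":
    counts = tasks_per_repo(load_rows())
    print(f"{sum(counts.values())} tasks across {len(counts)} repositories")
```

If the figures in the table hold and the assumptions above are right, the final line would report 300 tasks across 42 repositories.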

How is SWE-bench Multilingual used in model evaluations?

SWE-bench Multilingual has become one of the standard coding benchmarks that model developers and analysts cite when assessing broad, cross-language software-engineering ability, alongside SWE-bench Verified. It is distributed through the official SWE-bench repository and Hugging Face, and its results sit within the SWE-bench leaderboard family ^[1]^[2].

The benchmark's headline reference numbers come from the SWE-bench team's own evaluation rather than from a model vendor. Running Claude 3.7 Sonnet (released by Anthropic in February 2025) inside the SWE-agent scaffold with a cost limit of about USD 2.50 per task, the team measured a 43 percent resolution rate on SWE-bench Multilingual, against 63 percent on SWE-bench Verified for a comparable setup. This is the figure most widely repeated for the benchmark ^[2]^[3]. Anthropic's own Claude 3.7 Sonnet announcement reported a SWE-bench Verified score but no SWE-bench Multilingual score, so the Multilingual numbers should be attributed to the SWE-bench reference run, not to Anthropic ^[5]. Model comparisons through 2025 and into 2026 continued to use the benchmark to differentiate models on cross-language consistency. Vendors and independent analysts reported per-language breakdowns rather than a single aggregate, because language-by-language variation is the point of the benchmark ^[6].

How do models perform on SWE-bench Multilingual?

The headline finding from the benchmark's own reference evaluation is a clear language-generalization gap. As the SWE-bench team put it, "Claude 3.7 Sonnet achieves a 43% resolution rate on SWE-bench Multilingual, compared to 63% on SWE-bench Verified, highlighting room for improvement in languages other than Python." Using the SWE-agent scaffold with a cost limit of about USD 2.50 per task, the roughly 20-point drop quantifies how much harder non-Python issue resolution was for a leading 2025 coding model ^[2]^[3].

Performance also varied widely by language. In the reference run, Rust had the highest resolution rate at 58.14 percent and Java was also strong at 53.49 percent, followed by PHP at 48.84 percent and Ruby at 43.18 percent. The hardest categories were JavaScript/TypeScript at 34.88 percent, Go at 30.95 percent, and C/C++ at 28.57 percent. This spread shows that "multilingual" coding ability is not uniform: an agent can be far more effective in some ecosystems than others, which a single aggregate number would obscure ^[2]^[3].

Language	Claude 3.7 Sonnet resolution rate (SWE-agent, ~USD 2.50 limit)
Rust	58.14%
Java	53.49%
PHP	48.84%
Ruby	43.18%
JavaScript / TypeScript	34.88%
Go	30.95%
C / C++	28.57%
Overall	43%
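
The per-language rates in the table above are consistent with the 43 percent overall figure, which a short calculation can illustrate. The resolved and total counts below are not taken from the sources. They are inferred as one set of whole numbers that reproduces each reported percentage and sums to 300 tasks, so they should be read as an assumption.

```python
# Percentages as reported in the table above.
REPORTED_PCT = {
    "Rust": 58.14,
    "Java": 53.49,
    "PHP": 48.84,
    "Ruby": 43.18,
    "JavaScript / TypeScript": 34.88,
    "Go": 30.95,
    "C / C++": 28.57,
}

# (resolved, total) per language group.
# ASSUMPTION: inferred from the percentages, not published in the cited sources.
INFERRED_COUNTS = {
    "Rust": (25, 43),
    "Java": (23, 43),
    "PHP": (21, 43),
    "Ruby": (19, 44),
    "JavaScript / TypeScript": (15, 43),
    "Go": (13, 42),
    "C / C++": (12, 42),
}


def pct(resolved: int, total: int) -> float:
    """Resolution rate as a percentage."""
    return 100.0 * resolved / total


def overall_pct(counts: dict[str, tuple[int, int]]) -> float:
    """Task-weighted overall rate: total resolved divided by total tasks."""
    resolved = sum(r for r, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return pct(resolved, total)


if __name__ == "__main__":
    for language, (resolved, total) in INFERRED_COUNTS.items():
        print(f"{language}: {pct(resolved, total):.2f}% (reported {REPORTED_PCT[language]}%)")
    print(f"Overall: {overall_pct(INFERRED_COUNTS):.2f}%")
```

Under these inferred counts, 128 of 300 tasks are resolved, about 42.67 percent, which rounds to the reported 43 percent.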

As frontier models advanced through 2025 and 2026, reported scores rose and the gap relative to Python narrowed, but cross-language variation persisted as a discriminating signal between top models. Public comparisons in this period highlighted, for example, that some leading models led on most but not all of the languages tested, underscoring that no model had fully closed the per-language gap ^[6]. Because reported figures depend heavily on the agent scaffold, cost or step budget, and model version, scores should always be attributed to a specific configuration rather than treated as a single canonical number for a model.
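
One way to keep a score tied to its configuration is to store the setup alongside the number. The record below is a hypothetical sketch: the class name `ReportedScore`, its fields, and the `same_setup` rule are illustrative choices, not a format used by the SWE-bench leaderboards.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReportedScore:
    """A benchmark score together with the configuration that produced it."""

    benchmark: str
    model: str
    scaffold: str
    cost_limit_usd: Optional[float]  # None when no cost limit was reported
    resolved_pct: float
    source: str

    def same_setup(self, other: "ReportedScore") -> bool:
        """True when two scores share benchmark, scaffold, and cost limit.

        ASSUMPTION: these three fields are enough to call two runs comparable;
        a real comparison might also need step budgets or model versions.
        """
        return (self.benchmark, self.scaffold, self.cost_limit_usd) == (
            other.benchmark,
            other.scaffold,
            other.cost_limit_usd,
        )


if __name__ == "__main__":
    reference = ReportedScore(
        benchmark="SWE-bench Multilingual",
        model="Claude 3.7 Sonnet",
        scaffold="SWE-agent",
        cost_limit_usd=2.50,
        resolved_pct=43.0,
        source="SWE-bench team reference run",
    )
    # A made-up entry with a different scaffold, for illustration only.
    other = ReportedScore(
        benchmark="SWE-bench Multilingual",
        model="hypothetical-model",
        scaffold="hypothetical-scaffold",
        cost_limit_usd=None,
        resolved_pct=50.0,
        source="illustrative example",
    )
    print(reference.same_setup(other))
```

Comparing only scores for which `same_setup` holds avoids ranking models on differences that come from the scaffold or budget rather than from the model.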

How does SWE-bench Multilingual relate to other SWE-bench variants?

SWE-bench Multilingual is one of several extensions of the original SWE-bench, each targeting a different axis of software-engineering capability:

SWE-bench Verified: a 500-task, human-validated Python subset of the original benchmark, focused on reliability of the evaluation rather than language coverage. SWE-bench Multilingual adds the language dimension that Verified lacks ^[1].
SWE-bench Multimodal: an extension, introduced in October 2024 by John Yang and collaborators, that evaluates agents on visual, user-facing JavaScript software, such as UI, diagramming, and data-visualization issues whose problem descriptions include images or screenshots. It probes modality generalization, whereas Multilingual probes language generalization ^[1].
SWE-bench Pro and SWE-bench Multi: later additions to the SWE-bench leaderboard family that target, respectively, harder enterprise-grade tasks and additional evaluation settings.

A separate and frequently confused project is Multi-SWE-bench, released by ByteDance's Seed (Doubao) team on April 14, 2025. Multi-SWE-bench is an independent multilingual issue-resolution benchmark, not an official part of the SWE-bench project. It contains 1,632 instances across seven languages (Java, TypeScript, JavaScript, Go, Rust, C, and C++), adds an explicit Easy, Medium, Hard difficulty grading, and ships alongside Multi-SWE-RL, a larger pool of reproducible environments for reinforcement-learning training ^[7]. The two efforts share a motivation, namely that issue resolution outside Python is underexplored, but they differ in origin, size, language set, and design. SWE-bench Multilingual is the smaller, 300-task official dataset built by the SWE-bench team for fast, high-quality cross-language evaluation, while Multi-SWE-bench is ByteDance's larger, independently constructed corpus. Care should be taken not to compare scores across the two as if they measured the same thing.

References

1. SWE-bench, "SWE-bench Leaderboards" and SWE-bench project pages. https://www.swebench.com/
2. SWE-bench, "SWE-bench Multilingual." https://www.swebench.com/multilingual.html
3. Kabir Khandpur, "SWE-bench Multilingual," May 6, 2025. https://kabirk.com/multilingual
4. Hugging Face, "SWE-bench/SWE-bench_Multilingual" dataset. https://huggingface.co/datasets/SWE-bench/SWE-bench_Multilingual
5. Anthropic, "Claude 3.7 Sonnet and Claude Code," February 2025. https://www.anthropic.com/news/claude-3-7-sonnet
6. CodeAnt AI, "SWE-bench Leaderboard 2026: All Model Scores, Rankings & What They Actually Mean." https://www.codeant.ai/blogs/swe-bench-scores
7. ByteDance Seed, "Multi-SWE-bench: First Multilingual Code Fix Benchmark Open Source," April 14, 2025. https://seed.bytedance.com/en/blog/multi-swe-bench-first-multilingual-code-fix-benchmark-open-source
