SWE-bench Multimodal

Updated Jun 2, 2026

SWE-bench Multimodal (also written SWE-bench M) is a benchmark that measures whether autonomous software-engineering systems can resolve bugs in visual, user-facing software by acting on issue reports that contain images. It extends SWE-bench from text-only Python tasks to JavaScript projects whose problem statements include screenshots, diagrams, and other visual material. It was introduced in the 2024 paper "SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?" by John Yang, Carlos E. Jimenez, and collaborators from Princeton University, Stanford University, and Meta AI.^[1]^[2] The dataset contains 619 task instances drawn from 17 popular JavaScript libraries, each paired with executable unit tests, and it is widely used to probe the visual and cross-language generalization of code-fixing AI agents.^[1]^[3]

Overview

The original SWE-bench evaluates systems on 2,294 GitHub issue-and-pull-request pairs taken entirely from Python repositories, with problem statements presented as plain text.^[4] SWE-bench Multimodal was built to answer a question that the Python-only design could not address: whether systems that score well on text-based Python repair also work in other languages and on issues that depend on visual information.^[1] To that end the authors assembled tasks from JavaScript projects in visual domains, where users frequently attach a screenshot of a broken interface or a mockup of the intended result rather than describing the problem in prose.^[1]^[2]

Each instance follows the same contract as SWE-bench. A system receives a codebase at a specific commit and a natural-language issue description, and it must produce a patch that makes the repository pass a set of hidden tests associated with the corresponding fix.^[1]^[4] What differs in SWE-bench Multimodal is the input distribution: the language is JavaScript rather than Python, the application domains are visual, and the issue text is accompanied by images that often carry information not present in the words.^[1]
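
The input side of this contract can be pictured as a small record. The sketch below is illustrative only: the class name, field names, and example values are assumptions made for this example and are not the dataset's actual column names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """Illustrative shape of one task; all field names are assumptions."""
    repo: str                          # repository identifier (hypothetical value below)
    base_commit: str                   # commit the codebase is checked out at
    issue_text: str                    # natural-language issue description
    image_urls: tuple[str, ...] = ()   # images attached to the issue, if any


def build_prompt(task: TaskInstance) -> str:
    """Assemble the text part of the input; images are listed by index."""
    lines = [f"Repository: {task.repo} @ {task.base_commit}", "", task.issue_text]
    for index, url in enumerate(task.image_urls, start=1):
        lines.append(f"[image {index}] {url}")
    return "\n".join(lines)


# Hypothetical instance, invented for illustration.
example_task = TaskInstance(
    repo="example-org/example-charts",
    base_commit="0123abc",
    issue_text="Legend overlaps the plot area when the window is narrow.",
    image_urls=("https://example.invalid/screenshot.png",),
)
prompt = build_prompt(example_task)
```

A multimodal system would pass the referenced images to the model alongside this text, while a text-only system sees only the links.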

A central finding of the paper is that strong performance on Python SWE-bench does not transfer cleanly to this setting. Systems that topped the SWE-bench leaderboard at the time resolved only a small fraction of SWE-bench Multimodal tasks, exposing weaknesses in both visual understanding and generalization across programming languages and software paradigms.^[1]^[3]

Relationship to SWE-bench

SWE-bench Multimodal is one of several datasets in the broader SWE-bench family maintained by the same group. The original SWE-bench and its human-filtered subset SWE-bench Verified are Python-only and text-only.^[4] SWE-bench Multimodal keeps the issue-to-patch task formulation but changes the language to JavaScript and adds the visual dimension, making it a generalization stress test rather than a harder version of the same distribution.^[1] It is distinct from later, independently developed benchmarks such as SWE-Bench Pro and the freelance-task benchmark SWE-Lancer, which target different concerns.

The benchmark was integrated into the main SWE-bench code repository, and its test split uses a private evaluation procedure: gold patches and full test outcomes for the test instances are withheld, and submissions are scored through the project's hosted command-line tooling to limit contamination and overfitting.^[3] This mirrors the access controls the maintainers apply to other private SWE-bench splits.
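
Because gold patches are withheld, a submission consists only of the patches a system generated. The sketch below assumes the prediction record layout used by the open SWE-bench harness (`instance_id`, `model_name_or_path`, `model_patch`); whether the hosted tooling expects exactly this layout is an assumption, and the instance identifier and patch text are invented.

```python
import json


def make_predictions(patches: dict[str, str], model_name: str) -> list[dict[str, str]]:
    """Turn {instance_id: unified diff} into a list of prediction records."""
    return [
        {
            "instance_id": instance_id,
            "model_name_or_path": model_name,
            "model_patch": patch,
        }
        for instance_id, patch in sorted(patches.items())
    ]


# Hypothetical identifier and diff, for illustration only.
predictions = make_predictions(
    {"example-org__example-charts-123": "diff --git a/src/legend.js b/src/legend.js\n"},
    model_name="my-agent",
)
payload = json.dumps(predictions, indent=2)  # text that would be written to a predictions file
```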

What it evaluates

SWE-bench Multimodal targets two capabilities that the Python text-only benchmarks leave untested.^[1] The first is multimodal reasoning: because issues come with screenshots, design mockups, diagrams, and rendered error messages, a system has to interpret pixels, not just tokens, and connect what it sees to the relevant source code. Annotators judged that images were necessary to solve the task in 83.5% of instances, and that roughly 80% of images conveyed information beyond what the accompanying text stated, so visual input is load-bearing rather than decorative.^[1]

The second capability is cross-language and cross-domain generalization. The tasks come from JavaScript projects for data visualization, diagramming, web UI components, mapping, and syntax highlighting, which exercise rendering logic, the document object model, and front-end frameworks rather than the scientific and backend Python libraries that dominate SWE-bench.^[1] A system that has effectively memorized patterns from Python repositories cannot rely on that experience here. That is what makes the benchmark a generalization probe, and it connects the benchmark to broader interest in computer vision and multimodal model capabilities for coding.

Dataset and construction

The benchmark contains 619 task instances. The authors split them into a public development set of 102 instances drawn from 5 repositories and a test set of 517 instances drawn from 12 repositories, for 17 JavaScript libraries in total.^[1] The abstract gives a slightly different headline figure of 617 task instances.^[2] The public dataset card on Hugging Face also lists slightly different counts for the released subset, reflecting which instances and columns are made public versus held back for private test scoring.^[5]
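
The split figures can be checked with a few lines of arithmetic. The numbers come from the paragraph above; the dictionary layout and variable names are chosen for this example.

```python
# Split sizes as reported in the text above.
splits = {
    "dev": {"instances": 102, "repos": 5},
    "test": {"instances": 517, "repos": 12},
}

total_instances = sum(split["instances"] for split in splits.values())
total_repos = sum(split["repos"] for split in splits.values())

# Share of all instances that sit in the privately scored test split.
test_share = splits["test"]["instances"] / total_instances

print(total_instances, total_repos, f"{test_share:.1%}")
```

The totals match the 619 instances and 17 libraries quoted in the text, and about five in six instances belong to the test split.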

Construction followed a five-stage pipeline.^[1] First, the authors selected user-facing JavaScript repositories with large communities, using thresholds of at least 5,000 stars and 500 pull requests, in domains where issues tend to include visuals. Second, they filtered pull requests to those whose linked issue or test changes contained images or videos. Third, they built Docker execution environments capable of running Node.js together with a headless Chrome browser, since the projects render visual output. Fourth, they ran each candidate's tests 10 times and discarded instances with inconsistent or flaky results. Fifth, human annotators reviewed the remaining instances to remove impossible tasks and to categorize the images. The funnel began with roughly 135,000 pull requests, narrowed to 1,478 candidates, then 679 executable instances, then 643 after consistency checks, and finally 619 validated instances.^[1]
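
Two parts of this pipeline lend themselves to a short sketch: the repeated-run consistency check and the stage-to-stage retention of the funnel. The code below is a minimal illustration, not the authors' tooling; `is_stable`, `run_tests`, and the stage labels are names invented here, and only the counts come from the text.

```python
from typing import Callable


def is_stable(run_tests: Callable[[], dict[str, bool]], repeats: int = 10) -> bool:
    """Return True when every one of `repeats` runs yields identical per-test outcomes."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    first = run_tests()
    return all(run_tests() == first for _ in range(repeats - 1))


# Funnel counts as reported above; the stage labels are paraphrases.
FUNNEL = [
    ("pull requests scanned", 135_000),
    ("candidates with visuals", 1_478),
    ("executable instances", 679),
    ("after consistency checks", 643),
    ("validated instances", 619),
]


def retention(funnel: list[tuple[str, int]]) -> list[tuple[str, float]]:
    """Fraction of items kept at each stage relative to the previous stage."""
    return [
        (name, count / previous)
        for (_, previous), (name, count) in zip(funnel, funnel[1:])
    ]


for stage, kept in retention(FUNNEL):
    print(f"{stage}: {kept:.1%} of previous stage")
```

The steepest drop is the image filter at the second stage; the later execution, consistency, and annotation stages each discard a much smaller share.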

The validated set contains 862 images in problem statements, which the authors group into seven categories: website screenshots (401), code screenshots (194), diagrams (107), error messages (54), art (38), maps (35), and data visualizations (28).^[1] This distribution shows that most visual context is genuine interface or rendering information rather than incidental imagery.

Task design and harness

Like SWE-bench, the benchmark uses execution-based evaluation built on unit tests rather than text similarity.^[1]^[4] Each instance defines two test groups. FAIL_TO_PASS tests fail on the unpatched code and must pass after a correct fix; they verify that the bug is actually resolved. PASS_TO_PASS tests pass before and must still pass after the patch; they guard against regressions and trivially broad edits.^[1]^[4] A patch counts as resolving an instance only if every FAIL_TO_PASS test passes and the PASS_TO_PASS tests remain green.^[1]
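
The resolution rule reduces to a conjunction over both test groups. The sketch below is a simplified illustration rather than the harness code; the function name and the choice to treat a test missing from the results as a failure are assumptions.

```python
def is_resolved(
    results: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> bool:
    """True only if every FAIL_TO_PASS and every PASS_TO_PASS test passed.

    `results` maps test names to pass/fail after the patch is applied.
    A test absent from `results` counts as failed (an assumption of this sketch).
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(results.get(name, False) for name in required)


# Hypothetical test names, for illustration only.
after_patch = {"renders legend": True, "keeps axis labels": True, "old behavior": False}
fixed = is_resolved(after_patch, ["renders legend"], ["keeps axis labels"])
regressed = is_resolved(after_patch, ["renders legend"], ["old behavior"])
```

Here `fixed` is true, while `regressed` is false because the patch broke a test that passed before it.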

Because the software is visual, the harness runs inside containers that provide Node.js and a headless browser so that rendering-dependent tests can execute.^[1] A subset of 69 instances uses pixel-level visual testing, comparing rendered output against reference images with browser-automation and image-diffing tooling such as Puppeteer and Pixelmatch, so that some tasks are graded on what the interface looks like and not only on internal program state.^[1]
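
The idea behind pixel-level grading can be illustrated with a naive comparison of two small images. This is a simplified stand-in written for this example; it is not Pixelmatch's algorithm, which runs in JavaScript and uses a perceptual color metric, and the function name and `tolerance` parameter are assumptions.

```python
Pixel = tuple[int, int, int]  # RGB channels, each 0-255


def mismatch_ratio(
    rendered: list[list[Pixel]],
    reference: list[list[Pixel]],
    tolerance: int = 0,
) -> float:
    """Fraction of pixels whose largest per-channel difference exceeds `tolerance`."""
    same_shape = len(rendered) == len(reference) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(rendered, reference)
    )
    if not same_shape:
        raise ValueError("images must have identical dimensions")
    total = sum(len(row) for row in reference)
    if total == 0:
        raise ValueError("images must contain at least one pixel")
    differing = sum(
        1
        for row_a, row_b in zip(rendered, reference)
        for pixel_a, pixel_b in zip(row_a, row_b)
        if max(abs(a - b) for a, b in zip(pixel_a, pixel_b)) > tolerance
    )
    return differing / total


WHITE, BLACK = (255, 255, 255), (0, 0, 0)
reference_image = [[WHITE, WHITE], [WHITE, WHITE]]
rendered_image = [[WHITE, WHITE], [WHITE, BLACK]]  # one of four pixels differs
ratio = mismatch_ratio(rendered_image, reference_image)
```

A visual test would then pass only if the ratio stays at or below some small threshold chosen by the project.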

Metrics

The primary metric is the percentage of task instances resolved, written as % Resolved, defined as the proportion of instances for which the system's patch satisfies the full FAIL_TO_PASS and PASS_TO_PASS test conditions.^[1]^[3] Because grading is execution-based, the score reflects whether generated patches actually function, not whether they resemble the reference patch. Leaderboard entries commonly report associated metadata such as cost per instance and whether trajectories and logs are open, alongside the headline resolution rate.^[3]
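
As a worked example of the metric, the sketch below computes % Resolved from per-instance outcomes. The function name is chosen for this example, and the count of 63 resolved instances is an assumption picked because 63 of 517 rounds to 12.19; the source reports percentages, not raw counts.

```python
def percent_resolved(outcomes: dict[str, bool]) -> float:
    """Percentage of instances whose patch met both test conditions."""
    if not outcomes:
        raise ValueError("no instances were evaluated")
    return 100.0 * sum(outcomes.values()) / len(outcomes)


# Hypothetical run over a 517-instance split: the first 63 instances resolved.
outcomes = {f"instance-{i}": i < 63 for i in range(517)}
score = percent_resolved(outcomes)
print(f"{score:.2f}% resolved")
```

Because the denominator is the full split, instances for which a system produces no patch count as unresolved rather than being excluded.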

Notable results

The paper reported low absolute scores for the systems available in 2024, with the agentic SWE-agent configuration clearly ahead of retrieval-augmented generation (RAG) and the Python-oriented Agentless pipeline.^[1] The authors attributed SWE-agent's advantage to its flexible, language-agnostic interface, which let it explore unfamiliar JavaScript codebases more effectively than approaches tuned to Python.^[1] The table below lists representative baseline results from the paper on the test split.

System	Model	% Resolved	Source
SWE-agent Multimodal	GPT-4o (2024-08-06)	12.19	^[1]^[3]
SWE-agent	Claude 3.5 Sonnet	12.19	^[3]
SWE-agent JavaScript	Claude 3.5 Sonnet	11.99	^[3]
SWE-agent	GPT-4o (2024-08-06)	11.99	^[3]
SWE-agent Multimodal	Claude 3.5 Sonnet	11.41	^[1]^[3]
Agentless	Claude 3.5 Sonnet	6.19	^[3]
RAG	GPT-4o (2024-08-06)	6.00	^[1]^[3]
RAG	Claude 3.5 Sonnet	5.03	^[3]
Agentless	GPT-4o (2024-08-06)	3.09	^[3]

Scores rose sharply as stronger models and purpose-built agents were submitted to the public leaderboard. By mid-2025 several systems exceeded 30% resolved, roughly tripling the original baselines, although the figures remain far below the 60% to 90% rates that the best agents reach on Python SWE-bench Verified.^[3]^[6] The next table shows leading entries from the hosted SWE-bench Multimodal leaderboard.

System	% Resolved	Date	Source
GUIRepair + o3	35.98	2025-07-01	^[3]
Codefuse Pycfuse SVR	35.98	2025-11-17	^[3]
Refact.ai Agent	35.59	2025-06-11	^[3]
OpenHands-Versa (Claude Sonnet 4)	34.43	2025-05-28	^[3]
GUIRepair + o4-mini	33.85	2025-05-31	^[3]
OpenHands-Versa (Claude 3.7 Sonnet)	31.33	2025-05-09	^[3]
Zencoder	30.56	2025-04-01	^[3]
Globant Code Fixer Agent	29.59	2025-03-25	^[3]
Agentless Lite + Claude 3.5 Sonnet	25.34	2025-02-26	^[3]
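
The "roughly tripling" comparison can be checked directly against the scores listed in the two tables. The sketch below uses the highest baseline score and a few of the leaderboard scores; the variable names are chosen for this example.

```python
# Highest baseline score and selected leaderboard scores, as listed in the tables.
best_baseline = 12.19
leaderboard = {
    "GUIRepair + o3": 35.98,
    "Refact.ai Agent": 35.59,
    "Zencoder": 30.56,
    "Agentless Lite + Claude 3.5 Sonnet": 25.34,
}

# Improvement factor of each entry over the best original baseline.
improvement = {name: score / best_baseline for name, score in leaderboard.items()}

for name, factor in improvement.items():
    print(f"{name}: {factor:.2f}x the best baseline")
```

The top entry comes out just under three times the best baseline, which matches the wording used above.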

Several of these systems, including the OpenHands-Versa entries built on the OpenHands agent platform, demonstrate that agents designed for general software tasks can be adapted to the visual JavaScript setting.^[3] Because the leaderboard accepts ongoing submissions, the current top score may differ from the values above.

Significance

SWE-bench Multimodal extended automated-program-repair evaluation in two directions that the original benchmark had left open. By moving to JavaScript it tested whether agents based on large language models generalize beyond Python, and by requiring image interpretation it tested whether they can use visual evidence the way human developers do when triaging an interface bug.^[1] The early result that the best baseline resolved only about 12% of tasks, while Python-oriented pipelines fell to single-digit resolution rates, provided concrete evidence that headline SWE-bench numbers overstated general software-engineering ability.^[1]^[3]

The benchmark also influenced how agents are designed and reported. The advantage of the language-agnostic SWE-agent interface over Python-specialized pipelines reinforced a shift toward flexible agent scaffolds, and many later coding systems report SWE-bench Multimodal scores alongside Python results to document cross-domain robustness.^[1]^[3] It sits within a wider set of SWE-bench-style benchmarks that probe distinct facets of practical software work.

Limitations

The benchmark inherits the structural limitations of the SWE-bench format. Test-based grading can credit a patch that passes the hidden tests without fully matching developer intent, and it can penalize a functionally valid fix that the tests do not anticipate, so resolution rate is an imperfect proxy for correctness.^[4] Coverage is also bounded: the tasks come from 17 visual JavaScript libraries selected for popularity and for the presence of images, which does not represent backend services, other languages, or non-visual front-end work.^[1]

Practical constraints apply as well. Reproducing results requires Docker environments with Node.js and a headless browser, and the test split is evaluated privately through hosted tooling, so independent verification of a reported test-set number depends on the maintainers' infrastructure rather than fully open local execution.^[3] As with any actively maintained leaderboard, comparisons across entries can be complicated by differences in agent scaffolding, underlying model versions, inference cost, and the date of submission.^[3]

References
