SWE-Atlas


Updated Jun 3, 2026

SWE-Atlas is a benchmark for evaluating AI coding agents on professional software-engineering work that goes beyond fixing bugs and resolving issues. Built by Scale AI and introduced in the May 2026 paper SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution (arXiv:2605.08366), it groups 284 expert-authored tasks into three workflows that earlier benchmarks largely ignored: Codebase Q&A (124 tasks), Test Writing (90 tasks), and Refactoring (70 tasks).^[1]^[2] Rather than scoring only whether a patch makes a hidden test suite pass, SWE-Atlas pairs programmatic checks with rubric-based grading so that it can measure engineering quality, including test and refactor completeness, maintainability, reusable abstractions, and codebase hygiene.^[1] The paper, submitted on 8 May 2026 by a team of 15 Scale AI researchers led by Mohit Raghavendra and including Soham Dan, Yannis Yiming He, and Yunzhong He, reports that frontier models such as GPT-5.4 and Claude Opus 4.7 lead the field with Pass@1 scores in the low 40s, while the best open-weight model trails them by close to 20 points.^[1]^[3]

Why it extends issue-resolution benchmarks

Most agentic-coding evaluation has converged on the issue-resolution format popularized by SWE-bench, where a model receives a GitHub issue and a repository snapshot and must produce a patch that passes a held-out test suite.^[1] That format has been productive, and it spawned a whole family of variants, including SWE-bench Verified, the multilingual Multi-SWE-bench, and the enterprise-scale SWE-Bench Pro. But it captures a narrow slice of what software engineers actually do. A working developer spends a lot of time reading unfamiliar code to answer questions, writing tests to lock in behavior, and refactoring to keep a codebase healthy, and none of that reduces cleanly to "did the patch turn the tests green."

SWE-Atlas was designed to fill that gap. The authors describe three deliberate departures from prior benchmarks.^[1] First, it targets task categories that are underrepresented but practically important: comprehension, testing, and refactoring rather than only bug fixes and feature work. Second, it uses category-specific evaluation protocols instead of a single pass/fail test gate, because a good answer to a codebase question or a well-structured refactor cannot be judged by tests alone. Third, the tasks are deliberately under-specified and agentic. Prompts read more like the vague requests a real teammate would send than the tightly scoped issue reports used in SWE-bench, so the agent has to explore the repository, gather context, and decide what to do.

What it measures and how

The 284 tasks are drawn from 18 actively maintained open-source repositories, most of them under the GPL family of licenses.^[1] The task mix spans four language ecosystems: Go (106 tasks), Python (84), TypeScript and JavaScript together (56), and C and C++ (38).^[1] Each of the three categories uses its own grading pipeline that blends automated verification with rubric scoring performed by an LLM judge.

Codebase Q&A asks the agent a question about how a repository works, and answers are scored against a per-task rubric. Each task averages about 10.5 rubric items, split into Answer Comprehensiveness checks and Negative Rubrics, with the negative items treated as must-pass so that confidently wrong or fabricated claims are penalized.^[1]

Test Writing asks the agent to add tests for a target, then runs a three-part evaluation: a manifest check (an LLM judge confirms the right things were tested), mutation testing (the new tests must pass on correct code and fail on deliberately mutated code), and a rubric covering Test Comprehensiveness as a must-pass item plus Test Placement, Suite Conventions, and Bucket Conventions.^[1] These tasks average about 17.1 rubric items each, well above the roughly 10.5 of Codebase Q&A.^[1]

Refactoring asks the agent to restructure code without changing behavior. It is graded by running the existing regression suite (about 18 tests per task on average) alongside a rubric of roughly 17.4 items covering Code Maintainability and Artifact Cleanup as must-pass criteria, plus Documentation Maintainability and Negative Rubrics.^[1]

The mix of mutation testing, regression testing, and must-pass rubrics is what lets the benchmark separate "the code runs" from "the code is good."
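The grading logic described above can be illustrated with a small sketch. It is not the benchmark's actual harness: the class and function names are hypothetical, the rubric verdicts that an LLM judge would supply are hard-coded, and two aggregation rules are assumptions made for illustration only (every mutant must be caught, and a task passes when the programmatic check and all must-pass rubric items succeed).

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class RubricItem:
    """One rubric line. `satisfied` stands in for an LLM judge's verdict."""
    name: str
    must_pass: bool
    satisfied: bool


def mutation_check(passes_on_original: bool,
                   passes_on_mutants: Sequence[bool]) -> bool:
    """New tests must pass on correct code and fail on mutated code.

    ASSUMPTION: every mutant has to be caught; the real threshold may differ.
    """
    return passes_on_original and not any(passes_on_mutants)


def rubric_summary(items: Sequence[RubricItem]) -> Tuple[bool, float]:
    """Return (all must-pass items satisfied, fraction of items satisfied)."""
    must_ok = all(item.satisfied for item in items if item.must_pass)
    fraction = sum(item.satisfied for item in items) / len(items)
    return must_ok, fraction


def task_passes(programmatic_ok: bool, items: Sequence[RubricItem]) -> bool:
    """ASSUMED rule: programmatic check AND every must-pass rubric item."""
    must_ok, _ = rubric_summary(items)
    return programmatic_ok and must_ok


# Hypothetical Test Writing task using rubric categories named in the article.
items = [
    RubricItem("Test Comprehensiveness", must_pass=True, satisfied=True),
    RubricItem("Test Placement", must_pass=False, satisfied=False),
    RubricItem("Suite Conventions", must_pass=False, satisfied=True),
]

all_mutants_caught = mutation_check(True, [False, False])
one_mutant_survives = mutation_check(True, [False, True])

verdict_clean = task_passes(all_mutants_caught, items)
verdict_weak_tests = task_passes(one_mutant_survives, items)
print(verdict_clean, verdict_weak_tests)
```

In this sketch a suite that lets a single mutant survive fails the task even though its rubric verdicts are unchanged, which mirrors the separation between "the code runs" and "the code is good".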

The table below summarizes the three workflows.

Task category	Tasks	What the agent does	How it is scored
Codebase Q&A	124	Answer a question about an unfamiliar codebase	Rubric grading (~10.5 items); comprehensiveness plus must-pass negative rubrics ^[1]
Test Writing	90	Write tests for a target in the repository	Manifest check, mutation testing, and rubric (~17.1 items) ^[1]
Refactoring	70	Restructure code while preserving behavior	Regression tests (~18 per task) plus maintainability and cleanup rubric (~17.4 items) ^[1]

Results

The paper evaluates a range of frontier and open-weight systems, run through real coding-agent harnesses such as Codex and Claude Code.^[1] The headline metric is Pass@1, the average per-trial pass rate across all 284 tasks, reported with a 95 percent confidence interval; the paper also reports Pass3, the fraction of tasks a model solves on all three independent trials, as a consistency measure.^[1] GPT-5.4 running in Codex leads at 43.49 Pass@1, narrowly ahead of Opus 4.7 in Claude Code at 41.89.^[1] Selected results are shown below.^[1]
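As a rough illustration of the two metrics, the sketch below computes Pass@1 and Pass3 from per-trial outcomes. The function names and toy data are hypothetical, and the confidence interval uses a normal approximation over all trials, which is an assumption rather than a method confirmed by the paper.

```python
import math
from typing import Sequence


def pass_at_1(results: Sequence[Sequence[bool]]) -> float:
    """Average per-trial pass rate: passed trials divided by all trials."""
    n_trials = sum(len(task) for task in results)
    return sum(sum(task) for task in results) / n_trials


def pass_all_trials(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks solved on every trial (Pass3 with 3 trials per task)."""
    return sum(all(task) for task in results) / len(results)


def ci95_half_width(p: float, n_trials: int) -> float:
    """Normal-approximation 95% half-width. ASSUMED method, not from the paper."""
    return 1.96 * math.sqrt(p * (1.0 - p) / n_trials)


# Toy data (hypothetical): 4 tasks x 3 independent trials each.
toy = [
    [True, True, True],
    [True, False, True],
    [False, False, False],
    [True, True, False],
]

toy_pass1 = pass_at_1(toy)        # 7 of 12 trials pass
toy_pass3 = pass_all_trials(toy)  # 1 of 4 tasks passes all three trials
toy_ci = ci95_half_width(toy_pass1, 12)
print(f"Pass@1 = {toy_pass1:.3f} +/- {toy_ci:.3f}, Pass3 = {toy_pass3:.2f}")
```

Under the assumption of 284 tasks with three trials each (852 trials), `ci95_half_width(0.4349, 852)` gives a half-width near 3.3 points, in line with the interval reported for the top entry.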

Model (harness)	Pass@1 (± 95% CI)	Pass3	Q&A	Test Writing	Refactoring
GPT-5.4 (Codex)	43.49 ± 3.32	29.2%	40.80%	44.36%	44.29%
Opus 4.7 (Claude Code)	41.89 ± 3.31	29.2%	40.30%	38.51%	48.57%
GPT-5.3 (Codex)	37.38 ± 3.25	24.3%	32.60%	38.98%	42.38%
Opus 4.6 (Claude Code)	34.93 ± 3.20	22.9%	33.30%	36.67%	35.58%
Sonnet 4.6 (Claude Code)	31.63 ± 3.12	14.4%	31.20%	31.76%	32.21%
Gemini 3.1 Pro	25.23 ± 2.91	13.9%	16.03%	31.23%	33.81%
GLM 5 (open-weight)	24.03 ± 2.87	11.6%	20.50%	28.74%	24.24%

Even the top scores sit in the low 40s, well below the saturating numbers that frontier models now post on SWE-bench Verified, which is the point: SWE-Atlas is meant to be hard and to stay informative as agents improve. The best open-weight entry, GLM 5, trails the frontier proprietary systems by close to 20 points.^[1] The Pass3 column tells a second story. Every model drops sharply from Pass@1 to Pass3, losing between roughly 30 percent (Opus 4.7) and 54 percent (Sonnet 4.6) of its score in relative terms, which means agents do not reliably reproduce a correct result across repeated attempts.^[1]
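The consistency drop and the open-weight gap can be recomputed directly from the results table. The snippet below is only arithmetic on the reported Pass@1 and Pass3 values; the variable and function names are illustrative.

```python
# (model, Pass@1, Pass3) copied from the results table above.
ROWS = [
    ("GPT-5.4 (Codex)", 43.49, 29.2),
    ("Opus 4.7 (Claude Code)", 41.89, 29.2),
    ("GPT-5.3 (Codex)", 37.38, 24.3),
    ("Opus 4.6 (Claude Code)", 34.93, 22.9),
    ("Sonnet 4.6 (Claude Code)", 31.63, 14.4),
    ("Gemini 3.1 Pro", 25.23, 13.9),
    ("GLM 5 (open-weight)", 24.03, 11.6),
]


def relative_drop(pass1: float, pass3: float) -> float:
    """Share of the Pass@1 score that is lost when requiring all three trials."""
    return (pass1 - pass3) / pass1


drops = {name: relative_drop(p1, p3) for name, p1, p3 in ROWS}

# Print models from the most to the least consistent.
for name, drop in sorted(drops.items(), key=lambda kv: kv[1]):
    print(f"{name:<26} loses {drop:.1%} of its Pass@1 score")

# Gap in Pass@1 points between the best entry and the open-weight entry.
best_pass1 = max(p1 for _, p1, _ in ROWS)
glm_pass1 = next(p1 for name, p1, _ in ROWS if name.startswith("GLM 5"))
open_weight_gap = best_pass1 - glm_pass1
print(f"Open-weight gap: {open_weight_gap:.2f} points")
```

The relative losses range from about 30 percent for Opus 4.7 to about 54 percent for Sonnet 4.6, and the gap between GPT-5.4 and GLM 5 comes to 19.46 points.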

Main findings

The recurring theme across all three categories is that functional correctness runs ahead of engineering quality. Models pass the mechanical checks (mutation tests and regression tests) far more often than they pass the rubric checks, with gaps of roughly 10 to 40 points between the two.^[1] In other words, agents can make code work without making it good.

In Codebase Q&A, the authors find that frontier models have become "execution oriented," running code in the sandbox to ground their answers rather than reasoning from the source text alone.^[1] The failure modes differ by family: Claude models most often fall short on supplying runtime evidence for their claims (about 46 percent of their failures), while GPT models more often give incomplete or missing information.^[1] In Test Writing, agents reliably produce comprehensive-looking suites but lean on weak assertions, testing mainly the happy path and missing boundary conditions and negative-space cases; top models reach about 44 percent on mutation testing but only around 34 percent on the rubric.^[1] In Refactoring, performance degrades as the change grows. Models handle localized edits but struggle with multi-file refactors, missing call sites on a meaningful share of tasks, and their most common rubric failures are artifact cleanup and code maintainability.^[1] The refactoring tasks are also far larger in scope than prior benchmarks: by lines changed they are roughly twice the size of SWE-Bench Pro tasks and about 30 times the size of SWE-bench Verified tasks.^[1]

Significance

SWE-Atlas reflects a broader shift in how the field thinks about agentic coding. As issue-resolution scores climb toward saturation, a benchmark that holds models accountable for maintainability, test rigor, and clean refactors gives a more honest read on whether an agent could be trusted with day-to-day engineering rather than just well-scoped bug tickets. The hybrid evaluation design, programmatic checks for what can be automated and rubric grading for what cannot, is also a notable methodological contribution, since it tries to operationalize "good engineering" in a way that does not collapse into a single test gate. The benchmark, dataset, and evaluation harness are released openly under the Apache 2.0 license through Scale AI's research arm, with separate public leaderboards for each of the three workflows.^[2]^[3] The authors position it not as a replacement for SWE-bench-style benchmarks but as a complement that measures correctness and engineering quality side by side.^[1]

References

1. Raghavendra, Mohit; Dan, Soham; Romero Calvo, Miguel; He, Yannis Yiming; Mols, Johannes Baptist; Anand, Gautam; McCollum, Cole; Arakelyan, Edgar; Bharadwaj, Vijay; Park, Andrew; Da, Jeff; Rezaei, MohammadHossein; Liu, Bing; Kenstler, Brad; He, Yunzhong. "SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution." arXiv:2605.08366, submitted 8 May 2026. https://arxiv.org/abs/2605.08366
2. Scale Labs. "SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution." https://labs.scale.com/papers/sweatlas
3. Scale AI. "SWE-Atlas" (open-source benchmark repository). GitHub. https://github.com/scaleapi/SWE-Atlas
