SWE-Atlas
Last reviewed
Jun 3, 2026
Sources
3 citations
Review status
Source-backed
Revision
v1 · 1,534 words
SWE-Atlas is a benchmark for evaluating AI coding agents on professional software-engineering work that goes beyond fixing bugs and resolving issues. Built by Scale AI and introduced in the May 2026 paper SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution (arXiv:2605.08366), it groups 284 expert-authored tasks into three workflows that earlier benchmarks largely ignored: Codebase Q&A (124 tasks), Test Writing (90 tasks), and Refactoring (70 tasks).[1][2] Rather than scoring only whether a patch makes a hidden test suite pass, SWE-Atlas pairs programmatic checks with rubric-based grading so that it can measure engineering quality, including test and refactor completeness, maintainability, reusable abstractions, and codebase hygiene.[1] The paper, submitted on 8 May 2026 by a team of 15 Scale AI researchers led by Mohit Raghavendra and including Soham Dan, Yannis Yiming He, and Yunzhong He, reports that GPT-5.4 and Claude Opus 4.7 lead the field with overall scores in the low 40s, while the best open-weight model evaluated, GLM 5, trails them by close to 20 points.[1][3]
Most agentic-coding evaluation has converged on the issue-resolution format popularized by SWE-bench, where a model receives a GitHub issue and a repository snapshot and must produce a patch that passes a held-out test suite.[1] That format has been productive, and it spawned a whole family of variants, including SWE-bench Verified, the multilingual Multi-SWE-bench, and the enterprise-scale SWE-Bench Pro. But it captures a narrow slice of what software engineers actually do. A working developer spends a lot of time reading unfamiliar code to answer questions, writing tests to lock in behavior, and refactoring to keep a codebase healthy, and none of that reduces cleanly to "did the patch turn the tests green."
SWE-Atlas was designed to fill that gap. The authors describe three deliberate departures from prior benchmarks.[1] First, it targets task categories that are underrepresented but practically important: comprehension, testing, and refactoring rather than only bug fixes and feature work. Second, it uses category-specific evaluation protocols instead of a single pass/fail test gate, because a good answer to a codebase question or a well-structured refactor cannot be judged by tests alone. Third, the tasks are deliberately under-specified and agentic. Prompts read more like the vague requests a real teammate would send than the tightly scoped issue reports used in SWE-bench, so the agent has to explore the repository, gather context, and decide what to do.
The 284 tasks are drawn from 18 actively maintained open-source repositories, most of them under the GPL family of licenses.[1] The task mix spans four language ecosystems: Go (106 tasks), Python (84), TypeScript and JavaScript together (56), and C and C++ (38).[1] Each of the three categories uses its own grading pipeline that blends automated verification with rubric scoring performed by an LLM judge.
Codebase Q&A asks the agent a question about how a repository works, and answers are scored against a per-task rubric. Each task averages about 10.5 rubric items, split into Answer Comprehensiveness checks and Negative Rubrics, with the negative items treated as must-pass so that confidently wrong or fabricated claims are penalized.[1]
Test Writing asks the agent to add tests for a target, then runs a three-part evaluation: a manifest check (an LLM judge confirms the right things were tested), mutation testing (the new tests must pass on correct code and fail on deliberately mutated code), and a rubric covering Test Comprehensiveness as a must-pass plus Test Placement, Suite Conventions, and Bucket Conventions.[1] These tasks are the most rubric-heavy, averaging about 17.1 rubric items each.[1]
Refactoring asks the agent to restructure code without changing behavior, and it is graded by running the existing regression suite (about 18 tests per task on average) alongside a rubric of roughly 17.4 items covering Code Maintainability and Artifact Cleanup as must-pass criteria, plus Documentation Maintainability and Negative Rubrics.[1]
The mix of mutation testing, regression testing, and must-pass rubrics is what lets the benchmark separate "the code runs" from "the code is good."
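The mutation-testing step described above can be illustrated with a toy example. This is a minimal sketch, not the benchmark's harness: the function `clamp`, the three hand-written mutants, the two suites, and the `mutation_check` helper are all hypothetical names chosen only to show why a suite has to pass on correct code and fail on mutated code before it counts.

```python
from typing import Callable, Dict, Sequence

Impl = Callable[[int, int, int], int]
Suite = Callable[[Impl], bool]


# Hypothetical function under test (not taken from any SWE-Atlas repository).
def clamp(value: int, low: int, high: int) -> int:
    return max(low, min(value, high))


# Hand-written mutants: each one breaks clamp in a small way.
def mutant_ignores_low(value: int, low: int, high: int) -> int:
    return min(value, high)


def mutant_ignores_high(value: int, low: int, high: int) -> int:
    return max(low, value)


def mutant_swapped_bounds(value: int, low: int, high: int) -> int:
    return max(high, min(value, low))


MUTANTS = [mutant_ignores_low, mutant_ignores_high, mutant_swapped_bounds]


def weak_suite(impl: Impl) -> bool:
    """Happy path only: a value that is already inside the range."""
    return impl(5, 0, 10) == 5


def strong_suite(impl: Impl) -> bool:
    """Happy path plus both boundaries."""
    return (
        impl(5, 0, 10) == 5
        and impl(-3, 0, 10) == 0
        and impl(42, 0, 10) == 10
    )


def mutation_check(suite: Suite, reference: Impl, mutants: Sequence[Impl]) -> Dict:
    """Accept a suite only if it passes on the reference and fails on every mutant."""
    passes_reference = suite(reference)
    survived = [m.__name__ for m in mutants if suite(m)]
    return {
        "passes_reference": passes_reference,
        "survived": survived,
        "accepted": passes_reference and not survived,
    }


print(mutation_check(weak_suite, clamp, MUTANTS))
print(mutation_check(strong_suite, clamp, MUTANTS))
```

In this toy setup the happy-path suite passes on the reference implementation but lets two of the three mutants survive, so it is rejected; adding the two boundary assertions kills every mutant.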
The table below summarizes the three workflows.
| Task category | Tasks | What the agent does | How it is scored |
|---|---|---|---|
| Codebase Q&A | 124 | Answer a question about an unfamiliar codebase | Rubric grading (~10.5 items); comprehensiveness plus must-pass negative rubrics [1] |
| Test Writing | 90 | Write tests for a target in the repository | Manifest check, mutation testing, and rubric (~17.1 items) [1] |
| Refactoring | 70 | Restructure code while preserving behavior | Regression tests (~18 per task) plus maintainability and cleanup rubric (~17.4 items) [1] |
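A must-pass rubric item acts as a gate rather than as one more point in an average. The sketch below shows one plausible way to combine programmatic checks with rubric verdicts. The aggregation rule (a task passes only if the automated checks pass and no must-pass item fails) is an assumption for illustration, since the exact rule is not given here, and `RubricItem` and `grade_task` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class RubricItem:
    description: str
    passed: bool             # verdict from the judge
    must_pass: bool = False  # True for gating criteria


def grade_task(programmatic_ok: bool, items: Sequence[RubricItem]) -> Dict:
    """Assumed rule: automated checks must pass and no must-pass item may fail."""
    failed_gates = [i.description for i in items if i.must_pass and not i.passed]
    rubric_score = sum(i.passed for i in items) / len(items) if items else 0.0
    return {
        "passed": programmatic_ok and not failed_gates,
        "rubric_score": rubric_score,
        "failed_gates": failed_gates,
    }


# Hypothetical refactoring task: regression tests pass, but a leftover
# debug file trips the Artifact Cleanup gate.
items = [
    RubricItem("Code Maintainability: helper extracted, no duplication", True, must_pass=True),
    RubricItem("Artifact Cleanup: no stray scratch or debug files", False, must_pass=True),
    RubricItem("Documentation Maintainability: docstrings updated", True),
    RubricItem("Negative Rubric: no unrelated behavior changes", True),
]

print(grade_task(programmatic_ok=True, items=items))
```

Here three of four rubric items pass and the regression suite is green, yet the task fails on the single must-pass item, which is the "code runs" versus "code is good" distinction in miniature.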
The paper evaluates a range of frontier and open-weight systems, run through real coding-agent harnesses such as Codex and Claude Code.[1] The headline metric is Pass@1, the average per-trial pass rate across all 284 tasks, reported with a 95 percent confidence interval; the paper also reports Pass3, the fraction of tasks a model solves on all three independent trials, as a consistency measure.[1] GPT-5.4 running in Codex leads at 43.49 Pass@1, narrowly ahead of Opus 4.7 in Claude Code at 41.89.[1] Selected results are shown below.[1]
| Model (harness) | Pass@1 (%, ± 95% CI) | Pass3 | Q&A | Test Writing | Refactoring |
|---|---|---|---|---|---|
| GPT-5.4 (Codex) | 43.49 ± 3.32 | 29.2% | 40.80% | 44.36% | 44.29% |
| Opus 4.7 (Claude Code) | 41.89 ± 3.31 | 29.2% | 40.30% | 38.51% | 48.57% |
| GPT-5.3 (Codex) | 37.38 ± 3.25 | 24.3% | 32.60% | 38.98% | 42.38% |
| Opus 4.6 (Claude Code) | 34.93 ± 3.20 | 22.9% | 33.30% | 36.67% | 35.58% |
| Sonnet 4.6 (Claude Code) | 31.63 ± 3.12 | 14.4% | 31.20% | 31.76% | 32.21% |
| Gemini 3.1 Pro | 25.23 ± 2.91 | 13.9% | 16.03% | 31.23% | 33.81% |
| GLM 5 (open-weight) | 24.03 ± 2.87 | 11.6% | 20.50% | 28.74% | 24.24% |
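The two headline metrics can be computed from per-trial outcomes with a few lines of code. The sketch below follows the definitions given above (Pass@1 as the mean per-trial pass rate, Pass3 as the share of tasks solved on every trial). The interval method is an assumption: a normal-approximation interval that treats all 284 × 3 = 852 trials as independent gives a half-width of about 3.3 points at a 43.49 percent pass rate, close to the reported ± 3.32, but the paper's exact procedure is not stated here. The function names and the toy data are hypothetical.

```python
import math
from typing import Mapping, Sequence

Results = Mapping[str, Sequence[bool]]


def pass_at_1(results: Results) -> float:
    """Mean of per-task pass rates over all trials."""
    per_task = [sum(trials) / len(trials) for trials in results.values()]
    return sum(per_task) / len(per_task)


def pass_all_trials(results: Results) -> float:
    """Fraction of tasks solved on every trial (Pass3 when there are three trials)."""
    return sum(all(trials) for trials in results.values()) / len(results)


def ci_half_width(p: float, n_trials: int, z: float = 1.96) -> float:
    """Normal-approximation half-width for a pass rate p (assumed method)."""
    return z * math.sqrt(p * (1.0 - p) / n_trials)


# Hypothetical outcomes: four tasks, three independent trials each.
toy = {
    "task-a": [True, True, True],
    "task-b": [True, False, True],
    "task-c": [False, False, False],
    "task-d": [True, True, False],
}

p1 = pass_at_1(toy)
print(f"Pass@1 = {p1:.4f}")
print(f"Pass3  = {pass_all_trials(toy):.4f}")
print(f"half-width at p=0.4349, n=852: {ci_half_width(0.4349, 852):.4f}")
```

Only one of the four toy tasks is solved on all three trials, so the all-trials rate sits well below the per-trial average, the same pattern the Pass3 column shows in the table.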
Even the top scores sit in the low-to-mid 40s, well below the saturating numbers that frontier models now post on SWE-bench Verified, which is the point: SWE-Atlas is meant to be hard and to stay informative as agents improve. The best open-weight entry, GLM 5, trails the frontier proprietary systems by close to 20 points.[1] The Pass3 column tells a second story. Every model drops sharply from Pass@1 to Pass3, losing roughly 30 to 55 percent of its score (GPT-5.4 falls from 43.49 to 29.2, and Sonnet 4.6 from 31.63 to 14.4), which means agents do not reliably reproduce a correct result across repeated attempts.[1]
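The size of the Pass@1-to-Pass3 drop can be checked directly from the results table. The short script below copies the two columns from the table and computes the relative loss for each model; the `relative_drop` helper is a hypothetical name, and the script is only an arithmetic check, not part of the benchmark's tooling.

```python
# (Pass@1, Pass3) pairs copied from the results table, in percent.
RESULTS = {
    "GPT-5.4 (Codex)": (43.49, 29.2),
    "Opus 4.7 (Claude Code)": (41.89, 29.2),
    "GPT-5.3 (Codex)": (37.38, 24.3),
    "Opus 4.6 (Claude Code)": (34.93, 22.9),
    "Sonnet 4.6 (Claude Code)": (31.63, 14.4),
    "Gemini 3.1 Pro": (25.23, 13.9),
    "GLM 5 (open-weight)": (24.03, 11.6),
}


def relative_drop(pass_at_1: float, pass_3: float) -> float:
    """Share of the Pass@1 score that is lost when all three trials must succeed."""
    return (pass_at_1 - pass_3) / pass_at_1


drops = {name: relative_drop(p1, p3) for name, (p1, p3) in RESULTS.items()}

# Print models from smallest to largest relative drop.
for name, drop in sorted(drops.items(), key=lambda kv: kv[1]):
    print(f"{name:26s} {drop:.1%}")
```

The smallest relative drop belongs to Opus 4.7 and the largest to Sonnet 4.6, so consistency across trials does not simply track the Pass@1 ranking.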
The recurring theme across all three categories is that functional correctness runs ahead of engineering quality. Models pass the mechanical checks (mutation tests and regression tests) far more often than they pass the rubric checks, with gaps of roughly 10 to 40 points between the two.[1] In other words, agents can make code work without making it good.
In Codebase Q&A, the authors find that frontier models have become "execution oriented," running code in the sandbox to ground their answers rather than reasoning from the source text alone.[1] The failure modes differ by family: Claude models most often fall short on supplying runtime evidence for their claims (about 46 percent of their failures), while GPT models more often give incomplete or missing information.[1] In Test Writing, agents reliably produce comprehensive-looking suites but lean on weak assertions, testing mainly the happy path and missing boundary conditions and negative-space cases; top models reach about 44 percent on mutation testing but only around 34 percent on the rubric.[1] In Refactoring, performance degrades as the change grows. Models handle localized edits but struggle with multi-file refactors, missing call sites on a meaningful share of tasks, and their most common rubric failures are artifact cleanup and code maintainability.[1] The refactoring tasks are also far larger in scope than those in prior benchmarks: by lines changed they are roughly twice the size of SWE-Bench Pro tasks and about 30 times the size of SWE-bench Verified tasks.[1]
SWE-Atlas reflects a broader shift in how the field thinks about agentic coding. As issue-resolution scores climb toward saturation, a benchmark that holds models accountable for maintainability, test rigor, and clean refactors gives a more honest read on whether an agent could be trusted with day-to-day engineering rather than just well-scoped bug tickets. The hybrid evaluation design (programmatic checks for what can be automated, rubric grading for what cannot) is also a notable methodological contribution, since it tries to operationalize "good engineering" in a way that does not collapse into a single test gate. The benchmark, dataset, and evaluation harness are released openly under the Apache 2.0 license through Scale AI's research arm, with separate public leaderboards for each of the three workflows.[2][3] The authors position it not as a replacement for SWE-bench-style benchmarks but as a complement that measures correctness and engineering quality side by side.[1]