SWE-agent
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 4,750 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 4,750 words
Add missing citations, update stale details, or suggest a clearer explanation.
SWE-agent is an open-source autonomous software engineering agent created by the Princeton NLP group (with co-authors from Stanford) and first released on April 2, 2024.[1][2] Built on top of a language model such as GPT-4 or Claude, it takes a real GitHub issue plus the surrounding repository and attempts to produce a code patch that resolves the issue, mediated by a deliberately designed set of command-line tools called the Agent-Computer Interface (ACI).[3] The accompanying paper, "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering" (arXiv:2405.15793), was published at NeurIPS 2024 and reported a 12.5% resolve rate on the full SWE-bench test set, a significant jump over prior non-interactive baselines at the time.[3][4] SWE-agent is significant less for the absolute number than for codifying the ACI thesis: that language model agents are a new class of end users for whom interfaces should be redesigned, and that good interface design contributes as much to task performance as model scale.[3]
| Field | Value |
|---|---|
| Type | Autonomous software engineering AI agent |
| Developer | Princeton NLP (with Stanford NLP collaborators) |
| Initial release | April 2, 2024 (v0.1)[2] |
| Latest major version | v1.1.0 (May 22, 2025)[2] |
| License | MIT[1] |
| Paper | arXiv:2405.15793 (May 2024); NeurIPS 2024[3][4] |
| Repository | github.com/SWE-agent/SWE-agent[1] |
| Key benchmark | SWE-bench (12.5% pass@1 at launch with GPT-4)[3] |
| Variants | SWE-agent EnIGMA, SWE-agent-LM, mini-SWE-agent[1][5][6] |
SWE-agent grew directly out of the SWE-bench benchmark, which had been released by the same Princeton lab in October 2023 and which presented a then-unsolved evaluation set of 2,294 real GitHub issues paired with executable test suites.[7] The benchmark's initial baseline numbers were strikingly low: zero-shot prompting of GPT-4 with a "retrieve, then patch" pipeline solved only a small fraction of issues, and the field lacked an open agentic system that could iteratively act on a repository.[7] In March 2024, Cognition AI's announcement of Devin and its 13.86% score on a 25% subset of SWE-bench drew widespread attention but was a closed system; this created demand for an open, reproducible counterpart.[8]
The Princeton team, John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press, released SWE-agent on April 2, 2024 along with a project page and an open codebase. The system immediately set a new state of the art on the public SWE-bench leaderboard at 12.47% (later restated as 12.5% in the published paper) using GPT-4.[9][3] The release came roughly three weeks after Devin's public reveal and was widely covered as the first reproducible open-source agent operating in the same regime.[8]
The team had previously published several pieces of upstream research that fed directly into SWE-agent's design. Co-author Shunyu Yao was a primary author of ReAct (2022), an early framework in which a language model alternates between explicit reasoning steps and tool actions; SWE-agent inherits this thought-then-action structure at the level of each turn.[3] Co-author Carlos E. Jimenez led the SWE-bench paper, ensuring tight alignment between the benchmark and the agent designed against it. Co-author Ofir Press, then a postdoctoral researcher at Princeton, had previously worked on long-context language model evaluation and contributed the empirical, ablation-heavy framing of the SWE-agent paper.[3]
A first preprint of the paper, "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering," appeared on arXiv on May 6, 2024 (revised May 30, 2024, and November 11, 2024).[3] The work was accepted at the Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS 2024) and appears in the official proceedings.[4] The paper's core empirical claims, beyond the headline benchmark numbers, were a set of controlled ablations showing that the design of the interface itself, including file viewer window size, edit linting, and search output formatting, accounted for a large share of the gains over a naive shell-only baseline.[3]
The paper opens by framing language model agents as a new class of end user. Traditional human-facing computer interfaces are optimized for a user who has high-resolution vision, dexterous pointing, and very limited working memory; a language model has the opposite profile (no vision in the textual setting, no pointer, but a fairly large working buffer and a strong but error-prone implicit knowledge of tools). The argument is that giving such a user the standard human shell, while convenient, is leaving performance on the table; the rest of the paper supplies the experimental evidence.[3] This framing has been widely cited in subsequent agent work and is the most durable contribution of the paper, more so than the specific 12.5% number.[13]
After the initial v0.1 release the project moved through several milestones documented in the public GitHub release history:[2]
In parallel, the Princeton team released mini-SWE-agent in July 2025, a deliberate simplification that strips away nearly all of the original tool scaffolding while still scoring competitively on SWE-bench Verified.[6] The original SWE-agent repository was placed in maintenance-only mode in late 2025, with new development concentrated on mini-SWE-agent and SWE-smith.[5]
The central conceptual contribution of the paper is the Agent-Computer Interface, abbreviated ACI. The authors argue that a language model interacting with a computer is a new kind of end user: its needs differ from those of a human (it has no eyes, no mouse, and degraded performance on long noisy contexts) but also differ from those of a traditional program (it can read prose, follow instructions, and tolerate small natural-language conventions).[3] An ACI is therefore a thin, deliberately designed shell layer that exposes the operating system and the code repository through commands optimized for an LLM, rather than reusing the standard human-oriented Unix toolchain.[3]
The ACI thesis can be stated as three design pressures, each derived from the paper's ablations:[3]
grep -r over a large repository can return thousands of lines and overflow context; the SWE-agent search_dir command instead returns only the filenames containing a match, with the per-file hit count.[10] The same logic applies to file viewing.goto, scroll_up, scroll_down), which keeps every observation small and uniform in shape.[10]The paper frames the ACI as analogous to the role of human-computer interface design in the history of software: just as visual UIs replaced raw command lines for human users, an LLM-facing interface must be invented from scratch rather than inherited from human conventions.[3]
SWE-agent's default toolset is intentionally small. The exact list has evolved across releases, but the core tools described in the paper and the v1.0 documentation include the following.[3][10]
| Tool | Purpose |
|---|---|
open | Open a file at an optional line number and load it into the file viewer window. |
goto | Jump the file viewer window to a specific line of the currently open file. |
scroll_up / scroll_down | Move the file viewer window by one page (default 100 lines) in either direction. |
edit | Replace a specified line range in the open file with new content; output is linted, and syntactic failures are rejected with a diagnostic. |
search_file | Search inside the currently open file for a substring; output is bounded. |
search_dir | Search the project directory for a substring and return the matching filenames (with hit counts) rather than full lines. |
find_file | Locate a file by name pattern within the project tree. |
submit | End the episode and present the working tree as the candidate patch for grading. |
The file viewer presents at most 100 lines per turn, annotated with line numbers and an ellipsis marker for hidden ranges; both the size and the line numbering are findings from the paper's ablations, which showed that very large viewer windows degrade performance because they crowd out reasoning tokens.[3][10] When a command produces no output, SWE-agent returns the explicit string "Your command ran successfully and did not produce any output." rather than an empty turn, again so that the model receives a clean, structured observation.[10]
In v0.7, the EnIGMA release added a class of Interactive Agent Tools (IATs): wrappers around long-lived utilities such as a debugger (gdb) that the agent can drive in a turn-by-turn manner from the same loop, plus output summarizers that compress very long traces before they re-enter context.[11]
At the implementation level SWE-agent is a fairly direct agent loop. A single rollout proceeds as follows:[3][10]
submit, the cost or turn budget is exhausted, or the harness encounters an error.The full prompt template, tool definitions, and per-task setup live in a single YAML configuration file, which the project emphasizes as a deliberate research-friendly choice: a new researcher can fork the YAML, change a tool description or a window size, and rerun the benchmark without touching the Python core.[1] In v1.0 the execution backend was replaced with SWE-ReX, a separate parallelization layer that allows hundreds of rollouts to run concurrently across containers or even remote workers; this turned what had been a 24-hour serial benchmark run on a laptop into a much faster sweep.[2]
The default action format is a thin wrapper around shell-style commands. The model is instructed to emit a fenced block containing exactly one command, optionally preceded by a natural-language thought, for example:
THOUGHT: I should locate the file that defines the affected function.
ACTION:
search_dir "compute_loss"
The harness's regex parser extracts the action, refuses anything containing more than one command, and returns a syntax-error observation if parsing fails. This deliberately heavy-handed parser is a frequent source of model frustration but, in the authors' view, is what makes long rollouts stable; without it, models routinely emit multi-command actions that interleave reasoning with execution and cause non-deterministic environment changes.[3]
In v1.0, an alternative tool-bundle interface was added, in which each tool is registered with the LLM provider's native function calling schema. This works directly with providers that support strict JSON schemas (such as OpenAI's structured outputs) and falls back to a YAML-described tool grammar elsewhere through LiteLLM.[2]
Each task is executed inside a Docker image that has been pre-built with the target repository at the appropriate commit and with the project's test dependencies installed. SWE-bench provides those images for its task instances, and SWE-agent re-uses them directly. The agent's commands run inside the container; observations are returned as plain text. This isolation is what makes large parallel sweeps tractable: hundreds of containers can run in parallel without interfering with each other, and a misbehaving rollout cannot corrupt the host filesystem.[1][10]
Every LLM call is logged with token counts and dollar costs (using LiteLLM-reported per-provider prices), so a benchmark run produces a single CSV per task with all costs, latencies, and turn counts. This bookkeeping is what makes the published numbers reproducible: a researcher can not only repeat the benchmark but also confirm that they spent a comparable amount of money doing so.[2]
When SWE-agent landed on April 2, 2024, the public SWE-bench leaderboard was effectively bare: the only entries were the unagentic baselines from the original SWE-bench paper and a handful of internal results.[7] SWE-agent's 12.47% resolve rate on the full set with GPT-4 vaulted it to the top of the open leaderboard, sitting only a small margin below Cognition AI's closed Devin result of 13.86% (which was measured on a random 25% subset rather than the full test set, complicating direct comparison).[8][9]
The paper also reports an 87.7% pass@1 on HumanEvalFix, a smaller bug-fixing benchmark, again with GPT-4, demonstrating that the agent's gains transferred to a non-SWE-bench setting.[3] Ablations in the paper, holding the base model constant, showed that removing the file viewer linter or expanding the viewer window from 100 lines to a much larger size each cost several percentage points on SWE-bench Lite, providing direct evidence that the gains came from interface design rather than from any change in the underlying language model.[3]
Beyond GPT-4, the original paper reports baseline numbers for Claude 2 and a small open-weights model, in each case using the same harness and the same prompts. The general pattern was monotonic: stronger base models extracted more value from the ACI, but the ACI itself helped every model relative to a raw shell baseline, including weaker ones. This was the empirical basis for the paper's central claim that interface design is complementary to model scale rather than a substitute for it.[3]
Over the following months SWE-agent itself was overtaken on the leaderboard, but mostly by systems that inherited its template:
The longer-term impact is structural rather than numeric. SWE-agent provided the first openly reproducible recipe for the "agent + SWE-bench" loop, and most subsequent SWE-bench submitters (whether academic or industrial) have either forked SWE-agent's harness directly, copied its tool design, or built a near-equivalent ACI of their own. A 2025 survey of the SWE-bench leaderboards classified the majority of submitted systems as either SWE-agent forks or close architectural cousins.[13]
EnIGMA, short for Enhanced Interactive Generative Model Agent, is a port of SWE-agent to offensive cybersecurity Capture-the-Flag (CTF) challenges, released as SWE-agent v0.7 in September 2024.[11] The accompanying paper, "EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities" (arXiv:2409.16165), was led by researchers at Tel Aviv University, NYU Tandon, and Princeton, and was accepted at ICML 2025.[14]
EnIGMA's main technical contribution is the Interactive Agent Tool concept: a wrapper that lets the agent drive a long-running interactive program (a debugger, a network client) over multiple turns while preserving its state, rather than restarting the program each turn. The authors also identify and name a failure mode they call "soliloquizing," in which models hallucinate tool outputs without actually invoking the tool, and propose interface-level mitigations.[14] On the NYU CTF benchmark and Intercode-CTF, EnIGMA reportedly solves more than three times as many challenges as the previous best agent, with a 13.5% pass rate across roughly 200 NYU CTF challenges and state-of-the-art results across the four benchmarks (totalling 390 challenges) evaluated in the paper.[14]
SWE-agent-LM is an open-weights fine-tune of Qwen 2.5 Coder Instruct 32B that is trained to drive the SWE-agent harness. It was released alongside the SWE-smith data pipeline in v1.1.0 (May 22, 2025).[2][5]
SWE-smith itself ("SWE-smith: Scaling Data for Software Engineering Agents," arXiv:2504.21798) is a paper by John Yang and collaborators that describes a recipe for synthesizing software engineering training tasks at scale from arbitrary Python codebases.[5] Using the pipeline, the authors generated 50,000 task instances from 128 GitHub repositories, an order of magnitude larger than prior datasets, then collected about 5,000 expert agent trajectories with Claude 3.7 Sonnet running inside SWE-agent. Fine-tuning Qwen 2.5 Coder 32B on those trajectories yielded SWE-agent-LM-32B, which achieves 40.2% pass@1 on SWE-bench Verified and was, at release, the new state of the art for open-weight models.[5] The work was accepted as a NeurIPS 2025 Datasets and Benchmarks spotlight.[5]
mini-SWE-agent is a deliberately minimal rewrite released in July 2025.[6] The project's stated thesis is that most of SWE-agent's tool scaffolding can be removed without large performance losses, provided the base model is strong enough to drive a raw bash shell on its own. The agent class is roughly 100 lines of Python; it has no custom file viewer, no custom edit command, no built-in linter feedback loop, and no stateful shell; each tool call is an independent subprocess.run invocation, and history is a flat message list.[6]
Despite this, mini-SWE-agent has reported >74% on SWE-bench Verified with a strong base model, and it has displaced the full SWE-agent as the team's recommended starting point for new users.[6] The repository's README lists adoption by Meta, NVIDIA, IBM, Anyscale, Princeton University, Stanford University, and others as scaffolds in their own SWE-bench evaluations.[6]
The mini-SWE-agent design reflects a broader shift in the agent research community sometimes referred to as the "minimum viable agent" or "scaffolding minimalism" trend. The argument runs: as frontier models have absorbed more shell, editing, and code-reading capability into their pretraining and post-training distributions, the marginal value of a hand-built ACI has decreased; what is left is essentially a thin parser around a single shell call. The original SWE-agent paper, by contrast, was written in an era when GPT-4 could not reliably drive a raw shell over thirty turns; the ACI was, in effect, a crutch that made the agent loop possible at all. Mini-SWE-agent is a deliberate test of how much of that crutch can be removed.[6]
SWE-agent is significant in three connected ways.
First, it provided the first openly reproducible scaffold for the SWE-bench task, which is widely treated as the canonical benchmark for "real software engineering" by language model agents. Before SWE-agent, SWE-bench had no public agentic baseline; afterward, every major frontier model has a SWE-bench number, and almost all of those numbers were obtained inside SWE-agent or a clear derivative.[13]
Second, it crystallized the Agent-Computer Interface concept that has since spread well beyond software engineering. The same design pressures (bounded outputs, abstract navigation, structured edit feedback, fast in-loop validation) reappear in agents for the web, for spreadsheets, for computer use, and for cybersecurity. The EnIGMA paper makes the generalization explicit by transferring the ACI to a different domain and getting matching gains.[14]
Third, it served as a research vehicle. The same Princeton group used SWE-agent as the harness for SWE-bench Verified evaluations, for the SWE-smith data pipeline, for SWE-agent-LM, and for mini-SWE-agent; the rapid cadence of follow-up papers and tools all sharing a code base reflects the project's deliberate research-friendly design.[5][6]
A fourth, more sociological consequence is that SWE-agent normalized the practice of evaluating coding agents on a held-out, executable test set rather than on toy puzzles. The pre-SWE-bench evaluation tradition (HumanEval, MBPP, APPS) measured a model's ability to write a small self-contained function; the post-SWE-agent tradition treats coding as a multi-turn, repository-scale activity in which the model must read, edit, and test code it did not write. This shift in evaluation methodology has, in turn, shifted what major labs report when they release a model. By 2026, almost every frontier-model launch (from OpenAI, Anthropic, Google DeepMind, xAI, and others) features a SWE-bench Verified number among its headline benchmarks, where in 2023 such a number would have been unusual.[12]
SWE-agent sits in a family of LLM-driven coding and task agents, which can be roughly partitioned along two axes: how much custom tool scaffolding they layer on top of the base model, and whether they target benchmarks, developer workflows, or end-user product use. Several of the systems most often compared with SWE-agent are summarized below; this list is illustrative rather than exhaustive.
| System | Type | Tool philosophy | Source | Primary benchmark posture |
|---|---|---|---|---|
| SWE-agent | Research framework | Custom ACI: file viewer, scoped search, linted edit | Open (MIT)[1] | Native SWE-bench baseline |
| mini-SWE-agent | Research framework | Minimalist: shell + LM only[6] | Open (MIT) | SWE-bench Verified |
| OpenHands (formerly OpenDevin) | Open developer platform | Sandboxed shell + web browser + code tools | Open | SWE-bench leader among open systems[15] |
| Devin (Cognition AI) | Commercial product | Long-horizon planner + browser + shell | Closed | SWE-bench evaluations + product[8] |
| Aider | Developer CLI | Git-aware edits with repo map | Open | Aider Polyglot, occasional SWE-bench |
| Claude Code | Commercial CLI (Anthropic) | Native tool use over local FS + bash | Closed (CLI) | Internal SWE-bench numbers |
| OpenAI Codex CLI | Commercial CLI (OpenAI) | Native function calling over local FS + bash | Closed | Internal SWE-bench numbers |
| Cline | VSCode agent | IDE-integrated edit/run loop | Open | Aider Polyglot, anecdotal |
| Cursor | IDE | Inline + agent mode with custom tools | Closed | Anecdotal |
| Auto-GPT | Early autonomous agent | General-purpose tools, web + memory | Open | Pre-SWE-bench, no canonical number |
| BabyAGI | Early task agent | Task list + memory loop | Open | Pre-SWE-bench, no canonical number |
The most direct contrast is with Auto-GPT and BabyAGI from 2023, which popularized the basic agent loop pattern but operated with general-purpose tools and no benchmark scaffold; SWE-agent kept the loop but discarded the general-purpose tools in favor of a specialized ACI, and bound the system to a hard benchmark.[3] The most direct contrast in the other direction is with Devin, whose closed nature triggered the demand that SWE-agent met, and with OpenHands, a contemporaneous open project that pursued a broader developer-platform vision rather than a research-baseline vision.[15]
The minimalist branch of the family, exemplified by mini-SWE-agent, Aider, and to some extent Claude Code and Codex CLI, deliberately reverses the ACI design pressure: as base models have gotten stronger, the value of a custom tool layer has shrunk relative to giving the model a raw shell and a clear prompt. SWE-agent itself acknowledges this trend by recommending mini-SWE-agent as the default entry point for new users.[5][6]
The IDE-embedded branch (Cursor, Cline, GitHub Copilot in agent mode) shares SWE-agent's tool-call loop but adds two extra constraints: the user is in the loop, providing approval or rejection of each diff; and the agent's tools are tied to editor primitives (open file in tab, run linter, run debugger) rather than to a generic shell. Empirically, the two approaches are converging: IDE agents have grown background autonomous modes, and benchmark agents like SWE-agent have grown interactive trajectory viewers. The differences between them are increasingly about who the user is (a developer at a keyboard vs. a benchmark harness) rather than about underlying mechanism.
Within the academic agent literature, SWE-agent occupies a particular slot: it is the canonical example used to explain what an Agent-Computer Interface is. Subsequent papers proposing new ACIs for other domains (the EnIGMA paper for cybersecurity, the SWE-smith paper for data generation, various GUI-agent papers) cite SWE-agent as the prior art that defined the design space. This is partly a function of timing (it was the first widely-distributed open ACI) and partly a function of the paper's clarity in stating the ACI thesis as a generalizable design principle.[13][14]
The paper and subsequent commentary identify a number of limitations that apply both to SWE-agent specifically and to the SWE-bench-driven agent research program more broadly.