SWE-agent

AI Agents AI Code Generation Open Source AI

26 min read

Updated Jun 25, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 25, 2026

Fact-checked

In review queue

Sources

15 citations

Revision

v3 · 5,188 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

SWE-agent is an open-source autonomous software engineering agent created by the Princeton NLP group (Princeton Language and Intelligence, with co-authors from Stanford) and first released on April 2, 2024.^[1]^[2] Built on top of a large language model such as GPT-4 or Claude, it takes a real GitHub issue plus the surrounding repository and attempts to produce a code patch that resolves the issue, mediated by a deliberately designed set of command-line tools the authors call the Agent-Computer Interface (ACI).^[3] The accompanying paper, "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering" (arXiv:2405.15793), was published at NeurIPS 2024 and reported a 12.5% pass@1 resolve rate on the full SWE-bench test set with GPT-4 Turbo, more than triple the previous best of 3.8% set by a non-interactive, retrieval-augmented system.^[3]^[4] SWE-agent matters less for that absolute number than for codifying the ACI thesis: that language model agents are, in the paper's words, "a new category of end users with their own needs and abilities," and that good interface design contributes as much to task performance as model scale.^[3]

What is SWE-agent?

SWE-agent is a research framework and harness that turns a general-purpose large language model into an autonomous AI agent for software engineering. Given a natural-language bug report or feature request drawn from a real GitHub repository, it autonomously navigates the codebase, edits files, runs tests, and submits a candidate patch, all through a small set of LLM-optimized commands rather than the raw Unix shell.^[1]^[3] It is released under the MIT license and is the canonical open-source baseline for the SWE-bench benchmark.^[1]^[7] The project's one-line description is simply: "SWE-agent takes a GitHub issue and tries to automatically fix it, using your LM of choice."^[1]

Infobox

Field	Value
Type	Autonomous software engineering AI agent
Developer	Princeton NLP / Princeton Language and Intelligence (with Stanford NLP collaborators)
Initial release	April 2, 2024 (v0.1)^[2]
Latest major version	v1.1.0 (May 22, 2025)^[2]
License	MIT^[1]
Paper	arXiv:2405.15793 (May 2024); NeurIPS 2024^[3]^[4]
Repository	github.com/SWE-agent/SWE-agent^[1]
Key benchmark	SWE-bench (12.5% pass@1 at launch with GPT-4 Turbo)^[3]
Variants	SWE-agent EnIGMA, SWE-agent-LM, mini-SWE-agent^[1]^[5]^[6]

History

Origins at Princeton NLP

SWE-agent grew directly out of the SWE-bench benchmark, which had been released by the same Princeton lab in October 2023 and which presented a then-unsolved evaluation set of 2,294 real GitHub issues paired with executable test suites.^[7] The benchmark's initial baseline numbers were strikingly low: zero-shot prompting of GPT-4 with a "retrieve, then patch" pipeline solved only a small fraction of issues, and the field lacked an open agentic system that could iteratively act on a repository.^[7] In March 2024, Cognition AI's announcement of Devin and its 13.86% score on a 25% subset of SWE-bench drew widespread attention but was a closed system; this created demand for an open, reproducible counterpart.^[8]

The Princeton team, John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press, released SWE-agent on April 2, 2024 along with a project page and an open codebase. The system immediately set a new state of the art on the public SWE-bench leaderboard at 12.47% (rounded to 12.5% in the published paper) using GPT-4 Turbo.^[9]^[3] The release came roughly three weeks after Devin's public reveal and was widely covered as the first reproducible open-source agent operating in the same regime.^[8]

The team had previously published several pieces of upstream research that fed directly into SWE-agent's design. Co-author Shunyu Yao was a primary author of ReAct (2022), an early framework in which a language model alternates between explicit reasoning steps and tool actions; SWE-agent inherits this thought-then-action structure at the level of each turn.^[3] Co-author Carlos E. Jimenez led the SWE-bench paper, ensuring tight alignment between the benchmark and the agent designed against it. Co-author Ofir Press, then a postdoctoral researcher at Princeton, had previously worked on long-context language model evaluation and contributed the empirical, ablation-heavy framing of the SWE-agent paper.^[3]

Paper and NeurIPS 2024

A first preprint of the paper, "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering," appeared on arXiv on May 6, 2024 (revised May 30, 2024, and November 11, 2024).^[3] The work was accepted at the Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS 2024) and appears in the official proceedings.^[4] The paper's core empirical claims, beyond the headline benchmark numbers, were a set of controlled ablations showing that the design of the interface itself, including file viewer window size, edit linting, and search output formatting, accounted for a large share of the gains over a naive shell-only baseline.^[3]

The paper opens by framing language model agents as a new class of end user. As the abstract puts it, "we posit that LM agents represent a new category of end users with their own needs and abilities, and would benefit from specially-built interfaces to the software they use."^[3] Traditional human-facing computer interfaces are optimized for a user who has high-resolution vision, dexterous pointing, and very limited working memory; a language model has the opposite profile (no vision in the textual setting, no pointer, but a fairly large working buffer and a strong but error-prone implicit knowledge of tools). The argument is that giving such a user the standard human shell, while convenient, is leaving performance on the table; the rest of the paper supplies the experimental evidence. The authors state the central finding directly: "We show that ACIs tailored specifically for LMs outperform existing user interfaces (UIs) designed for human users, such as the Linux shell."^[3] This framing has been widely cited in subsequent agent work and is the most durable contribution of the paper, more so than the specific 12.5% number.^[13]

Release cadence

After the initial v0.1 release the project moved through several milestones documented in the public GitHub release history:^[2]

v0.5.0 (May 28, 2024) added documentation and an image cache flag that materially sped up repeated runs.
v0.7.0 (September 25, 2024) introduced the EnIGMA cybersecurity mode, Interactive Agent Tools, and output Summarizers.
v1.0.0 (February 13, 2025) was a substantial rewrite: it introduced the SWE-ReX execution backend for parallel rollouts, tool bundles, retry configuration, LiteLLM-based model routing for nearly any model provider, and a CLI trajectory inspector.
v1.0.1 (February 28, 2025) recorded a state-of-the-art score on SWE-bench Full with Claude 3.7 Sonnet.
v1.1.0 (May 22, 2025) shipped alongside the SWE-smith data pipeline and the open-weights SWE-agent-LM-32B fine-tune.

In parallel, the Princeton team released mini-SWE-agent in July 2025, a deliberate simplification that strips away nearly all of the original tool scaffolding while still scoring competitively on SWE-bench Verified.^[6] The original SWE-agent repository was placed in maintenance-only mode in late 2025, with new development concentrated on mini-SWE-agent and SWE-smith.^[5]

What is the Agent-Computer Interface (ACI)?

The central conceptual contribution of the paper is the Agent-Computer Interface, abbreviated ACI. The authors argue that a language model interacting with a computer is a new kind of end user: its needs differ from those of a human (it has no eyes, no mouse, and degraded performance on long noisy contexts) but also differ from those of a traditional program (it can read prose, follow instructions, and tolerate small natural-language conventions).^[3] An ACI is therefore a thin, deliberately designed shell layer that exposes the operating system and the code repository through commands optimized for an LLM, rather than reusing the standard human-oriented Unix toolchain.^[3] The paper credits this layer with the bulk of SWE-agent's gains: in a controlled ablation on a 300-instance SWE-bench Lite subset, the full ACI solved 10.7 percentage points more instances than an otherwise identical baseline agent that used only the default Linux shell.^[3]

The ACI thesis can be stated as three design pressures, each derived from the paper's ablations:^[3]

Bounded, structured outputs. Tools should never dump unbounded text into the agent's context. A naive grep -r over a large repository can return thousands of lines and overflow context; the SWE-agent search_dir command instead returns only the filenames containing a match, with the per-file hit count, and caps results at 50 matches.^[10] The same logic applies to file viewing.
Stateful, incremental editing. Edits are described by an explicit line range plus a replacement block, and the tool prints the resulting file window so the agent immediately sees the consequence of its action. A syntactic linter rejects an edit and produces an error message if the result will not parse, so the model receives fast, scoped feedback rather than learning of failure only at test time.^[3]^[10]
Abstract navigation primitives. Reading code is decoupled from scrolling. The agent keeps an implicit "open file" and "current window" state and moves through it with named commands (goto, scroll_up, scroll_down), which keeps every observation small and uniform in shape.^[10]

The paper distills these into four guidelines: actions should be simple and easy to understand, actions should be compact and efficient, environment feedback should be informative but concise, and guardrails should mitigate error propagation and hasten recovery.^[3] It frames the ACI as analogous to the role of human-computer interface design in the history of software: just as visual UIs replaced raw command lines for human users, an LLM-facing interface must be invented from scratch rather than inherited from human conventions.^[3]

Tools

SWE-agent's default toolset is intentionally small. The exact list has evolved across releases, but the core tools described in the paper and the v1.0 documentation include the following.^[3]^[10]

Tool	Purpose
`open`	Open a file at an optional line number and load it into the file viewer window.
`goto`	Jump the file viewer window to a specific line of the currently open file.
`scroll_up` / `scroll_down`	Move the file viewer window by one page (default 100 lines) in either direction.
`edit`	Replace a specified line range in the open file with new content; output is linted, and syntactic failures are rejected with a diagnostic.
`search_file`	Search inside the currently open file for a substring; output is bounded.
`search_dir`	Search the project directory for a substring and return the matching filenames (with hit counts) rather than full lines.
`find_file`	Locate a file by name pattern within the project tree.
`submit`	End the episode and present the working tree as the candidate patch for grading.

The file viewer presents at most 100 lines per turn, annotated with line numbers and an ellipsis marker for hidden ranges; both the size and the line numbering are findings from the paper's ablations, which showed that very large viewer windows degrade performance because they crowd out reasoning tokens.^[3]^[10] When a command produces no output, SWE-agent returns the explicit string "Your command ran successfully and did not produce any output." rather than an empty turn, again so that the model receives a clean, structured observation.^[10]

In v0.7, the EnIGMA release added a class of Interactive Agent Tools (IATs): wrappers around long-lived utilities such as a debugger (gdb) that the agent can drive in a turn-by-turn manner from the same loop, plus output summarizers that compress very long traces before they re-enter context.^[11]

How does SWE-agent work?

At the implementation level SWE-agent is a fairly direct agent loop. A single rollout proceeds as follows:^[3]^[10]

The agent receives a system prompt describing the available tools, the working directory, and the task (typically the GitHub issue title and body).
The language model emits a single action per turn: a "thought" in natural language followed by exactly one tool call. The tool grammar is enforced by a regex parser in the harness, not by JSON function-calling schemas, although v1.0 added optional support for native tool use APIs through LiteLLM.^[2]
The harness executes the tool against a sandboxed copy of the repository (a Docker container with the project's test dependencies installed), captures stdout/stderr, applies any output bounding, and returns a structured observation.
The observation is appended to the conversation and the model is queried again. The loop continues until the model issues submit, the cost or turn budget is exhausted, or the harness encounters an error.

The full prompt template, tool definitions, and per-task setup live in a single YAML configuration file, which the project emphasizes as a deliberate research-friendly choice: a new researcher can fork the YAML, change a tool description or a window size, and rerun the benchmark without touching the Python core.^[1] In v1.0 the execution backend was replaced with SWE-ReX, a separate parallelization layer that allows hundreds of rollouts to run concurrently across containers or even remote workers; this turned what had been a 24-hour serial benchmark run on a laptop into a much faster sweep.^[2]

Action format

The default action format is a thin wrapper around shell-style commands. The model is instructed to emit a fenced block containing exactly one command, optionally preceded by a natural-language thought, for example:

THOUGHT: I should locate the file that defines the affected function.
ACTION:
search_dir "compute_loss"

The harness's regex parser extracts the action, refuses anything containing more than one command, and returns a syntax-error observation if parsing fails. This deliberately heavy-handed parser is a frequent source of model frustration but, in the authors' view, is what makes long rollouts stable; without it, models routinely emit multi-command actions that interleave reasoning with execution and cause non-deterministic environment changes.^[3]

In v1.0, an alternative tool-bundle interface was added, in which each tool is registered with the LLM provider's native function calling schema. This works directly with providers that support strict JSON schemas (such as OpenAI's structured outputs) and falls back to a YAML-described tool grammar elsewhere through LiteLLM.^[2]

Sandboxing

Each task is executed inside a Docker image that has been pre-built with the target repository at the appropriate commit and with the project's test dependencies installed. SWE-bench provides those images for its task instances, and SWE-agent re-uses them directly. The agent's commands run inside the container; observations are returned as plain text. This isolation is what makes large parallel sweeps tractable: hundreds of containers can run in parallel without interfering with each other, and a misbehaving rollout cannot corrupt the host filesystem.^[1]^[10]

Cost accounting

Every LLM call is logged with token counts and dollar costs (using LiteLLM-reported per-provider prices), so a benchmark run produces a single CSV per task with all costs, latencies, and turn counts. This bookkeeping is what makes the published numbers reproducible: a researcher can not only repeat the benchmark but also confirm that they spent a comparable amount of money doing so.^[2]

How does SWE-agent perform on SWE-bench?

When SWE-agent landed on April 2, 2024, the public SWE-bench leaderboard was effectively bare: the only entries were the unagentic baselines from the original SWE-bench paper and a handful of internal results.^[7] SWE-agent's 12.47% resolve rate on the full set (the 2,294-issue SWE-bench test split) with GPT-4 Turbo vaulted it to the top of the open leaderboard, sitting only a small margin below Cognition AI's closed Devin result of 13.86% (which was measured on a random 25% subset rather than the full test set, complicating direct comparison).^[8]^[9] The paper reports this as a substantial jump over the previous best non-interactive resolve rate of 3.8%, and notes that the agent is portable across models: SWE-agent with Claude 3 Opus solved 10.5% of the benchmark tasks using the same harness.^[3]

The headline results from the original paper are summarized below:^[3]

Metric	Score	Base model
SWE-bench full (pass@1)	12.47% (rounded to 12.5%)	GPT-4 Turbo
SWE-bench full (pass@1), portability check	10.5%	Claude 3 Opus
Previous best non-interactive baseline	3.8%	RAG pipeline
ACI vs. plain Linux shell (SWE-bench Lite, 300 instances)	+10.7 percentage points	held constant
HumanEvalFix (pass@1)	87.7%	GPT-4 Turbo

The 87.7% pass@1 on HumanEvalFix, a smaller bug-fixing benchmark, demonstrates that the agent's gains transferred to a non-SWE-bench setting.^[3] Ablations in the paper, holding the base model constant, showed that removing the file viewer linter or expanding the viewer window from 100 lines to a much larger size each cost several percentage points on SWE-bench Lite, providing direct evidence that the gains came from interface design rather than from any change in the underlying language model.^[3]

Beyond GPT-4 Turbo, the original paper reports baseline numbers for Claude 3 Opus and a small open-weights model, in each case using the same harness and the same prompts. The general pattern was monotonic: stronger base models extracted more value from the ACI, but the ACI itself helped every model relative to a raw shell baseline, including weaker ones. This was the empirical basis for the paper's central claim that interface design is complementary to model scale rather than a substitute for it.^[3]

Over the following months SWE-agent itself was overtaken on the leaderboard, but mostly by systems that inherited its template:

In late 2024 Claude 3.5 Sonnet dropped, and re-runs of SWE-agent with the newer base model substantially raised the headline number.
In February 2025, SWE-agent 1.0 paired with Claude 3.7 Sonnet retook the state of the art on SWE-bench Verified on February 25, and on SWE-bench Full on February 28.^[2]
By 2026, the absolute lead on SWE-bench Verified had moved to frontier-model entries from OpenAI and Anthropic running through various agent harnesses, with figures in the high 80s; the original SWE-agent scaffold remained a competitive open-source baseline rather than the outright leader.^[12]

The longer-term impact is structural rather than numeric. SWE-agent provided the first openly reproducible recipe for the "agent + SWE-bench" loop, and most subsequent SWE-bench submitters (whether academic or industrial) have either forked SWE-agent's harness directly, copied its tool design, or built a near-equivalent ACI of their own. A 2025 survey of the SWE-bench leaderboards classified the majority of submitted systems as either SWE-agent forks or close architectural cousins.^[13]

Variants

SWE-agent EnIGMA

EnIGMA, short for Enhanced Interactive Generative Model Agent, is a port of SWE-agent to offensive cybersecurity Capture-the-Flag (CTF) challenges, released as SWE-agent v0.7 in September 2024.^[11] The accompanying paper, "EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities" (arXiv:2409.16165), was led by researchers at Tel Aviv University, NYU Tandon, and Princeton, and was accepted at ICML 2025.^[14]

EnIGMA's main technical contribution is the Interactive Agent Tool concept: a wrapper that lets the agent drive a long-running interactive program (a debugger, a network client) over multiple turns while preserving its state, rather than restarting the program each turn. The authors also identify and name a failure mode they call "soliloquizing," in which models hallucinate tool outputs without actually invoking the tool, and propose interface-level mitigations.^[14] On the NYU CTF benchmark and Intercode-CTF, EnIGMA reportedly solves more than three times as many challenges as the previous best agent, with a 13.5% pass rate across roughly 200 NYU CTF challenges and state-of-the-art results across the four benchmarks (totalling 390 challenges) evaluated in the paper.^[14]

SWE-agent-LM

SWE-agent-LM is an open-weights fine-tune of Qwen 2.5 Coder Instruct 32B that is trained to drive the SWE-agent harness. It was released alongside the SWE-smith data pipeline in v1.1.0 (May 22, 2025).^[2]^[5]

SWE-smith itself ("SWE-smith: Scaling Data for Software Engineering Agents," arXiv:2504.21798) is a paper by John Yang and collaborators that describes a recipe for synthesizing software engineering training tasks at scale from arbitrary Python codebases.^[5] Using the pipeline, the authors generated 50,000 task instances from 128 GitHub repositories, an order of magnitude larger than prior datasets, then collected about 5,000 expert agent trajectories with Claude 3.7 Sonnet running inside SWE-agent. Fine-tuning Qwen 2.5 Coder 32B on those trajectories yielded SWE-agent-LM-32B, which achieves 40.2% pass@1 on SWE-bench Verified and was, at release, the new state of the art for open-weight models.^[5] The work was accepted as a NeurIPS 2025 Datasets and Benchmarks spotlight.^[5]

mini-SWE-agent

mini-SWE-agent is a deliberately minimal rewrite released in July 2025.^[6] The project's stated thesis is that most of SWE-agent's tool scaffolding can be removed without large performance losses, provided the base model is strong enough to drive a raw bash shell on its own. The agent class is roughly 100 lines of Python; it has no custom file viewer, no custom edit command, no built-in linter feedback loop, and no stateful shell; each tool call is an independent subprocess.run invocation, and history is a flat message list.^[6]

Despite this, mini-SWE-agent has reported a score above 74% on SWE-bench Verified with a strong base model, and it has displaced the full SWE-agent as the team's recommended starting point for new users.^[6] The repository's README describes it as "Radically simple, no huge configs, no giant monorepo, but scores >74% on SWE-bench verified," and lists adoption by Meta, NVIDIA, Essential AI, IBM, Nebius, Anyscale, Princeton University, and Stanford University as scaffolds in their own SWE-bench evaluations.^[6]

The mini-SWE-agent design reflects a broader shift in the agent research community sometimes referred to as the "minimum viable agent" or "scaffolding minimalism" trend. The argument runs: as frontier models have absorbed more shell, editing, and code-reading capability into their pretraining and post-training distributions, the marginal value of a hand-built ACI has decreased; what is left is essentially a thin parser around a single shell call. The original SWE-agent paper, by contrast, was written in an era when GPT-4 could not reliably drive a raw shell over thirty turns; the ACI was, in effect, a crutch that made the agent loop possible at all. Mini-SWE-agent is a deliberate test of how much of that crutch can be removed.^[6]

Significance

SWE-agent is significant in three connected ways.

First, it provided the first openly reproducible scaffold for the SWE-bench task, which is widely treated as the canonical benchmark for "real software engineering" by language model agents. Before SWE-agent, SWE-bench had no public agentic baseline; afterward, every major frontier model has a SWE-bench number, and almost all of those numbers were obtained inside SWE-agent or a clear derivative.^[13]

Second, it crystallized the Agent-Computer Interface concept that has since spread well beyond software engineering. The same design pressures (bounded outputs, abstract navigation, structured edit feedback, fast in-loop validation) reappear in agents for the web, for spreadsheets, for computer use, and for cybersecurity. The EnIGMA paper makes the generalization explicit by transferring the ACI to a different domain and getting matching gains.^[14]

Third, it served as a research vehicle. The same Princeton group used SWE-agent as the harness for SWE-bench Verified evaluations, for the SWE-smith data pipeline, for SWE-agent-LM, and for mini-SWE-agent; the rapid cadence of follow-up papers and tools all sharing a code base reflects the project's deliberate research-friendly design.^[5]^[6]

A fourth, more sociological consequence is that SWE-agent normalized the practice of evaluating coding agents on a held-out, executable test set rather than on toy puzzles. The pre-SWE-bench evaluation tradition (HumanEval, MBPP, APPS) measured a model's ability to write a small self-contained function; the post-SWE-agent tradition treats coding as a multi-turn, repository-scale activity in which the model must read, edit, and test code it did not write. This shift in evaluation methodology has, in turn, shifted what major labs report when they release a model. By 2026, almost every frontier-model launch (from OpenAI, Anthropic, Google DeepMind, xAI, and others) features a SWE-bench Verified number among its headline benchmarks, where in 2023 such a number would have been unusual.^[12]

How does SWE-agent compare with other coding agents?

SWE-agent sits in a family of LLM-driven coding and task agents, which can be roughly partitioned along two axes: how much custom tool scaffolding they layer on top of the base model, and whether they target benchmarks, developer workflows, or end-user product use. Several of the systems most often compared with SWE-agent are summarized below; this list is illustrative rather than exhaustive.

System	Type	Tool philosophy	Source	Primary benchmark posture
SWE-agent	Research framework	Custom ACI: file viewer, scoped search, linted edit	Open (MIT)^[1]	Native SWE-bench baseline
mini-SWE-agent	Research framework	Minimalist: shell + LM only^[6]	Open (MIT)	SWE-bench Verified
OpenHands (formerly OpenDevin)	Open developer platform	Sandboxed shell + web browser + code tools	Open	SWE-bench leader among open systems^[15]
Devin (Cognition AI)	Commercial product	Long-horizon planner + browser + shell	Closed	SWE-bench evaluations + product^[8]
Aider	Developer CLI	Git-aware edits with repo map	Open	Aider Polyglot, occasional SWE-bench
Claude Code	Commercial CLI (Anthropic)	Native tool use over local FS + bash	Closed (CLI)	Internal SWE-bench numbers
OpenAI Codex CLI	Commercial CLI (OpenAI)	Native function calling over local FS + bash	Closed	Internal SWE-bench numbers
Cline	VSCode agent	IDE-integrated edit/run loop	Open	Aider Polyglot, anecdotal
Cursor	IDE	Inline + agent mode with custom tools	Closed	Anecdotal
Auto-GPT	Early autonomous agent	General-purpose tools, web + memory	Open	Pre-SWE-bench, no canonical number
BabyAGI	Early task agent	Task list + memory loop	Open	Pre-SWE-bench, no canonical number

The most direct contrast is with Auto-GPT and BabyAGI from 2023, which popularized the basic agent loop pattern but operated with general-purpose tools and no benchmark scaffold; SWE-agent kept the loop but discarded the general-purpose tools in favor of a specialized ACI, and bound the system to a hard benchmark.^[3] The most direct contrast in the other direction is with Devin, whose closed nature triggered the demand that SWE-agent met, and with OpenHands, a contemporaneous open project that pursued a broader developer-platform vision rather than a research-baseline vision.^[15]

The minimalist branch of the family, exemplified by mini-SWE-agent, Aider, and to some extent Claude Code and Codex CLI, deliberately reverses the ACI design pressure: as base models have gotten stronger, the value of a custom tool layer has shrunk relative to giving the model a raw shell and a clear prompt. SWE-agent itself acknowledges this trend by recommending mini-SWE-agent as the default entry point for new users.^[5]^[6]

The IDE-embedded branch (Cursor, Cline, GitHub Copilot in agent mode) shares SWE-agent's tool-call loop but adds two extra constraints: the user is in the loop, providing approval or rejection of each diff; and the agent's tools are tied to editor primitives (open file in tab, run linter, run debugger) rather than to a generic shell. Empirically, the two approaches are converging: IDE agents have grown background autonomous modes, and benchmark agents like SWE-agent have grown interactive trajectory viewers. The differences between them are increasingly about who the user is (a developer at a keyboard vs. a benchmark harness) rather than about underlying mechanism.

Within the academic agent literature, SWE-agent occupies a particular slot: it is the canonical example used to explain what an Agent-Computer Interface is. Subsequent papers proposing new ACIs for other domains (the EnIGMA paper for cybersecurity, the SWE-smith paper for data generation, various GUI-agent papers) cite SWE-agent as the prior art that defined the design space. This is partly a function of timing (it was the first widely-distributed open ACI) and partly a function of the paper's clarity in stating the ACI thesis as a generalizable design principle.^[13]^[14]

Limitations

The paper and subsequent commentary identify a number of limitations that apply both to SWE-agent specifically and to the SWE-bench-driven agent research program more broadly.

Cost and latency. Each rollout makes many LLM calls (often 30 or more turns at full context); a full SWE-bench Verified sweep with a frontier model can run into the thousands of dollars and take hours, even with SWE-ReX-style parallelism.^[2]
Benchmark overfitting. Because the entire community has converged on SWE-bench, there is concern that systems are being co-tuned to its idiosyncrasies (Python only, test-suite-graded, repository-snapshotted) and that headline numbers exaggerate real-world coding ability. The Princeton group itself released SWE-bench Verified and SWE-bench Pro in part to address these concerns.^[12]
Tool brittleness. Early SWE-agent versions silently failed when the model produced a slightly malformed tool call; later releases added retries and tool-bundle validation, but the regex-parsed action format remains less robust than native function calling for some model providers.^[2]
Domain narrowness. The ACI was designed for Python repositories with test suites and pull requests; transferring it to other languages, to projects without tests, or to truly novel codebases is non-trivial. EnIGMA's extension to CTF challenges required a separate research effort, not a configuration change.^[14]
"Soliloquizing" and hallucinated observations. Identified in the EnIGMA paper, this failure mode (where the model invents an observation that was never produced by a tool) can also occur in plain SWE-agent runs; current mitigations rely on prompt design and stricter parsers rather than a fundamental fix.^[14]

SWE-bench, the benchmark the system was designed to attack.
SWE-bench Verified, the human-vetted subset.
Agentic workflow, the broader pattern of iterative LLM-driven action loops.
Tool use and Function calling, the underlying capability SWE-agent exploits.
ReAct (prompting), an earlier "thought + action" pattern co-authored by SWE-agent co-author Shunyu Yao.
Minimum Viable Agent, the design philosophy underlying mini-SWE-agent.

References

SWE-agent project, "SWE-agent: GitHub repository", GitHub, 2026-01-01. https://github.com/SWE-agent/SWE-agent. Accessed 2026-05-20. ↩
SWE-agent project, "Releases", GitHub, 2025-05-22. https://github.com/SWE-agent/SWE-agent/releases. Accessed 2026-05-20. ↩
John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, Ofir Press, "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering", arXiv, 2024-05-06. https://arxiv.org/abs/2405.15793. Accessed 2026-05-20. ↩
John Yang et al., "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering", Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 2024-12-01. https://proceedings.neurips.cc/paper_files/paper/2024/file/5a7c947568c1b1328ccc5230172e1e7c-Paper-Conference.pdf. Accessed 2026-05-20. ↩
John Yang et al., "SWE-smith: Scaling Data for Software Engineering Agents", arXiv, 2025-04-30. https://arxiv.org/abs/2504.21798. Accessed 2026-05-20. ↩
SWE-agent project, "mini-swe-agent: The 100 line AI agent that solves GitHub issues", GitHub, 2025-07-24. https://github.com/SWE-agent/mini-swe-agent. Accessed 2026-05-20. ↩
Carlos E. Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?", arXiv, 2023-10-10. https://arxiv.org/abs/2310.06770. Accessed 2026-05-20. ↩
Tanay Vardhan, "Open-Source AI SWE-Agent Takes on Devin (Better Alternative?)", FavTutor, 2024-04-08. https://favtutor.com/articles/swe-agent-devin-alternative/. Accessed 2026-05-20. ↩
AI Agent Index (MIT Media Lab), "SWE-Agent", aiagentindex.mit.edu, 2024-08-01. https://aiagentindex.mit.edu/swe-agent-2/. Accessed 2026-05-20. ↩
SWE-agent project, "Agent tools", SWE-agent documentation v1.0, 2025-02-13. https://swe-agent.com/1.0/background/aci/. Accessed 2026-05-20. ↩
SWE-agent project, "Background", SWE-agent documentation v0.7, 2024-09-25. https://swe-agent.com/0.7/background/. Accessed 2026-05-20. ↩
vals.ai, "SWE-bench Verified leaderboard", 2026-04-01. https://www.vals.ai/benchmarks/swebench. Accessed 2026-05-20. ↩
Authors of leaderboard survey, "Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems", arXiv, 2025-06-22. https://arxiv.org/abs/2506.17208. Accessed 2026-05-20. ↩
Talor Abramovich et al., "EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities", arXiv, 2024-09-24. https://arxiv.org/abs/2409.16165. Accessed 2026-05-20. ↩
ToolHalla, "Devin vs OpenHands vs SWE-agent: Top AI Coding Agents 2026", toolhalla.ai, 2026-01-15. https://toolhalla.ai/blog/devin-vs-openhands-vs-swe-agent-2026. Accessed 2026-05-20. ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

2 revisions by 1 contributor · full history

Suggest edit

What links here

Autonomous coding ChatDev Devstral GSO Harness (AI)Multi-SWE-bench SWE-bench Verified τ-bench

What is SWE-agent?

Infobox

History

Origins at Princeton NLP

Paper and NeurIPS 2024

Release cadence

What is the Agent-Computer Interface (ACI)?

Tools

How does SWE-agent work?

Action format

Sandboxing

Cost accounting

How does SWE-agent perform on SWE-bench?

Variants

SWE-agent EnIGMA

SWE-agent-LM

mini-SWE-agent

Significance

How does SWE-agent compare with other coding agents?

Limitations

Related concepts

See also

References

Improve this article

Related Articles

Cline (AI coding agent)

Roo Code

Gemini CLI

OpenHands

opencode (SST)

MetaGPT

What links here

Related Articles

Cline (AI coding agent)

Roo Code

Gemini CLI

OpenHands

opencode (SST)

MetaGPT

What links here