AgentDojo
Last reviewed
Jun 2, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 2,076 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 2, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 2,076 words
Add missing citations, update stale details, or suggest a clearer explanation.
AgentDojo is a dynamic evaluation environment for measuring prompt injection attacks and defenses against tool-using large language model agents. It was introduced in June 2024 by researchers at the Secure, Private, and Reliable Machine Learning Laboratory (SPY Lab) at ETH Zurich, and the accompanying paper was accepted to the NeurIPS 2024 Datasets and Benchmarks Track. Rather than shipping a fixed list of test cases, AgentDojo provides a programmable framework in which realistic agent tasks, adversarial injection goals, attacks, and defenses can each be added and combined, so that the security of an AI agent can be measured jointly with its usefulness.[1][2][3]
AgentDojo targets a specific failure mode of agents that call external tools: when an agent reads data returned by a tool, such as the body of an email, the contents of a web page, or a chat message, that data may contain text planted by an attacker that hijacks the agent into performing actions the user never requested. The benchmark populates four simulated application environments with 97 realistic user tasks and 27 adversarial injection tasks, and from their combination it derives 629 security test cases that probe whether an agent can be steered off course while still being asked to complete legitimate work.[1][2][4]
The authors frame AgentDojo explicitly as "not a static test suite, but rather an extensible environment for designing and evaluating new agent tasks, defenses, and adaptive attacks." This design reflects the observation that security benchmarks become stale quickly: a fixed set of injection strings is easy to overfit to, whereas a framework that lets defenders and attackers add new components can keep pace as both sides improve. The code is released under the ethz-spylab/agentdojo repository, and a results table plus documentation are hosted at agentdojo.spylab.ai.[1][2][5]
A tool-using agent interleaves model reasoning with calls to external functions, then feeds the returned data back into the model's context to decide what to do next. That returned data is frequently untrusted. An email, a calendar invite, a Slack message, or a product review can all originate from a third party, and any of them can carry instructions aimed at the model rather than at the human reader. When the model treats this injected text as a command, the result is a prompt injection attack, and in the agentic setting the consequences extend beyond a misleading answer to concrete actions such as sending money, forwarding private files, or deleting records.[1][2]
Before AgentDojo, much of the prompt-injection literature evaluated attacks on single-turn chat prompts or on narrow, hand-built scenarios, which made it hard to compare attacks and defenses on equal footing or to measure the practical cost a defense imposes on ordinary use. AgentDojo was built to close that gap by giving the field a shared, reproducible environment where the same agent can be scored both on how many benign tasks it completes and on how often an attacker reaches a malicious goal. The work sits within the broader agenda of AI safety and security for autonomous systems, where the central worry is that increasingly capable agents acting on real accounts and APIs will also become higher-value targets.[1][2]
AgentDojo is organized around the idea of a task suite. Each suite is a self-contained application world with its own set of tools, its own user tasks, and its own injection tasks. A user task is a benign instruction a person might genuinely give, paired with a checker that verifies whether the agent accomplished it. An injection task is an adversarial objective an attacker would like to achieve, paired with its own checker that verifies whether the attacker succeeded. The suites are deterministic and run against simulated back ends rather than live services, so results are reproducible.[1][2]
The four suites in the original release are summarized below.[1][4][6]
| Suite | Tools | User tasks | Injection tasks | Example domain |
|---|---|---|---|---|
| Workspace | 24 | 40 | 6 | Email, calendar, and cloud document management [1][6] |
| Slack | 11 | 21 | 5 | Team messaging, web browsing, and file handling [1][6] |
| Travel | 28 | 20 | 7 | Flight, hotel, and restaurant booking [1][6] |
| Banking | 11 | 16 | 9 | Payments and account management [1][6] |
| Total | 70 | 97 | 27 | Four application environments [1][2] |
To turn these into security test cases, AgentDojo takes the cross product of user tasks and injection tasks within each suite: for every legitimate task, the attacker's chosen goal is injected into the data the agent encounters while solving it. Summed across the four suites, this yields the 629 security test cases reported in the paper. A single run can therefore report an agent's benign utility on the 97 user tasks and, separately, its security across the 629 attacked configurations.[1][2][4]
A key design choice is that attacks and defenses are pluggable. An attack in AgentDojo is a function that decides how and where to place the injection task's payload inside the untrusted data, and a defense is a transformation applied to the agent pipeline, for example a filter on tool outputs or an extra instruction in the system prompt. Because both are code, researchers can drop in adaptive attacks that are tuned against a specific defense, which the authors argue is essential for honest security evaluation.[1][2]
The paper evaluates several attack templates from the literature. The strongest baseline, called "Important instructions" (also referred to as the important-message attack), wraps the malicious goal in text that impersonates the system or the user and claims the new instruction takes priority. Other tested variants include an "Ignore previous instructions" style attack, a terse "TODO" injection, a prompt drawn from the InjecAgent benchmark, and a tool-knowledge attack that adds plausible metadata such as the user's or model's name to make the injection more convincing.[1][4]
On the defense side, AgentDojo ships and measures several mitigations, including:[1][4][7]
| Defense | Mechanism |
|---|---|
| Tool filter | An LLM pre-selects which tools may be used before any untrusted data is read, limiting the actions an attacker can trigger [1][4] |
| Prompt sandwiching (repeat user prompt) | The user's instruction is restated after each tool call so the original goal stays salient [1][4] |
| Data delimiters / spotlighting | Tool outputs are wrapped in special markers, with the model told to treat the enclosed text as data, not instructions [1][4] |
| Prompt-injection detector | A transformer classifier based on DeBERTa inspects tool outputs and aborts when it flags an injection [1][4] |
AgentDojo reports three headline metrics that together capture the utility-versus-security trade-off.[1][2][3]
Reporting all three at once is the point of the framework. A defense that drives ASR to near zero is worthless if it also collapses utility, and a model that is highly capable on benign tasks may still be dangerous if its ASR is high. AgentDojo makes both halves of that picture visible in a single evaluation.[1][2][3]
In the original paper, no model tested came close to solving the benchmark cleanly even without any adversary. The abstract states plainly that "current LLMs solve less than 66% of AgentDojo tasks in the absence of any attack," underscoring that tool-using agents were still unreliable on ordinary multi-step work. On the security side, the authors report that "our attacks succeed against the best performing agents in less than 25% of cases," a result they present as encouraging for defenders but far from a solved problem.[1]
The paper also documents a counterintuitive pattern it calls an inverse scaling effect: more capable models tend to be easier to attack, because a weaker model often fails to carry out the attacker's goal correctly even when the injection lands, whereas a stronger model competently executes whatever instruction it follows. Among the models studied, the authors note that GPT-4o was the strongest on benign utility while Claude 3.5 Sonnet was the most resistant to injections.[1][3]
The project's results page reports figures for a range of agents using the important-instructions attack with no defense. Selected entries are shown below; the maintainers caution that this table is not a strict leaderboard, because not every model was run against every attack and defense, so cross-model comparisons should be read with care.[6]
| Model | Benign utility | Utility under attack | Targeted ASR |
|---|---|---|---|
| Claude 3.7 Sonnet | 88.66% | 77.27% | 7.31% [6] |
| Claude 3.5 Sonnet (2024-10-22) | 79.38% | 72.50% | 1.11% [6] |
| GPT-4o | 69.07% | 50.08% | 47.69% [6] |
| Claude 3 Opus | 68.04% | 52.46% | 11.29% [6] |
| Gemini 1.5 Pro | 61.86% | 47.06% | 17.01% [6] |
In the paper's own ablations on GPT-4o, the important-message attack reached a targeted ASR of 57.7% under no defense, and defenses changed the picture substantially. The tool-filter defense was the most effective mitigation studied, cutting GPT-4o's targeted ASR to roughly 6.8% while preserving most utility. The DeBERTa-based prompt-injection detector pushed ASR down to about 8% but degraded benign utility sharply, and the data-delimiter defense offered only partial protection, leaving ASR around 42%. These contrasts illustrate the framework's central message that strong security and high utility are difficult to obtain simultaneously.[1][4]
AgentDojo became one of the reference benchmarks for agent security soon after release. Its combination of realistic multi-step tasks, an extensible attack and defense interface, and a joint utility-security scoring scheme made it a natural target for follow-up work proposing new injections and new guardrails, and it has been adopted into independent evaluation harnesses. The UK AI Safety Institute, for example, includes an AgentDojo implementation in its open-source Inspect Evals suite, and national safety institutes have used the environment as a basis for developing stronger attacks.[5][8]
The benchmark's framing also shaped how the field discusses the problem. By demonstrating the inverse scaling effect and by quantifying the cost that defenses impose on ordinary tasks, AgentDojo reinforced the argument that prompt injection is a systems-level security issue for agents rather than a quirk to be patched at the prompt level, and that progress should be judged on the utility-security frontier rather than on attack resistance alone.[1][3]
AgentDojo's authors and later commentators note several constraints. The environments, while realistic in structure, are simulated and bounded to four domains, so they cannot capture every real-world tool, data source, or attacker capability. The benchmark's checkers verify task and attack success programmatically, which keeps evaluation cheap and deterministic but can miss partial successes or unconventional solution paths. Because the framework is extensible by design, headline numbers depend heavily on which attack and defense are configured, so results are only comparable when the exact attack, defense, and agent scaffold are held fixed. Finally, reported figures drift as new and stronger attacks are added, which is intended behavior for a living benchmark but means any single snapshot of robustness should be treated as provisional rather than final.[1][6][8]