AgentDojo

AI Agents AI Benchmarks AI Safety

10 min read

Updated Jun 9, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 9, 2026

Fact-checked

In review queue

Sources

8 citations

Revision

v2 · 2,081 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

AgentDojo is a dynamic evaluation environment for measuring prompt injection attacks and defenses against tool-using large language model agents. It was introduced in June 2024 by researchers at the Secure, Private, and Reliable Machine Learning Laboratory (SPY Lab) at ETH Zurich, and the accompanying paper was accepted to the NeurIPS 2024 Datasets and Benchmarks Track. Rather than shipping a fixed list of test cases, AgentDojo provides a programmable framework in which realistic agent tasks, adversarial injection goals, attacks, and defenses can each be added and combined, so that the security of an AI agent can be measured jointly with its usefulness.^[1]^[2]^[3]

Overview

AgentDojo targets a specific failure mode of agents that call external tools: when an agent reads data returned by a tool, such as the body of an email, the contents of a web page, or a chat message, that data may contain text planted by an attacker that hijacks the agent into performing actions the user never requested. The benchmark populates four simulated application environments with 97 realistic user tasks and 27 adversarial injection tasks, and from their combination it derives 629 security test cases that probe whether an agent can be steered off course while still being asked to complete legitimate work.^[1]^[2]^[4]

The authors frame AgentDojo explicitly as "not a static test suite, but rather an extensible environment for designing and evaluating new agent tasks, defenses, and adaptive attacks." This design reflects the observation that security benchmarks become stale quickly: a fixed set of injection strings is easy to overfit to, whereas a framework that lets defenders and attackers add new components can keep pace as both sides improve. The code is released under the ethz-spylab/agentdojo repository, and a results table plus documentation are hosted at agentdojo.spylab.ai.^[1]^[2]^[5]

Motivation: prompt injection against agents

A tool-using agent interleaves model reasoning with calls to external functions, then feeds the returned data back into the model's context to decide what to do next. That returned data is frequently untrusted. An email, a calendar invite, a Slack message, or a product review can all originate from a third party, and any of them can carry instructions aimed at the model rather than at the human reader. When the model treats this injected text as a command, the result is a prompt injection attack, and in the agentic setting the consequences extend beyond a misleading answer to concrete actions such as sending money, forwarding private files, or deleting records.^[1]^[2]

Before AgentDojo, much of the prompt-injection literature evaluated attacks on single-turn chat prompts or on narrow, hand-built scenarios, which made it hard to compare attacks and defenses on equal footing or to measure the practical cost a defense imposes on ordinary use. AgentDojo was built to close that gap by giving the field a shared, reproducible environment where the same agent can be scored both on how many benign tasks it completes and on how often an attacker reaches a malicious goal. The work sits within the broader agenda of AI safety and security for autonomous systems, where the central worry is that increasingly capable agents acting on real accounts and APIs will also become higher-value targets.^[1]^[2]

The environment and task design

AgentDojo is organized around the idea of a task suite. Each suite is a self-contained application world with its own set of tools, its own user tasks, and its own injection tasks. A user task is a benign instruction a person might genuinely give, paired with a checker that verifies whether the agent accomplished it. An injection task is an adversarial objective an attacker would like to achieve, paired with its own checker that verifies whether the attacker succeeded. The suites are deterministic and run against simulated back ends rather than live services, so results are reproducible.^[1]^[2]

The four suites in the original release are summarized below.^[1]^[4]^[6]

Suite	Tools	User tasks	Injection tasks	Example domain
Workspace	24	40	6	Email, calendar, and cloud document management ^[1]^[6]
Slack	11	21	5	Team messaging, web browsing, and file handling ^[1]^[6]
Travel	28	20	7	Flight, hotel, and restaurant booking ^[1]^[6]
Banking	11	16	9	Payments and account management ^[1]^[6]
Total	70	97	27	Four application environments ^[1]^[2]

To turn these into security test cases, AgentDojo takes the cross product of user tasks and injection tasks within each suite: for every legitimate task, the attacker's chosen goal is injected into the data the agent encounters while solving it. Summed across the four suites, this yields the 629 security test cases reported in the paper. A single run can therefore report an agent's benign utility on the 97 user tasks and, separately, its security across the 629 attacked configurations.^[1]^[2]^[4]

Attacks and defenses evaluated

A key design choice is that attacks and defenses are pluggable. An attack in AgentDojo is a function that decides how and where to place the injection task's payload inside the untrusted data, and a defense is a transformation applied to the agent pipeline, for example a filter on tool outputs or an extra instruction in the system prompt. Because both are code, researchers can drop in adaptive attacks that are tuned against a specific defense, which the authors argue is essential for honest security evaluation.^[1]^[2]

The paper evaluates several attack templates from the literature. The strongest baseline, called "Important instructions" (also referred to as the important-message attack), wraps the malicious goal in text that impersonates the system or the user and claims the new instruction takes priority. Other tested variants include an "Ignore previous instructions" style attack, a terse "TODO" injection, a prompt drawn from the InjecAgent benchmark, and a tool-knowledge attack that adds plausible metadata such as the user's or model's name to make the injection more convincing.^[1]^[4]

On the defense side, AgentDojo ships and measures several mitigations, including:^[1]^[4]^[7]

Defense	Mechanism
Tool filter	An LLM pre-selects which tools may be used before any untrusted data is read, limiting the actions an attacker can trigger ^[1]^[4]
Prompt sandwiching (repeat user prompt)	The user's instruction is restated after each tool call so the original goal stays salient ^[1]^[4]
Data delimiters / spotlighting	Tool outputs are wrapped in special markers, with the model told to treat the enclosed text as data, not instructions ^[1]^[4]
Prompt-injection detector	A transformer classifier based on DeBERTa inspects tool outputs and aborts when it flags an injection ^[1]^[4]

Metrics

AgentDojo reports three headline metrics that together capture the utility-versus-security trade-off.^[1]^[2]^[3]

Benign utility: the fraction of the 97 user tasks the agent solves when no attack is present. This measures raw usefulness.
Utility under attack: the fraction of security cases in which the agent still completes the legitimate user task while not carrying out the attacker's goal. A defense that breaks normal behavior shows up here as a drop relative to benign utility.
Targeted attack success rate (ASR): the fraction of security cases in which the attacker's specific goal is achieved. Lower is more secure.

Reporting all three at once is the point of the framework. A defense that drives ASR to near zero is worthless if it also collapses utility, and a model that is highly capable on benign tasks may still be dangerous if its ASR is high. AgentDojo makes both halves of that picture visible in a single evaluation.^[1]^[2]^[3]

Notable results

In the original paper, no model tested came close to solving the benchmark cleanly even without any adversary. The abstract states plainly that "current LLMs solve less than 66% of AgentDojo tasks in the absence of any attack," underscoring that tool-using agents were still unreliable on ordinary multi-step work. On the security side, the authors report that "our attacks succeed against the best performing agents in less than 25% of cases," a result they present as encouraging for defenders but far from a solved problem.^[1]

The paper also documents a counterintuitive pattern it calls an inverse scaling effect: more capable models tend to be easier to attack, because a weaker model often fails to carry out the attacker's goal correctly even when the injection lands, whereas a stronger model competently executes whatever instruction it follows. Among the models studied, the authors note that GPT-4o was the strongest on benign utility while Claude 3.5 Sonnet was the most resistant to injections.^[1]^[3]

The project's results page reports figures for a range of agents using the important-instructions attack with no defense. Selected entries are shown below; the maintainers caution that this table is not a strict leaderboard, because not every model was run against every attack and defense, so cross-model comparisons should be read with care.^[6]

Model	Benign utility	Utility under attack	Targeted ASR
Claude 3.7 Sonnet	88.66%	77.27%	7.31% ^[6]
Claude 3.5 Sonnet (2024-10-22)	79.38%	72.50%	1.11% ^[6]
GPT-4o	69.07%	50.08%	47.69% ^[6]
Claude 3 Opus	68.04%	52.46%	11.29% ^[6]
Gemini 1.5 Pro	61.86%	47.06%	17.01% ^[6]

In the paper's own ablations on GPT-4o, the important-message attack reached a targeted ASR of 57.7% under no defense, and defenses changed the picture substantially. The tool-filter defense was the most effective mitigation studied, cutting GPT-4o's targeted ASR to roughly 6.8% while preserving most utility. The DeBERTa-based prompt-injection detector pushed ASR down to about 8% but degraded benign utility sharply, and the data-delimiter defense offered only partial protection, leaving ASR around 42%. These contrasts illustrate the framework's central message that strong security and high utility are difficult to obtain simultaneously.^[1]^[4]

Significance for AI security

AgentDojo became one of the reference benchmarks for agent security soon after release. Its combination of realistic multi-step tasks, an extensible attack and defense interface, and a joint utility-security scoring scheme made it a natural target for follow-up work proposing new injections and new guardrails, and it has been adopted into independent evaluation harnesses. The UK AI Safety Institute, for example, includes an AgentDojo implementation in its open-source Inspect Evals suite, and national safety institutes have used the environment as a basis for developing stronger attacks.^[5]^[8]

The benchmark's framing also shaped how the field discusses the problem. By demonstrating the inverse scaling effect and by quantifying the cost that defenses impose on ordinary tasks, AgentDojo reinforced the argument that prompt injection is a systems-level security issue for agents rather than a quirk to be patched at the prompt level, and that progress should be judged on the utility-security frontier rather than on attack resistance alone.^[1]^[3]

Limitations

AgentDojo's authors and later commentators note several constraints. The environments, while realistic in structure, are simulated and bounded to four domains, so they cannot capture every real-world tool, data source, or attacker capability. The benchmark's checkers verify task and attack success programmatically, which keeps evaluation cheap and deterministic but can miss partial successes or unconventional solution paths. Because the framework is extensible by design, headline numbers depend heavily on which attack and defense are configured, so results are only comparable when the exact attack, defense, and agent scaffold are held fixed. Finally, reported figures drift as new and stronger attacks are added, which is intended behavior for a living benchmark but means any single snapshot of robustness should be treated as provisional rather than final.^[1]^[6]^[8]

References

Debenedetti, Edoardo; Zhang, Jie; Balunović, Mislav; Beurer-Kellner, Luca; Fischer, Marc; Tramèr, Florian. "AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents." arXiv:2406.13352, June 2024. Full text: arxiv.org/html/2406.13352v3. ↩
Debenedetti, Edoardo; et al. "AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents." OpenReview, NeurIPS 2024 Datasets and Benchmarks Track. ↩
Invariant Labs. "AgentDojo: Jointly evaluate security and utility of AI agents." December 11, 2024. ↩
"AgentDojo Benchmark: LLM Security Evaluation." EmergentMind. Accessed June 2026. ↩
ETH Zurich SPY Lab. "ethz-spylab/agentdojo (GitHub repository)." Accessed June 2026. ↩
AgentDojo. "Results." agentdojo.spylab.ai. Accessed June 2026. ↩
AgentDojo. "Documentation and getting started." agentdojo.spylab.ai. Accessed June 2026. ↩
UK AI Safety Institute. "AgentDojo (Inspect Evals)." Accessed June 2026. ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

1 revision by 1 contributors · full history

Suggest edit

What links here

AgentHarm

Overview

Motivation: prompt injection against agents

The environment and task design

Attacks and defenses evaluated

Metrics

Notable results

Significance for AI security

Limitations

See also

References

Improve this article

Related Articles

AgentHarm

Agent benchmark reward hacking

Frontier Model Forum

Humanity's Last Exam

METR

SimpleQA

What links here

Related Articles

AgentHarm

Agent benchmark reward hacking

Frontier Model Forum

Humanity's Last Exam

METR

SimpleQA