BountyBench
Last reviewed: Jun 2, 2026
Sources: 4 citations
Review status: Source-backed
Revision: v1 · 1,817 words
BountyBench is a cybersecurity benchmark from Stanford University that measures the offensive and defensive capabilities of AI agents on real-world bug-bounty tasks, scoring performance in dollars of economic impact rather than by success rate alone [1][2]. It was introduced in May 2025 in the paper "BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems" and evaluates agents on three task types (Detect, Exploit, and Patch) across 25 systems with complex codebases drawn from live bug-bounty programs [1][3]. Each task is anchored to the monetary award a human security researcher actually received for the corresponding vulnerability [1][3]. The benchmark was accepted to the NeurIPS 2025 Datasets and Benchmarks track [2].
BountyBench presents itself as the first framework to capture both offensive and defensive cyber-capabilities of AI agents in evolving real-world systems [1][2]. Where many security benchmarks reduce an agent's performance to a percentage of tasks solved, BountyBench maps each completed task back to the dollar value of the bug bounty it represents, producing an estimate of the economic footprint an agent could have as either an attacker or a defender [1][3].
The benchmark comprises 25 systems, each a real GitHub project that runs a bug-bounty program, and 40 bug bounties, which are vulnerabilities that earned monetary awards ranging from $10 to $30,485 [1][2][3]. Collectively the vulnerabilities cover 9 of the OWASP Top 10 web application security risks [1][2]. For every system the researchers manually reconstructed a working environment, installing packages, standing up servers, and hydrating databases so that agents operate against functioning software rather than static code snapshots [1][2].
BountyBench was produced by the Stanford Center for Research on Foundation Models (CRFM) and was funded by Open Philanthropy [3]. The paper lists 34 authors led by Andy K. Zhang and includes Percy Liang, Dan Boneh, and Dawn Song among its co-authors [2][3]. It is a direct successor to Cybench, an earlier capture-the-flag benchmark from the same group; the authors note that Cybench had already provided an empirical basis for government AI Safety Institute evaluations, and BountyBench extends that line of work by adding defensive patching tasks and the economic-impact framing [3].
The benchmark is framed around the dual-use nature of capable coding agents. A large language model agent that can navigate an unfamiliar codebase, reason about its attack surface, and write working code can function either as a hacking tool for attackers or as a penetration-testing and remediation tool for defenders [3]. As frontier models improve at autonomous software engineering, the authors argue, their potential to materially shift the offense-defense balance in cybersecurity grows, and that shift is poorly captured by abstract puzzle-style evaluations [1][3].
BountyBench is built to measure this concretely. By grounding tasks in real systems with real payouts, it aims to give model developers, evaluators, and policymakers an empirical basis for judging how much an agent could plausibly help discover, weaponize, or fix vulnerabilities, and at what scale of economic consequence [1][3]. The authors position public release as serving transparency, reproducibility, and the evidentiary needs of regulators, while acknowledging the dual-use risk that the same harness could guide attackers [3].
BountyBench decomposes the vulnerability lifecycle into three task types, which together trace a vulnerability from discovery through weaponization to remediation [1][2][3].
| Task | Objective | Information provided |
|---|---|---|
| Detect | Find a previously unknown vulnerability in the system | Minimal; the agent must locate a flaw with little or no guidance |
| Exploit | Weaponize a specific, documented vulnerability to demonstrate impact | Details from the corresponding bug-bounty report |
| Patch | Fix a documented vulnerability without breaking functionality | The vulnerability description; the patch must preserve correct behavior |
The Detect task is the hardest to evaluate, because there is no fixed answer key for an open-ended search. The authors construct a success indicator that is general across vulnerability types and provides localized evaluation, allowing the benchmark to credit an agent for surfacing a genuine flaw rather than only for matching a predefined string [1][2]. Difficulty on Detect is modulated by an information gradient: agents are run with no information, with the Common Weakness Enumeration (CWE) category supplied, and with progressively more detail. Performance is tightly clustered at the no-information and CWE levels and separates more clearly as additional guidance is added, which lets the benchmark distinguish agents that need hints from those that can find vulnerabilities cold [1][3].
The Patch task tests the defensive side of the same vulnerabilities. An agent must modify the codebase so the flaw is closed while the system continues to function, a constraint that rules out trivial fixes that simply disable affected features [1][2].
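The Patch criterion described above can be read as a conjunction of two checks. The following is a minimal sketch of that reading, not the benchmark's actual harness: `patch_succeeds`, `exploit_still_works`, and `functionality_checks` are hypothetical names, and the toy lambdas stand in for whatever exploit replay and functional tests a real evaluation would run.

```python
from typing import Callable, Sequence


def patch_succeeds(
    exploit_still_works: Callable[[], bool],
    functionality_checks: Sequence[Callable[[], bool]],
) -> bool:
    """Return True only if the flaw is closed AND the system still works.

    Both callables are hypothetical stand-ins: the first replays the
    exploit against the patched system, the others probe normal behavior.
    """
    # Condition 1: the vulnerability must no longer be exploitable.
    if exploit_still_works():
        return False
    # Condition 2: every functionality check must still pass, which
    # rejects "fixes" that simply switch the affected feature off.
    return all(check() for check in functionality_checks)


# Toy scenarios (illustrative only, not benchmark data).
real_fix = patch_succeeds(lambda: False, [lambda: True, lambda: True])
feature_disabled = patch_succeeds(lambda: False, [lambda: True, lambda: False])
no_fix = patch_succeeds(lambda: True, [lambda: True])
```

In this sketch only `real_fix` is accepted. `feature_disabled` closes the flaw but fails a functionality check, which is the kind of trivial fix the constraint is meant to exclude.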
The distinguishing feature of BountyBench is that it scores agents in dollars. Each vulnerability carries the actual bounty award paid to the human who reported it, between $10 and $30,485, and an agent that successfully detects or patches a vulnerability is credited with that award [1][2][3]. The benchmark reports, alongside conventional success rate and token cost, a Bounty Total, defined as the sum of the bounty awards of the tasks an agent completes [3][4]. This converts capability into an interpretable estimate of value created or value at risk.
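As an illustration of how the two headline metrics relate, the sketch below computes a success rate and a Bounty Total from a list of task results. The `TaskResult` type, the function names, and the sample awards are assumptions made for this example; only the definition of Bounty Total as the sum of awards for completed tasks comes from the sources.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TaskResult:
    bounty_id: str    # hypothetical identifier, not a real bounty
    award_usd: float  # award paid to the human reporter
    solved: bool      # whether the agent completed the task


def success_rate(results: Sequence[TaskResult]) -> float:
    """Fraction of tasks solved; treats every task as equal."""
    if not results:
        return 0.0
    return sum(r.solved for r in results) / len(results)


def bounty_total(results: Sequence[TaskResult]) -> float:
    """Sum of the awards attached to the tasks the agent solved."""
    return sum(r.award_usd for r in results if r.solved)


# Illustrative values only; these are not results from the benchmark.
example_results = [
    TaskResult("bounty-a", 10.0, True),
    TaskResult("bounty-b", 500.0, False),
    TaskResult("bounty-c", 1500.0, True),
    TaskResult("bounty-d", 30485.0, False),
]

rate = success_rate(example_results)   # 2 of 4 tasks solved
total = bounty_total(example_results)  # 10.0 + 1500.0
```

Two agents with the same success rate can have very different Bounty Totals, depending on which bounties they solve. That difference is the information the dollar metric adds.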
Aggregated across the evaluated agents, the reported dollar totals illustrate the asymmetry the benchmark is designed to expose. On Detect, agents were collectively credited with $5,855 in bounty value with no additional information, rising to $10,275 when the CWE category was supplied [3]. On Patch, they were collectively credited with $47,821, reflecting the much stronger defensive performance of current systems [3]. The best individual results are smaller but still notable: $3,720 for the top Detect agent and $14,422 for a top Patch agent [2].
The 25 systems are real codebases selected from projects that operate public bug-bounty programs, which is what makes the dollar awards available and verifiable [1][3]. Because the environments are reconstructed by hand, each presents the full complexity of a deployed application, including its dependencies, running services, and backing data [1][2]. The 40 vulnerabilities span 9 of the 10 categories in the OWASP Top 10, the widely used taxonomy of web application security risks, giving the benchmark breadth across common real-world weakness classes rather than concentrating on a single bug type [1][2].
Agents are given up to three attempts per task, and results are reported under that budget [1][2][4]. The harness records success rate and token cost per task, and the bounty-weighted dollar total for Detect and Patch [3][4]. For Detect specifically, the information-gradient protocol (no info, CWE, and richer report detail) is used to probe how much scaffolding each agent requires [1][3].
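The attempt budget can be pictured as a bounded retry loop. The sketch below is an assumption-laden illustration rather than the benchmark's harness: `run_with_budget`, `TaskOutcome`, and the stub attempt functions are hypothetical, and the choices to stop at the first success and to sum token usage across attempts are assumptions not confirmed by the sources.

```python
from typing import Callable, NamedTuple, Tuple

MAX_ATTEMPTS = 3  # from the article: up to three attempts per task


class TaskOutcome(NamedTuple):
    solved: bool
    attempts_used: int
    tokens_used: int


def run_with_budget(
    attempt: Callable[[int], Tuple[bool, int]],
    max_attempts: int = MAX_ATTEMPTS,
) -> TaskOutcome:
    """Call `attempt` until it succeeds or the budget is exhausted.

    `attempt` is a hypothetical callable that takes the 1-based attempt
    number and returns (solved, tokens_spent_in_this_attempt).
    """
    tokens_used = 0
    for attempt_number in range(1, max_attempts + 1):
        solved, tokens = attempt(attempt_number)
        tokens_used += tokens  # assumption: cost accumulates across attempts
        if solved:
            # Assumption: the first success ends the task.
            return TaskOutcome(True, attempt_number, tokens_used)
    return TaskOutcome(False, max_attempts, tokens_used)


# Stub agents for illustration; each attempt "costs" 1000 tokens.
def succeeds_on_second_try(attempt_number: int) -> Tuple[bool, int]:
    return attempt_number == 2, 1000


def never_succeeds(attempt_number: int) -> Tuple[bool, int]:
    return False, 1000


second_try = run_with_budget(succeeds_on_second_try)
exhausted = run_with_budget(never_succeeds)
```

Under these assumptions a task solved on the second attempt counts as a success at roughly twice the token cost of a first-attempt success. This is one reason to report token cost alongside success rate.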
The evaluated systems include both agentic command-line coding tools and custom agents built by the authors around frontier models. The lineup reported across the paper and project materials comprises Claude Code, OpenAI Codex CLI (run with o3-high and o4-mini), and custom agents wrapping o3-high, GPT-4.1, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet Thinking, Qwen3 235B A22B, Llama 4 Maverick, and DeepSeek-R1 [1][2]. These cover models from OpenAI, Anthropic, Google DeepMind (Gemini), Alibaba (Qwen), Meta (Llama), and DeepSeek.
The headline finding is a clear offense-defense gap. The strongest agentic coding tools are markedly better at patching vulnerabilities than at exploiting them, whereas the custom agents are more balanced between offense and defense [1][2]. Reported figures for selected agents are summarized below. Percentages are success rates with up to three attempts per task, and dollar figures are bounty-weighted totals where stated [1][2][3].
| Agent | Exploit success | Patch success | Notes |
|---|---|---|---|
| OpenAI Codex CLI (o3-high) | 47.5% | 90% | Defense-leaning; strong patcher [2] |
| OpenAI Codex CLI (o4-mini) | 32.5% | 90% | Top Patch result, mapping to $14,422 [2] |
| Claude Code | 57.5% | 87.5% | Defense-leaning agentic tool [2] |
| Custom agent: Claude 3.7 Sonnet Thinking | 67.5% | At most ~60% (top of the custom-agent range) | Best reported Exploit success [2] |
| Custom agents (range) | 40 to 67.5% | 45 to 60% | Relatively balanced offense and defense [1][3] |
On the Detect task, which is the most demanding, the best single result reported is the Codex CLI agent running o3-high at 12.5% success, mapping to $3,720 in bounty value [2]. Across the field, Detect success rates are low and heavily clustered until extra information is supplied, underscoring that autonomously discovering novel vulnerabilities remains substantially harder for current agents than exploiting or patching known ones [1][3].
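The reported percentages can be translated into task counts. The sketch below assumes that each rate is taken over all 40 bounties, which the sources do not state explicitly for every task type. `TOTAL_BOUNTIES` and `solved_count` are names introduced for this example. Exact rational arithmetic is used so that a rate that does not correspond to a whole number of tasks raises an error instead of being silently rounded.

```python
from fractions import Fraction

TOTAL_BOUNTIES = 40  # assumption: each rate is taken over all 40 bounties


def solved_count(percent: str, total: int = TOTAL_BOUNTIES) -> int:
    """Convert a success rate such as "12.5" (percent) to a task count."""
    count = Fraction(percent) * total / 100
    if count.denominator != 1:
        raise ValueError(f"{percent}% of {total} is not a whole number of tasks")
    return count.numerator


# Success rates quoted in the article, as percent strings.
reported_rates = ["12.5", "32.5", "47.5", "57.5", "67.5", "87.5", "90"]
counts = {rate: solved_count(rate) for rate in reported_rates}
# counts == {"12.5": 5, "32.5": 13, "47.5": 19, "57.5": 23,
#            "67.5": 27, "87.5": 35, "90": 36}
```

Under this assumption the best Detect result corresponds to 5 of 40 bounties and a 90% Patch rate to 36 of 40. Every quoted rate resolves to a whole number of tasks, which is consistent with a 40-task denominator but does not confirm it.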
BountyBench advances AI security evaluation in two ways. First, it grounds agent capability in real, running systems with documented monetary stakes, moving beyond synthetic capture-the-flag puzzles toward conditions that resemble actual security work [1][3]. Second, its dollar-denominated scoring gives evaluators and policymakers a tangible unit for reasoning about impact, complementing the abstract success rates that dominate other benchmarks [1][2].
For AI safety and governance, the benchmark's most policy-relevant result is the present asymmetry: current agents are considerably more effective at defending and patching than at autonomously finding and weaponizing new vulnerabilities, which is, for now, a reassuring balance for defenders [1][3]. The authors are explicit about the dual-use tension in releasing such a harness and justify publication on grounds of transparency, reproducibility, and the evidentiary needs of regulators, echoing the role its predecessor Cybench played in informing government evaluation bodies [3]. Within the broader landscape of agentic coding and security benchmarks, BountyBench complements code-focused suites such as SWE-bench by targeting the adversarial dimension of software work rather than ordinary issue resolution [1].
The benchmark's realism comes at the cost of scale. Because every environment is reconstructed by hand, the suite is limited to 25 systems and 40 bounties, a small set relative to the diversity of real-world software, and expanding it requires substantial manual effort [1][2]. The dollar figures are inherited from historical bounty payouts, which vary widely with program generosity and disclosure timing and are therefore an approximate rather than precise measure of economic impact [1][3]. Detect performance in particular is sensitive to how much information the harness supplies, so reported success on that task depends on the chosen information regime [1][3]. As with any capability evaluation, results reflect the specific agents, scaffolds, and attempt budget tested and may shift quickly as models and agent frameworks improve.