Cybench
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,152 words
Cybench (short for Cybersecurity benchmark) is an open-source evaluation framework for measuring the cybersecurity capabilities and risks of large language model agents. It comprises forty professional-level capture-the-flag (CTF) challenges drawn from four high-profile cybersecurity competitions, seventeen of which have been further decomposed into intermediate subtasks for fine-grained scoring. The benchmark was released in August 2024 by a team of researchers based primarily at Stanford University, with additional contributors from UC Berkeley and the wider security community, and was subsequently accepted to the Thirteenth International Conference on Learning Representations (ICLR) in 2025.[^1][^2]
The framework was designed to address what its authors describe as a gap between earlier CTF-style benchmarks aimed at educational or amateur audiences (such as InterCode-CTF and NYU CTF Bench) and the kinds of professional-grade challenges used by industry practitioners. Each Cybench task is scaffolded so that an autonomous agent runs inside a cybersecurity sandbox -- a Kali Linux Docker container with full shell access -- and must locate a hidden flag string by exploiting vulnerabilities across web applications, cryptographic primitives, binary executables, network protocols, and forensic artefacts.[^3]
Since its release, Cybench has become one of the most widely cited cyber-offense benchmarks for frontier language models. It is the only open-source cyber benchmark included in the joint pre-deployment tests run by the UK AISI and US AISI on Anthropic's Claude 3.5 Sonnet and OpenAI's o1, and appears as a recurring evaluation in the system cards of Anthropic, OpenAI, Google DeepMind, xAI, and Amazon frontier models.[^4][^5][^6] Its widespread use has also brought it close to saturation, with vendor-reported scores climbing from around 17.5 percent in the original 2024 study to near-ceiling figures in 2025-2026 system cards, prompting a wave of larger successor benchmarks.
| Attribute | Detail |
|---|---|
| Full name | Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models |
| Initial release | 15 August 2024 (arXiv preprint v1) |
| Publication venue | ICLR 2025 |
| Lead author | Andy K. Zhang (Stanford University) |
| Senior authors | Dan Boneh, Daniel E. Ho, Percy Liang |
| Total authors | 27 |
| arXiv identifier | 2408.08926 |
| Code repository | github.com/andyzorigin/cybench |
| Project website | cybench.github.io |
| License | Apache 2.0 |
| Total tasks | 40 |
| Subtask-equipped tasks | 17 |
| Source competitions | HackTheBox Cyber Apocalypse 2024, SekaiCTF 2022 and 2023, Glacier 2023, HKCert 2023 |
| Categories | Cryptography, web exploitation, reverse engineering, binary exploitation (pwn), forensics, miscellaneous |
| Human first-solve range | ~2 minutes to ~24 hours 54 minutes |
| Execution environment | Kali Linux Docker container |
| Recognition | First Prize, SafeBench (Center for AI Safety, 2024) |
By mid-2024 it had become clear that frontier language models, when wrapped in tool-using agent scaffolds, could carry out non-trivial chains of programming and shell actions. Empirical work on offensive cyber capability lagged this trend: existing benchmarks were largely composed of beginner-level challenges. InterCode-CTF, the most widely used predecessor, was built from picoCTF challenges intended for high-school competitors and had an average human solve time of roughly three and a half minutes. NYU CTF Bench, released in June 2024, broadened coverage to university-level CSAW tasks but still concentrated on tasks substantially below the difficulty seen at flagship competitions.[^7]
The Cybench team -- centred at the Stanford Center for Research on Foundation Models (CRFM) and the Stanford RegLab -- argued that a credible cyber-risk evaluation should target professional-level competitions whose first-blood solve times by elite human teams already span minutes to days, since such tasks are closer to the difficulty of real-world penetration testing and vulnerability research. They further argued that purely binary success-or-failure scoring discards too much information at the upper end of the difficulty curve, where most models still fail outright; subtask-level scoring would let evaluators track partial progress and detect emerging capabilities before models saturate end-to-end performance.[^1][^2]
A secondary motivation, repeatedly emphasised in the paper, was the dual-use nature of automated offensive cyber capability. The authors framed Cybench as a tool for policymakers, frontier labs, and external evaluators tracking how quickly autonomous exploit-finding ability is developing, rather than as a leaderboard for hackers. This framing aligned the benchmark with the cyber-risk threshold work being conducted under Anthropic's Responsible Scaling Policy, OpenAI's Preparedness Framework, and Google DeepMind's Frontier Safety Framework.
The forty tasks in Cybench are drawn from four competitions selected for being recent, having strong public reputations, and supplying challenges across the full standard CTF category set.[^2]
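Each task is tagged with one of six standard CTF category labels: cryptography, web exploitation, reverse engineering, binary exploitation (pwn), forensics, and miscellaneous.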
The Cybench paper emphasises first-blood time -- the time elapsed between the start of a competition and the first successful flag submission by any human team -- as an objective, externally validated proxy for difficulty. Across the forty tasks, first-blood times span roughly two minutes to twenty-four hours and fifty-four minutes, a ratio of more than 700x. The authors show that first-blood time is a strong predictor of whether language-model agents can solve a task without help: in the initial evaluation, no model succeeded on any task whose first-blood time exceeded eleven minutes when run without subtask guidance.[^2]
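The spread quoted above is easy to check by hand. The sketch below converts the two endpoint first-blood times to a ratio and shows how the eleven-minute threshold could be applied as a filter. The task names and the 52-minute entry are hypothetical placeholders, not entries taken from the benchmark.

```python
from datetime import timedelta

# Endpoints of the first-blood range quoted in the text.
FASTEST = timedelta(minutes=2)
SLOWEST = timedelta(hours=24, minutes=54)

# The unguided threshold reported in the initial evaluation.
WALL = timedelta(minutes=11)


def ratio(slow: timedelta, fast: timedelta) -> float:
    """Return how many times longer `slow` is than `fast`."""
    return slow / fast


def beyond_wall(first_blood: dict[str, timedelta]) -> list[str]:
    """Names of tasks whose first-blood time exceeds the wall, sorted."""
    return sorted(name for name, t in first_blood.items() if t > WALL)


# Hypothetical task names; only the two endpoint times come from the text.
example = {
    "task_a": FASTEST,
    "task_b": timedelta(minutes=52),
    "task_c": SLOWEST,
}

print(ratio(SLOWEST, FASTEST))  # 747.0
print(beyond_wall(example))     # ['task_b', 'task_c']
```

Twenty-four hours and fifty-four minutes is 1,494 minutes, so the ratio to two minutes is 747, consistent with "more than 700x".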
Seventeen of the forty tasks are augmented with subtasks -- a sequence of intermediate questions whose answers represent natural sub-goals on the path to the final flag. For instance, a web exploitation task might be broken down into "What URL parameter is unsanitised?", "What payload bypasses the input filter?", and "What is the final flag?". Each subtask has an answer key against which the agent's response is automatically scored. Subtasks were authored manually by the Cybench team, often in consultation with the original challenge authors, and serve two purposes: they allow finer-grained measurement of partial progress, and they support a "subtask-guided" evaluation mode in which the agent is given the previous subtask's correct answer before attempting the next step.[^2]
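One way to picture the subtask-guided mode is as a loop that asks each question in order and reveals the correct answer before moving on. The following is a minimal sketch under stated assumptions: the names `Subtask`, `is_correct`, and `run_guided` are hypothetical, the answers in the example are invented, and the trimmed exact-match comparison is an assumption rather than the benchmark's documented scoring rule.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Subtask:
    question: str
    answer: str  # entry in the answer key


def is_correct(response: str, answer: str) -> bool:
    # Assumption: exact match after trimming whitespace.
    return response.strip() == answer.strip()


def run_guided(subtasks: list[Subtask],
               agent: Callable[[str, list[str]], str]) -> list[bool]:
    """Ask each subtask in order, revealing the correct answer afterwards.

    `agent` receives the current question plus the correct answers to all
    earlier subtasks, and returns its response as a string.
    """
    revealed: list[str] = []
    results: list[bool] = []
    for subtask in subtasks:
        response = agent(subtask.question, list(revealed))
        results.append(is_correct(response, subtask.answer))
        revealed.append(subtask.answer)  # guidance for the next step
    return results


# Questions are the ones quoted in the text; the answers are invented.
subtasks = [
    Subtask("What URL parameter is unsanitised?", "id"),
    Subtask("What payload bypasses the input filter?", "' OR 1=1 --"),
    Subtask("What is the final flag?", "FLAG{example}"),
]


def toy_agent(question: str, revealed: list[str]) -> str:
    # Stand-in for a model: knows only the first answer.
    return "id" if not revealed else "unknown"


results = run_guided(subtasks, toy_agent)
print(results)  # [True, False, False]
```

Because each correct answer is revealed regardless of the agent's response, a failure on one subtask does not block the agent from attempting the next.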
Each evaluation run instantiates a fresh Kali Linux Docker container as the agent's working environment, plus -- where the task requires -- additional containers hosting the target services (a vulnerable web application, a remote binary server, etc.). Containers communicate over an isolated Docker network. Kali Linux was chosen because its default toolchain includes the standard penetration-testing utilities (Burp Suite, John the Ripper, Ghidra, gdb-peda, pwntools, nmap, sqlmap, and similar) that human CTF players rely on, lowering the chance that an agent fails purely for lack of tooling.[^2]
The reference Cybench agent issues bash commands inside the container and observes their stdout/stderr output. Its prompt is structured into five fields per turn -- Reflection, Plan and Status, Thought, Log, and Action -- a pattern adapted from earlier work on coding agents. The agent's memory is truncated to the initial system prompt plus the last three (response, observation) pairs to fit within model context windows; the original paper used 6,000 input tokens and 2,000 output tokens per call.[^2]
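The memory rule described above can be sketched in a few lines. This is an illustration, not the reference agent's code: `build_context` and `missing_fields` are hypothetical names, the `Field:` line format is an assumption about how the five fields appear in a response, and the token limits are not enforced here.

```python
FIELDS = ("Reflection", "Plan and Status", "Thought", "Log", "Action")
KEEP_LAST = 3  # number of (response, observation) pairs retained


def build_context(system_prompt: str,
                  history: list[tuple[str, str]],
                  keep_last: int = KEEP_LAST) -> list[str]:
    """System prompt followed by the most recent (response, observation) pairs."""
    recent = history[-keep_last:] if keep_last > 0 else []
    context = [system_prompt]
    for response, observation in recent:
        context.append(response)
        context.append(observation)
    return context


def missing_fields(response: str) -> list[str]:
    """Fields that have no line starting with '<Field>:' in the response."""
    lines = [line.strip() for line in response.splitlines()]
    return [f for f in FIELDS
            if not any(line.startswith(f + ":") for line in lines)]


# Five turns of placeholder history; only the last three should survive.
history = [(f"r{i}", f"o{i}") for i in range(5)]
context = build_context("SYSTEM", history)
print(context)  # ['SYSTEM', 'r2', 'o2', 'r3', 'o3', 'r4', 'o4']

print(missing_fields("Thought: list files\nAction: ls -la"))
# ['Reflection', 'Plan and Status', 'Log']
```

Keeping the system prompt fixed while sliding a window over recent turns means the task description is never evicted, at the cost of the agent forgetting observations older than three turns unless it copies them into its own Log field.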
For unguided runs the agent is allotted fifteen iterations per task. For subtask-guided runs it is allotted five iterations per subtask. The paper also reports ablations using three alternative scaffolds -- an action-only agent, a pseudoterminal interface, and a web-search-enabled variant -- to characterise the sensitivity of results to scaffold design.[^2]
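The two iteration budgets can be expressed as a bounded loop. In the sketch below, `attempt` and `total_guided_budget` are hypothetical helpers, and the rule that the first submitted answer ends the attempt is an assumption of this sketch, not something stated in the text.

```python
from typing import Callable, Optional

UNGUIDED_ITERATIONS = 15    # per task, unguided mode
PER_SUBTASK_ITERATIONS = 5  # per subtask, subtask-guided mode


def attempt(step: Callable[[int], Optional[str]],
            expected: str,
            budget: int) -> tuple[bool, int]:
    """Run `step` for at most `budget` iterations.

    `step(i)` returns None while the agent is still working, or a string
    when it submits an answer. Assumption: the first submission ends the
    attempt. Returns (solved, iterations_used).
    """
    for i in range(1, budget + 1):
        submitted = step(i)
        if submitted is not None:
            return submitted == expected, i
    return False, budget


def total_guided_budget(num_subtasks: int) -> int:
    """Total iterations available across all subtasks of one task."""
    return PER_SUBTASK_ITERATIONS * num_subtasks


def stub(i: int) -> Optional[str]:
    # Stub agent that submits the right flag on its seventh iteration.
    return "FLAG{x}" if i == 7 else None


print(attempt(stub, "FLAG{x}", UNGUIDED_ITERATIONS))     # (True, 7)
print(attempt(stub, "FLAG{x}", PER_SUBTASK_ITERATIONS))  # (False, 5)
print(total_guided_budget(4))                            # 20
```

One consequence of per-subtask budgets is that a task with four or more subtasks gets more total iterations in guided mode than the fifteen available unguided, which is worth bearing in mind when comparing the two modes.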
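Cybench reports three primary metrics: unguided performance (the share of tasks solved without subtask guidance), subtask-guided performance (the share of tasks whose final flag is captured when subtasks are provided), and subtask performance (the share of subtasks answered correctly).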
Flag strings are matched against a fixed answer key, so scoring is fully automated.
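As a rough illustration of automated scoring, the sketch below computes three aggregates from per-task records: the unguided success rate, the subtask-guided final-step success rate, and the subtask completion percentage. All names and the example records are hypothetical. Two choices are assumptions of this sketch: the guided metrics use only tasks that have subtasks as their denominator, and subtask completion is averaged per task before averaging across tasks.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    name: str
    unguided_solved: bool
    # One entry per subtask, in order; empty if the task has no subtasks.
    subtask_hits: list[bool] = field(default_factory=list)


def flag_matches(submitted: str, key: str) -> bool:
    # Assumption: exact match after trimming whitespace.
    return submitted.strip() == key.strip()


def unguided_rate(results: list[TaskResult]) -> float:
    return sum(r.unguided_solved for r in results) / len(results)


def guided_final_rate(results: list[TaskResult]) -> float:
    """Share of subtask-equipped tasks whose final subtask was answered."""
    guided = [r for r in results if r.subtask_hits]
    return sum(r.subtask_hits[-1] for r in guided) / len(guided)


def subtask_completion(results: list[TaskResult]) -> float:
    """Mean over subtask-equipped tasks of the fraction of subtasks solved."""
    guided = [r for r in results if r.subtask_hits]
    per_task = [sum(r.subtask_hits) / len(r.subtask_hits) for r in guided]
    return sum(per_task) / len(per_task)


# Hypothetical records, not benchmark data.
records = [
    TaskResult("a", True, [True, True, True]),
    TaskResult("b", False, [True, False, False]),
    TaskResult("c", False),
    TaskResult("d", False, [True, True, False, False]),
]

print(unguided_rate(records))                 # 0.25
print(round(guided_final_rate(records), 3))   # 0.333
print(round(subtask_completion(records), 3))  # 0.611
```

In this toy data, task `b` contributes nothing to the final-step rate yet still adds a third of a task to subtask completion, which is the partial-progress signal that subtask scoring is meant to capture.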
The launch paper evaluated eight language-model agents on the full benchmark: GPT-4o, OpenAI o1-preview, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22B Instruct, Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct.[^2]
In unguided mode the top score was 17.5 percent, achieved by Claude 3.5 Sonnet, followed by GPT-4o at 12.5 percent. Claude 3 Opus and OpenAI o1-preview each solved 10 percent of tasks. The remaining models clustered around or below 5 percent.[^2]
The picture was somewhat different in subtask mode. OpenAI o1-preview achieved the highest subtask completion percentage at 46.8 percent, indicating that with intermediate guidance it made considerable partial progress even on tasks it could not finish unaided. Claude 3.5 Sonnet reached 43.9 percent on the same metric. On the binary subtask-guided final-step metric, GPT-4o tied with Claude 3.5 Sonnet for the highest score at 17.5 percent.[^2]
A consistent finding across all models was the first-solve-time wall: no unguided agent solved any task whose first-blood time by human teams exceeded eleven minutes. The hardest task the models could complete -- a comparatively quick HackTheBox web challenge -- still represented a substantial step beyond what easier benchmarks like InterCode-CTF could detect, since most frontier models score above sixty percent on InterCode-CTF tasks. The authors also documented that explicit safety refusals were rare, occurring only five times across all runs (all by Claude 3 Opus), suggesting that existing alignment training does not materially impede CTF-style cyber evaluation.[^2]
Cybench is updated continuously by the original authors and third-party evaluators. By late 2025 the public leaderboard at cybench.github.io tracked many more frontier models, with scores having risen substantially as agentic scaffolding and base-model capability both improved.[^4]
Anthropic was the most frequent vendor reporter through 2025-2026, including Cybench numbers in the system cards for Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude 4 (Opus and Sonnet), Claude Opus 4.1, Claude Sonnet 4.5, Claude Haiku 4.5, Claude Opus 4.5, Claude Opus 4.6, and Claude Opus 4.7.[^5] Anthropic's Claude Opus 4.1 system card, released in August 2025, reported that the model solved 18 of a 35-task Cybench subset, up from 16 of 35 for Claude Opus 4.[^5] By the Claude Opus 4.6 system card (February 2026), Anthropic reported that the model had effectively saturated Cybench, achieving approximately 100 percent under a pass@30 evaluation and prompting the company to move primary cyber-capability tracking to larger successor benchmarks such as CyberGym.[^8]
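For readers unfamiliar with the notation, pass@k is read here as "solved in at least one of k attempts". That reading is an assumption, since the system card's exact protocol is not reproduced in this article. The sketch below uses a hypothetical `pass_at_k` helper and invented outcomes to show why scores rise with k.

```python
def pass_at_k(attempts: dict[str, list[bool]], k: int) -> float:
    """Share of tasks solved in at least one of the first k attempts."""
    if k < 1:
        raise ValueError("k must be at least 1")
    solved = sum(any(outcomes[:k]) for outcomes in attempts.values())
    return solved / len(attempts)


# Invented outcomes for three hypothetical tasks, thirty attempts each.
attempts = {
    "task_a": [False] * 29 + [True],  # solved only on the last attempt
    "task_b": [True] + [False] * 29,  # solved on the first attempt
    "task_c": [False] * 30,           # never solved
}

print(round(pass_at_k(attempts, 1), 3))   # 0.333
print(round(pass_at_k(attempts, 30), 3))  # 0.667
```

A pass@30 figure is therefore not directly comparable with the single-attempt percentages reported in the original 2024 study.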
OpenAI's GPT-5 family system cards likewise included Cybench, with the August 2025 GPT-5 system card and the December 2025 GPT-5.2 update reporting substantially higher unguided scores than the original 2024 numbers and using Cybench as one of the inputs to OpenAI's cyber capability-level classification under the Preparedness Framework.[^9] OpenAI classified GPT-5.5 as "High" capability in the cybersecurity domain (below "Critical") in late 2025, citing Cybench alongside internal capture-the-flag and operational-research evaluations.[^10]
Google DeepMind's Gemini 2.5 Deep Think model card (August 2025) reported that the model's Cybench scores exceeded prior Gemini releases while still failing on the hardest tasks -- characterised in the model card as those most representative of real-world penetration scenarios.[^11] Reports for Gemini 2.5 Pro and subsequent Gemini models treated Cybench as part of the cyber capability evaluation suite under the Frontier Safety Framework without reaching Critical Capability Levels in cyber. xAI published Cybench numbers for Grok 4 in its xAI Risk Management Framework, and Amazon included Cybench results in the Amazon Nova Premier Model Card.[^4][^6]
Cybench occupies an unusually central place in the AI safety evaluation ecosystem because it is one of very few open-source, professional-level cyber-offense benchmarks and because cyber-uplift is one of the canonical capability domains tracked by every major frontier risk framework.
Cybench is one of several CTF-style benchmarks that have been used to evaluate language-model agents on offensive cyber tasks. The most commonly compared peers are:
Cybench's distinguishing characteristics are its calibration to professional-level competition difficulty, its objective first-blood difficulty proxy, its subtask annotations, and its widespread adoption by frontier labs and AI safety institutes.
Cybench was well received within the cyber and AI-safety research communities and rapidly became a de facto reference for cyber capability evaluation. Reviewers at ICLR 2025 praised its dataset construction and dual-use framing, and it has been cited heavily in subsequent cyber-benchmark papers. The benchmark's adoption by US AISI, UK AISI, and every major frontier lab is itself an indicator of its perceived rigour.[^1][^6]
Several limitations have been raised, however, both by the original authors and by external commentators:
Despite these limitations, Cybench is widely regarded as a reasonable open-source signal for tracking the cyber-offense capability frontier, particularly when combined with internal harder evaluations and with newer benchmarks such as CyberGym and BountyBench.