Cybench

Updated Jul 16, 2026

Cybench (short for Cybersecurity benchmark) is an open-source evaluation framework for measuring the cybersecurity capabilities and risks of large language model agents. It comprises forty professional-level capture-the-flag (CTF) challenges drawn from four high-profile cybersecurity competitions, together with seventeen tasks that have been further decomposed into intermediate subtasks for fine-grained scoring. The benchmark was released in August 2024 by a team of researchers based primarily at Stanford University, with additional contributors from UC Berkeley and the wider security community, and was subsequently accepted to the Thirteenth International Conference on Learning Representations (ICLR) in 2025.^[1]^[2]

The framework was designed to address what its authors describe as a gap between earlier CTF-style benchmarks aimed at educational or amateur audiences (such as InterCode-CTF and NYU CTF Bench) and the kinds of professional-grade challenges used by industry practitioners. Each Cybench task is scaffolded so that an autonomous agent runs inside a cybersecurity sandbox -- a Kali Linux Docker container with full shell access -- and must locate a hidden flag string by exploiting vulnerabilities across web applications, cryptographic primitives, binary executables, network protocols, and forensic artefacts.^[3]

Since its release, Cybench has become one of the most widely cited cyber-offense benchmarks for frontier language models. It is the only open-source cyber benchmark included in the joint pre-deployment tests run by the UK AISI and US AISI on Anthropic's Claude 3.5 Sonnet and OpenAI's o1, and appears as a recurring evaluation in the system cards of Anthropic, OpenAI, Google DeepMind, xAI, and Amazon frontier models.^[4]^[5]^[6] It has also saturated quickly: vendor-reported scores climbed from around 17.5 percent in the original 2024 study to near-ceiling figures in 2025-2026 system cards, prompting a wave of larger successor benchmarks.

Key facts

Attribute	Detail
Full name	Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models
Initial release	15 August 2024 (arXiv preprint v1)
Publication venue	ICLR 2025
Lead author	Andy K. Zhang (Stanford University)
Senior authors	Dan Boneh, Daniel E. Ho, Percy Liang
Total authors	27
arXiv identifier	2408.08926
Code repository	github.com/andyzorigin/cybench
Project website	cybench.github.io
License	Apache 2.0
Total tasks	40
Subtask-equipped tasks	17
Source competitions	HackTheBox Cyber Apocalypse 2024, SekaiCTF 2022 and 2023, Glacier 2023, HKCert 2023
Categories	Cryptography, web exploitation, reverse engineering, binary exploitation (pwn), forensics, miscellaneous
Human first-solve range	~2 minutes to ~24 hours 54 minutes
Execution environment	Kali Linux Docker container
Recognition	First Prize, SafeBench (Center for AI Safety, 2024)

Background and motivation

By mid-2024 it had become clear that frontier language models, when wrapped in tool-using agent scaffolds, could carry out non-trivial chains of programming and shell actions. Empirical work on offensive cyber capability lagged this trend: existing benchmarks were largely composed of beginner-level challenges. InterCode-CTF, the most widely used predecessor, was built from picoCTF challenges intended for high-school competitors and had an average human solve time of roughly three and a half minutes. NYU CTF Bench, released in June 2024, broadened coverage to university-level CSAW tasks but still concentrated on tasks substantially below the difficulty seen at flagship competitions.^[7]

The Cybench team -- centred at the Stanford Center for Research on Foundation Models (CRFM) and the Stanford RegLab -- argued that a credible cyber-risk evaluation should target professional-level competitions whose first-blood solve times by elite human teams already span minutes to days, since such tasks are closer to the difficulty of real-world penetration testing and vulnerability research. They further argued that purely binary success-or-failure scoring discards too much information at the upper end of the difficulty curve, where most models still fail outright; subtask-level scoring would let evaluators track partial progress and detect emerging capabilities before models saturate end-to-end performance.^[1]^[2]

A secondary motivation, repeatedly emphasised in the paper, was the dual-use nature of automated offensive cyber capability. The authors framed Cybench as a tool for policymakers, frontier labs, and external evaluators tracking how quickly autonomous exploit-finding ability is developing, rather than as a leaderboard for hackers. This framing aligned the benchmark with the cyber-risk threshold work being conducted under Anthropic's Responsible Scaling Policy, OpenAI's Preparedness Framework, and Google DeepMind's Frontier Safety Framework.

Dataset composition

Source competitions

The forty tasks in Cybench are drawn from four competitions selected for being recent, having strong public reputations, and supplying challenges across the full standard CTF category set.^[2]

HackTheBox -- Cyber Apocalypse 2024. An annual event run by the HackTheBox platform that attracts thousands of teams. Cybench draws a large subset of its tasks from this competition because of its breadth across categories and the availability of high-quality first-blood timing data.
SekaiCTF 2022 and 2023. A competition organised by Project Sekai CTF, known for sophisticated cryptographic and web challenges. Cybench includes tasks from both editions.
Glacier 2023. A high-difficulty event organised by the LosFuzzys team in Austria. Glacier challenges contribute several of the hardest tasks in the dataset.
HKCert CTF 2023. Run by the Hong Kong Computer Emergency Response Team Coordination Center, supplying tasks across cryptography, web, forensics, and reverse engineering.

Difficulty calibration

The Cybench paper emphasises first-blood time -- the time elapsed between the start of a competition and the first successful flag submission by any human team -- as an objective, externally validated proxy for difficulty. Across the forty tasks, first-blood times span roughly two minutes to twenty-four hours and fifty-four minutes, a ratio of more than 700x. The authors show that first-blood time is a strong predictor of whether language-model agents can solve a task without help: in the initial evaluation, no model succeeded on any task whose first-blood time exceeded eleven minutes when run without subtask guidance.^[2]
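The spread quoted above is simple arithmetic, and the eleven-minute observation can be read as a filter over per-task records. The sketch below is illustrative only: the task names and outcomes in `sample` are made up, and `FIRST_BLOOD_WALL` and `solved_beyond_wall` are hypothetical names, not part of the Cybench codebase.

```python
from datetime import timedelta

# First-blood extremes quoted in the text.
fastest = timedelta(minutes=2)
slowest = timedelta(hours=24, minutes=54)

# Dividing two timedeltas yields a plain float.
ratio = slowest / fastest
print(f"difficulty spread: {ratio:.0f}x")  # 1494 min / 2 min = 747x

# Threshold reported for the initial unguided evaluation.
FIRST_BLOOD_WALL = timedelta(minutes=11)

def solved_beyond_wall(records):
    """Names of tasks solved unguided despite a first-blood time above the wall."""
    return [name for name, first_blood, solved in records
            if solved and first_blood > FIRST_BLOOD_WALL]

# Hypothetical records: (task name, first-blood time, solved unguided?)
sample = [
    ("task_a", timedelta(minutes=2), True),
    ("task_b", timedelta(minutes=11), True),
    ("task_c", timedelta(hours=5), False),
]
print(solved_beyond_wall(sample))  # an empty list matches the reported pattern
```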

Subtasks

Seventeen of the forty tasks are augmented with subtasks -- a sequence of intermediate questions whose answers represent natural sub-goals on the path to the final flag. For instance, a web exploitation task might be broken down into "What URL parameter is unsanitised?", "What payload bypasses the input filter?", and "What is the final flag?". Each subtask has an answer key against which the agent's response is automatically scored. Subtasks were authored manually by the Cybench team, often in consultation with the original challenge authors, and serve two purposes: they allow finer-grained measurement of partial progress, and they support a "subtask-guided" evaluation mode in which the agent is given the previous subtask's correct answer before attempting the next step.^[2]
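One way to picture a subtask-equipped task is as an ordered list of question and answer pairs. The following is a minimal sketch under stated assumptions: the class and function names are hypothetical, the answers in the demo task are invented, and exact matching after trimming whitespace is an assumed comparison rule rather than the documented one.

```python
from dataclasses import dataclass, field

@dataclass
class Subtask:
    question: str
    answer: str  # entry in the answer key

@dataclass
class Task:
    name: str
    subtasks: list[Subtask] = field(default_factory=list)

def score_subtask(subtask: Subtask, response: str) -> bool:
    """Compare a response to the answer key (assumed rule: exact match after trimming)."""
    return response.strip() == subtask.answer

def guided_context(task: Task, index: int) -> list[tuple[str, str]]:
    """Question/answer pairs revealed to the agent before it attempts subtask `index`."""
    return [(s.question, s.answer) for s in task.subtasks[:index]]

# Demo task built from the example questions in the text; answers are invented.
demo = Task("demo_web", [
    Subtask("What URL parameter is unsanitised?", "id"),
    Subtask("What payload bypasses the input filter?", "' OR 1=1--"),
    Subtask("What is the final flag?", "FLAG{example}"),
])

print(score_subtask(demo.subtasks[0], " id\n"))
print(guided_context(demo, 2))
```

In guided mode the agent attempting the final subtask would see the first two answers, which is why guided scores measure something different from end-to-end success.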

Methodology

Execution environment

Each evaluation run instantiates a fresh Kali Linux Docker container as the agent's working environment, plus -- where the task requires -- additional containers hosting the target services (a vulnerable web application, a remote binary server, etc.). Containers communicate over an isolated Docker network. Kali Linux was chosen because its default toolchain includes the standard penetration-testing utilities (Burp Suite, John the Ripper, Ghidra, gdb-peda, pwntools, nmap, sqlmap, and similar) that human CTF players rely on, lowering the chance that an agent fails purely for lack of tooling.^[2]
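The container layout can be sketched as a list of Docker CLI invocations. This is not the Cybench launcher: the function name, container names, and target image are hypothetical, and using `--internal` to express "isolated" is an assumption. The sketch only builds the argument lists and does not execute anything.

```python
def isolated_run_plan(run_id, agent_image, target_images):
    """Build (but do not run) Docker commands for one evaluation run."""
    network = f"cybench-{run_id}"
    # An internal network has no route to the outside world.
    plan = [["docker", "network", "create", "--internal", network]]
    # One container per target service the task requires.
    for i, image in enumerate(target_images):
        plan.append(["docker", "run", "--rm", "--detach",
                     "--network", network,
                     "--name", f"target-{run_id}-{i}", image])
    # The agent container idles so that commands can be executed inside it.
    plan.append(["docker", "run", "--rm", "--detach",
                 "--network", network,
                 "--name", f"agent-{run_id}", agent_image,
                 "sleep", "infinity"])
    return plan

# Image names are placeholders for illustration.
for argv in isolated_run_plan("r1", "kalilinux/kali-rolling",
                              ["example/vulnerable-web:latest"]):
    print(" ".join(argv))
```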

Agent scaffolding

The reference Cybench agent issues bash commands inside the container and observes their stdout/stderr output. Its prompt is structured into five fields per turn -- Reflection, Plan and Status, Thought, Log, and Action -- a pattern adapted from earlier work on coding agents. The agent's memory is truncated to the initial system prompt plus the last three (response, observation) pairs to fit within model context windows; the original paper used 6,000 input tokens and 2,000 output tokens per call.^[2]
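The memory truncation described above amounts to keeping the system prompt and a sliding window over the most recent turns. A minimal sketch, assuming the history is stored as a list of (response, observation) tuples; `build_context` is a hypothetical helper, not the reference agent's code, and token limits are not modelled.

```python
def build_context(system_prompt, history, keep_last=3):
    """Return the system prompt followed by the last `keep_last` turns.

    `history` is a list of (response, observation) tuples, oldest first.
    """
    # Guard keep_last == 0, because history[-0:] would return everything.
    recent = history[-keep_last:] if keep_last > 0 else []
    messages = [system_prompt]
    for response, observation in recent:
        messages.append(response)
        messages.append(observation)
    return messages

history = [(f"response {i}", f"observation {i}") for i in range(5)]
for message in build_context("system prompt", history):
    print(message)  # turns 0 and 1 have been dropped
```

Anything the agent learned before the window, such as an early directory listing, survives only if the agent copied it into a later response, which is one reason the prompt includes running Plan and Status and Log fields.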

For unguided runs the agent is allotted fifteen iterations per task. For subtask-guided runs it is allotted five iterations per subtask. The paper also reports ablations using three alternative scaffolds -- an action-only agent, a pseudoterminal interface, and a web-search-enabled variant -- to characterise the sensitivity of results to scaffold design.^[2]
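The iteration budgets can be expressed as a bounded loop. In this sketch `step` stands in for one full agent turn (model call plus command execution) and returns a submitted answer or `None`; the names are hypothetical, and continuing after a wrong submission is an assumption, since the text does not say how such submissions are handled.

```python
from typing import Callable, Optional

UNGUIDED_ITERATIONS = 15  # per task
SUBTASK_ITERATIONS = 5    # per subtask

def run_with_budget(step: Callable[[int], Optional[str]],
                    expected: str, budget: int) -> bool:
    """Call `step` at most `budget` times; succeed if it ever returns `expected`."""
    for i in range(budget):
        answer = step(i)
        if answer is not None and answer.strip() == expected:
            return True
    return False

def guided_budget(num_subtasks: int) -> int:
    """Total iterations available across a subtask-guided run."""
    return SUBTASK_ITERATIONS * num_subtasks

# A stand-in agent that only finds the flag on its final allowed turn.
late_solver = lambda i: "FLAG{x}" if i == 14 else None
print(run_with_budget(late_solver, "FLAG{x}", UNGUIDED_ITERATIONS))
print(guided_budget(4))
```

Under these budgets a task with four or more subtasks gives the guided agent at least twenty iterations in total, more than the fifteen available unguided, which is worth keeping in mind when comparing the two modes.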

Scoring

Cybench reports three primary metrics:

Unguided success rate. The fraction of the forty tasks the agent solves end-to-end without subtask hints.
Subtask-guided success rate. Among the seventeen subtask-equipped tasks, the fraction on which the agent solves the final subtask when previous subtask answers are revealed sequentially.
Subtask completion percentage. The mean fraction of subtasks completed across the seventeen subtask-equipped tasks.

Flag strings are matched against a fixed answer key, so scoring is fully automated.
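The three metrics can be computed from per-task records roughly as follows. This is a sketch rather than the benchmark's scoring code: `TaskResult` and the function names are hypothetical, and the sample results are invented.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    name: str
    unguided_solved: bool
    # Guided-mode outcome for each subtask, in order; empty if the task has none.
    subtask_outcomes: tuple[bool, ...] = ()

def unguided_success_rate(results):
    """Fraction of all tasks solved end-to-end without hints."""
    return sum(r.unguided_solved for r in results) / len(results)

def subtask_guided_success_rate(results):
    """Among subtask-equipped tasks, fraction whose final subtask was solved."""
    guided = [r for r in results if r.subtask_outcomes]
    return sum(r.subtask_outcomes[-1] for r in guided) / len(guided)

def subtask_completion_percentage(results):
    """Mean per-task fraction of subtasks completed, as a percentage."""
    guided = [r for r in results if r.subtask_outcomes]
    per_task = [sum(r.subtask_outcomes) / len(r.subtask_outcomes) for r in guided]
    return 100 * sum(per_task) / len(per_task)

# Invented results for three tasks, two of which have subtasks.
results = [
    TaskResult("a", True, (True, True, True)),
    TaskResult("b", False, (True, False, False, False)),
    TaskResult("c", False),
]
print(unguided_success_rate(results))
print(subtask_guided_success_rate(results))
print(subtask_completion_percentage(results))
```

Note that the denominators differ: the first metric divides by all tasks, the other two only by subtask-equipped tasks. For scale, an unguided rate of 17.5 percent on forty tasks corresponds to seven solved tasks.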

Initial results (August 2024)

The launch paper evaluated eight language-model agents on the full benchmark: GPT-4o, OpenAI o1-preview, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22B Instruct, Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct.^[2]

In unguided mode the top score was 17.5 percent, achieved by Claude 3.5 Sonnet, followed by GPT-4o at 12.5 percent. Claude 3 Opus and OpenAI o1-preview each solved 10 percent of tasks. The remaining models clustered around or below 5 percent.^[2]

The picture was somewhat different in subtask mode. OpenAI o1-preview achieved the highest subtask completion percentage at 46.8 percent, indicating that with intermediate guidance it made considerable partial progress even on tasks it could not finish unaided. Claude 3.5 Sonnet reached 43.9 percent on the same metric. On the binary subtask-guided final-step metric, GPT-4o tied with Claude 3.5 Sonnet for the highest score at 17.5 percent.^[2]

A consistent finding across all models was the first-solve-time wall: no unguided agent solved any task whose first-blood time by human teams exceeded eleven minutes. The hardest task the models could complete -- a comparatively quick HackTheBox web challenge -- still represented a substantial step beyond what easier benchmarks like InterCode-CTF could detect, since most frontier models score above sixty percent on InterCode-CTF tasks. The authors also documented that explicit safety refusals were rare, occurring only five times across all runs (all by Claude 3 Opus), suggesting that existing alignment training does not materially impede CTF-style cyber evaluation.^[2]

Subsequent results and leaderboard

Cybench is updated continuously by the original authors and third-party evaluators. By late 2025 the public leaderboard at cybench.github.io tracked many more frontier models, with scores having risen substantially as agentic scaffolding and base-model capability both improved.^[4]

Anthropic was the most frequent vendor reporter through 2025-2026, including Cybench numbers in the system cards for Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude 4 (Opus and Sonnet), Claude Opus 4.1, Claude Sonnet 4.5, Claude Haiku 4.5, Claude Opus 4.5, Claude Opus 4.6, and Claude Opus 4.7.^[5] Anthropic's Claude Opus 4.1 system card, released in August 2025, reported that the model solved 18 of a 35-task Cybench subset, up from 16 of 35 for Claude Opus 4.^[5] By the Claude Opus 4.6 system card (February 2026), Anthropic reported that the model had effectively saturated Cybench, achieving approximately 100 percent under a pass@30 evaluation and prompting the company to move primary cyber-capability tracking to larger successor benchmarks such as CyberGym.^[8]
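The figures in this paragraph mix plain solve counts with a pass@30 score. The sketch below converts the counts to percentages and illustrates one common reading of pass@k, in which a task counts as solved if at least one of k attempts succeeds. That reading is an assumption (the system card's exact procedure is not described here), and the attempt data and the `pass_at_k` helper are invented for illustration.

```python
def pass_at_k(attempts_by_task: dict[str, list[bool]], k: int) -> float:
    """Fraction of tasks with at least one success among the first k attempts."""
    if not attempts_by_task:
        raise ValueError("no tasks supplied")
    solved = sum(any(attempts[:k]) for attempts in attempts_by_task.values())
    return solved / len(attempts_by_task)

# Solve counts quoted in the text, as percentages of the 35-task subset.
print(f"Claude Opus 4:   {16 / 35:.1%}")  # 45.7%
print(f"Claude Opus 4.1: {18 / 35:.1%}")  # 51.4%

# Invented attempt logs: more attempts can only raise the score.
attempts = {
    "t1": [False, False, True],
    "t2": [False, False, False],
    "t3": [True, False, False],
}
print(pass_at_k(attempts, 1))
print(pass_at_k(attempts, 3))
```

Because pass@k is non-decreasing in k, a near-100 percent pass@30 figure is not directly comparable to single-attempt numbers such as the 2024 results.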

OpenAI's GPT-5 family system cards likewise included Cybench, with the August 2025 GPT-5 system card and the December 2025 GPT-5.2 update reporting substantially higher unguided scores than the original 2024 numbers and using Cybench as one of the inputs to OpenAI's cyber capability-level classification under the Preparedness Framework.^[9] OpenAI classified GPT-5.5 as "High" capability in the cybersecurity domain (below "Critical") in late 2025, citing Cybench alongside internal capture-the-flag and operational-research evaluations.^[10]

Google DeepMind's Gemini 2.5 Deep Think model card (August 2025) reported that the model's Cybench scores exceeded prior Gemini releases while still failing on the hardest tasks -- characterised in the model card as those most representative of real-world penetration scenarios.^[11] Gemini 2.5 Pro and subsequent Gemini reports treated Cybench as part of the cyber capability evaluation suite under the Frontier Safety Framework without reaching Critical Capability Levels in cyber. xAI published Cybench numbers for Grok 4 in its xAI Risk Management Framework, and Amazon included Cybench results in the Amazon Nova Premier Model Card.^[4]^[6]

Role in AI safety evaluation

Cybench occupies an unusually central place in the AI safety evaluation ecosystem because it is one of very few open-source, professional-level cyber-offense benchmarks and because cyber-uplift is one of the canonical capability domains tracked by every major frontier risk framework.

Anthropic Responsible Scaling Policy. Cyber capability is a tracked Capability Threshold under the Responsible Scaling Policy. Cybench scores appear in every Claude system card from late 2024 onwards as part of the evidence base for ASL classification.^[5]
OpenAI Preparedness Framework. Cyber is one of the original tracked categories under the Preparedness Framework. Cybench is included in OpenAI's Scalable Evaluations suite alongside internal CTF authoring and operational cyber tasks.^[9]
Google DeepMind Frontier Safety Framework. Cyber is a Critical Capability domain under the Frontier Safety Framework. DeepMind's Gemini model cards consistently include Cybench results as inputs to cyber CCL classification.^[11]
Joint pre-deployment testing by AI Safety Institutes. The UK AISI and US AISI used Cybench as the sole open-source cyber benchmark in their joint pre-deployment evaluations of Claude 3.5 Sonnet and OpenAI o1 in late 2024 and early 2025.^[6]
Inspect Evals. UK AISI has packaged Cybench as a first-class evaluation in Inspect Evals, the open-source companion library to UK AISI's Inspect AI framework. This packaging makes it straightforward for third-party evaluators to run Cybench against new models in a reproducible Docker- or Kubernetes-based sandbox.^[12]
SafeBench. The Center for AI Safety awarded Cybench a First Prize in its SafeBench benchmark competition, citing its applicability to frontier-model evaluation and the rigour of its dataset construction.^[13]
METR. Model Evaluation and Threat Research (METR) has used Cybench as a reference benchmark alongside its own task suite, particularly when characterising the cyber subset of its broader autonomy evaluations.

Comparison to other cyber benchmarks

Cybench is one of several CTF-style benchmarks that have been used to evaluate language-model agents on offensive cyber tasks. The most commonly compared peers are:

InterCode-CTF (2023). The predecessor benchmark from Princeton, drawn entirely from picoCTF and aimed at high-school competitors. Average human solve time is about three and a half minutes per task. Frontier models routinely score above 60 percent zero-shot on InterCode-CTF, and saturation occurred quickly.^[7]
NYU CTF Bench (June 2024). A university-level benchmark constructed from NYU CSAW CTF challenges. It is broader than InterCode-CTF but still substantially easier than the Cybench distribution. Comparative studies report that the same Qwen-3-32B model that solves 4.5 percent of NYU CTF tasks zero-shot solves 12.5 percent of Cybench tasks and 69.7 percent of InterCode-CTF tasks -- illustrating the difficulty stacking among the three benchmarks.^[7]
CyberGym (2025). A much larger UC Berkeley benchmark released in 2025 containing roughly 1,500 tasks. CyberGym is increasingly used by frontier labs once they saturate Cybench.^[8]
CAIBench (October 2025). A meta-benchmark that combines knowledge questions, Jeopardy-style CTFs, cyber-range scenarios, and attack-and-defence games into a unified evaluation, drawing on Cybench among other sources.^[14]
BountyBench. A companion project to Cybench from largely the same Stanford team, evaluating offensive and defensive agents on real-world software vulnerability scenarios rather than competition CTFs.

Cybench's distinguishing characteristics are its calibration to professional-level competition difficulty, its objective first-blood difficulty proxy, its subtask annotations, and its widespread adoption by frontier labs and AI safety institutes.

Reception and criticism

Cybench was well received within the cyber and AI-safety research communities and rapidly became a de facto reference for cyber capability evaluation. Reviewers at ICLR 2025 praised its dataset construction and dual-use framing, and it has been cited heavily in subsequent cyber-benchmark papers. The benchmark's adoption by US AISI, UK AISI, and every major frontier lab is itself an indicator of its perceived rigour.^[1]^[6]

Several limitations have been raised, however, both by the original authors and by external commentators:

Distance from nation-state-level cyber capability. Competition CTFs typically use synthetic vulnerabilities placed in deliberately constrained environments. Critics point out that solving such tasks may not generalise to discovering novel vulnerabilities in production software, evading mature defensive monitoring, or chaining exploits through real-world infrastructure -- the capabilities most relevant to catastrophic cyber risk. Cybench scores are therefore at best an indirect proxy for real-world offensive capability.^[15]
Saturation pressure. By 2026, leading models report near-ceiling unguided scores on Cybench, reducing its discriminative power. This has motivated Anthropic and others to migrate primary cyber-capability tracking to CyberGym and to internal harder evaluations.^[8]
Training-data contamination. Because all four source competitions publish writeups and solutions after the event, there is concern that frontier models trained on web data have memorised solutions for at least some tasks. The original paper notes this risk and discusses partial mitigations (e.g., using post-cutoff competitions for newer evaluations) but acknowledges that contamination cannot be fully ruled out.^[2]
Scaffold sensitivity. Cybench scores depend non-trivially on scaffold design (tool access, memory window, iteration budget). The original paper documents this with scaffold ablations, but cross-paper comparisons can be misleading if scaffolds differ.^[2]
Coverage gaps. Cybench focuses on offensive capability against narrow targets; it does not evaluate defensive, blue-team, social-engineering, or operational-security capabilities, all of which are part of a full cyber threat picture.

Despite these limitations, Cybench is widely regarded as a reasonable open-source signal for tracking the cyber-offense capability frontier, particularly when combined with internal harder evaluations and with newer benchmarks such as CyberGym and BountyBench.

References

Zhang, Andy K., et al. "Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models." OpenReview, ICLR 2025. https://openreview.net/forum?id=tc90LV0yRL
Zhang, Andy K., et al. "Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models." arXiv:2408.08926, 15 August 2024 (latest version April 2025). https://arxiv.org/abs/2408.08926
Cybench project website. https://cybench.github.io/
Cybench leaderboard. https://cybench.github.io/
Anthropic. "System Card: Claude Opus 4.1." August 2025. https://www.anthropic.com/claude-opus-4-1-system-card
AI Safety Institute (UK). "Pre-deployment evaluation of Anthropic's Claude 3.5 Sonnet (joint with US AISI)." https://www.aisi.gov.uk/
NYU CTF Bench. "A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security." arXiv:2406.05590. https://arxiv.org/abs/2406.05590
Anthropic. "System Card: Claude Opus 4.6." February 2026. https://www.anthropic.com/claude-opus-4-6-system-card
OpenAI. "GPT-5 System Card." August 2025. https://cdn.openai.com/gpt-5-system-card.pdf
OpenAI. "GPT-5.5 System Card -- Cybersecurity." Deployment Safety Hub. https://deploymentsafety.openai.com/gpt-5-5/cybersecurity
Google DeepMind. "Gemini 2.5 Deep Think Model Card." August 2025. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Deep-Think-Model-Card.pdf
UK AISI. "Cybench (CTFs) -- Inspect Cyber." https://inspect.cyber.aisi.org.uk/cybench.html
Center for AI Safety. "Announcing the SafeBench Winners." https://www.mlsafety.org/safebench/winners
"CAIBench: A Meta-Benchmark for Evaluating Cybersecurity AI Agents." arXiv:2510.24317. https://arxiv.org/abs/2510.24317
Center for Research on Foundation Models, Stanford. "Cybench blog post." 19 August 2024. https://crfm.stanford.edu/2024/08/19/cybench.html
