SkillsBench

Updated Jun 3, 2026

SkillsBench is a benchmark for measuring whether Agent Skills, the structured packages of procedural knowledge that augment AI agents at inference time, actually improve how well those agents do real work. Introduced in a February 2026 arXiv paper titled "SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks" (arXiv:2602.12670), it treats Skills as first-class artifacts to be evaluated rather than as an undifferentiated part of a prompt.^[1] The benchmark covers 86 tasks across 11 domains, each paired with a curated Skill and a deterministic verifier, and it runs every task under three conditions: no Skills, curated Skills, and Skills the model writes for itself. The central result is blunt. Curated Skills raise the average pass rate by 16.2 percentage points, while self-generated Skills add nothing on average.^[1]^[2] The work is led by Xiangyi Li with around 40 co-authors drawn from academia and industry, and the task suite is maintained as an open-source project by BenchFlow.^[1]^[3]

What SkillsBench measures and why

Agent Skills emerged in late 2025 as a way to give an agent reusable know-how without retraining it. A Skill is typically a folder containing a short set of instructions, optional scripts, and reference material that an agent can load when it recognizes a relevant task. Anthropic shipped the format for Claude and its Claude Code agent, and similar conventions spread to other harnesses such as Google's Gemini CLI and OpenAI's Codex CLI. Community marketplaces sprang up to share thousands of Skills.^[3] The adoption ran ahead of any evidence. People were writing and downloading Skills on the assumption that more procedural context helps, but nobody had a standard way to check whether a given Skill made an agent better, worse, or no different.

SkillsBench exists to answer that question empirically. The authors frame the problem as a measurement gap: Skills are cheap to produce and easy to attach, so the relevant question is not whether an agent can use a Skill but whether the Skill changes the outcome on a task the agent would otherwise attempt unaided. By isolating the Skill as the variable and holding the model and task fixed, the benchmark turns a vague intuition into a number.

The task setup and verifiers

The 86 tasks were assembled through an open call rather than written by the core team alone. According to the paper, 105 contributors from academia and industry submitted 322 candidate tasks, which passed through a two-stage screen of automated validation and human review before 86 were kept.^[1] The survivors are stratified by difficulty into three tiers: 17 core tasks, 43 extended tasks, and 26 extreme tasks. They span 11 domains: Healthcare, Manufacturing, Cybersecurity, Natural Science, Energy, Office and White Collar work, Finance, Media and Content Production, Robotics, Mathematics, and Software Engineering.^[1]^[4] The spread is deliberate. A benchmark drawn only from coding would not say much about whether Skills generalize to, for example, a clinical calculation or a manufacturing tolerance check.

Each task runs inside its own Docker container preloaded with the domain-specific tools and data the task needs, which keeps runs reproducible and isolated from one another.^[2]^[3] The tasks use the Harbor task format and are driven through BenchFlow's evaluation tooling, with the project also offering compatibility paths for other reward and verifier standards.^[3]

Grading is deterministic and avoids an LLM-as-judge. Every task ships a custom pytest suite that checks the agent's output against ground-truth values within defined tolerances, so a pass or fail is a programmatic assertion rather than another model's opinion.^[2]^[3] This matters for a Skills benchmark in particular, because an LLM grader could reward output that merely looks more polished after a Skill is attached. The authors also describe a continuous-integration audit on the Skills themselves, meant to confirm that a curated Skill supplies guidance rather than smuggling in the answer. Without that check, a Skill that simply contained the solution would inflate the measured benefit and tell you nothing about procedural knowledge.

The three conditions

The experimental design is a within-task comparison across three settings:

Condition	What the agent receives
No Skills (baseline)	The task only, with no procedural guidance
Curated Skills	A human-authored Skill relevant to the task
Self-generated Skills	A Skill the model is prompted to write for itself before attempting the task

Running the same task three ways lets the benchmark separate two distinct claims that are often conflated: that procedural knowledge helps, and that a model can produce that knowledge on demand. The first is about consuming Skills; the second is about authoring them.
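The comparison can be sketched as a small aggregation over logged trials. This is an illustrative sketch only: the `Condition` enum, the `(task_id, condition, passed)` record shape, and the helper names are assumptions made for the example, not the benchmark's actual data model.

```python
from collections import defaultdict
from enum import Enum


class Condition(Enum):
    """The three settings each task is run under (names are hypothetical)."""
    NO_SKILLS = "no_skills"
    CURATED = "curated"
    SELF_GENERATED = "self_generated"


def pass_rates(trials):
    """Compute the pass rate, in percent, for each condition.

    `trials` is an iterable of (task_id, Condition, passed) tuples.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for _task_id, condition, ok in trials:
        total[condition] += 1
        passed[condition] += int(ok)
    return {c: 100.0 * passed[c] / total[c] for c in total}


def uplift_pp(rates, condition):
    """Percentage-point change of `condition` relative to the no-Skills baseline."""
    return rates[condition] - rates[Condition.NO_SKILLS]
```

Calling `uplift_pp(rates, Condition.CURATED)` speaks to the first claim, that procedural knowledge helps. Calling `uplift_pp(rates, Condition.SELF_GENERATED)` speaks to the second, that a model can produce that knowledge on demand. Keeping the two as separate numbers is what stops them from being conflated.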

SkillsBench evaluates seven agent-model configurations over 7,308 logged trajectories.^[1] The models tested include Google's Gemini 3 Flash and Gemini 3 Pro, OpenAI's GPT-5.2, and Anthropic's Claude Opus 4.5, Claude Opus 4.6, Claude Sonnet 4.5, and Claude Haiku 4.5.^[2] Full trajectory logging means the run can be inspected after the fact, not just scored.

The key finding: curated versus self-generated Skills

Across every model and configuration, curated Skills helped. The headline figure is a 16.2 percentage point gain in average pass rate over the no-Skills baseline.^[1]^[2] Per model, the uplift ranged from about +13.6 to +23.3 percentage points, so the effect was consistent rather than carried by one outlier.^[2]

Across domains, however, the effect was far from uniform. Healthcare tasks improved the most, by 51.9 percentage points, while Software Engineering tasks gained only 4.5 percentage points.^[1] One reading is that coding agents already carry a great deal of relevant procedure in their training, so an extra Skill is largely redundant, whereas a specialized clinical or scientific procedure is genuinely new information. The benefit was not universal even within the suite: 16 of the 84 tasks used in the delta analysis showed a negative change, meaning the Skill made the agent worse on those tasks.^[1] A Skill is not automatically harmless.
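A per-task delta analysis of the kind that surfaces negative changes can be sketched as follows. The input shape (a mapping from task ID to a list of pass/fail outcomes per condition) and the function names are hypothetical, and the paper's exact aggregation may differ.

```python
def per_task_deltas(baseline, curated):
    """Return {task_id: curated pass rate minus baseline pass rate}, in percentage points.

    `baseline` and `curated` each map a task ID to a list of boolean outcomes.
    Tasks missing from either mapping, or with no trials, are skipped.
    """
    deltas = {}
    for task_id in baseline.keys() & curated.keys():
        base_runs, cur_runs = baseline[task_id], curated[task_id]
        if not base_runs or not cur_runs:
            continue
        base_rate = 100.0 * sum(base_runs) / len(base_runs)
        cur_rate = 100.0 * sum(cur_runs) / len(cur_runs)
        deltas[task_id] = cur_rate - base_rate
    return deltas


def harmed_tasks(deltas):
    """Task IDs where attaching the Skill lowered the pass rate."""
    return sorted(task_id for task_id, delta in deltas.items() if delta < 0)
```

Looking at the distribution of per-task deltas, and not only their mean, is what exposes the tasks where a Skill hurts. An average gain can hide a sizeable minority of regressions.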

The sharper result concerns self-generated Skills. When models were asked to author their own Skill before attempting a task, the average benefit was essentially zero.^[1]^[2] The interpretation the authors draw is pointed: models cannot reliably write the procedural knowledge they clearly benefit from reading. Consuming a good Skill and producing one are different competencies, and current models are good at the first while weak at the second. That asymmetry undercuts a tempting shortcut, the idea that an agent could simply generate its own playbook on the fly and capture the same gains as a curated library.

Two further findings shape how Skills should be written. Focused Skills built from two or three modules outperformed long comprehensive documentation, with the two-to-three-module band giving the largest benefit at about +20.0 percentage points, and compact Skills beating exhaustive ones by close to fourfold in the authors' analysis.^[2] More text is not better; a tight, relevant Skill beats a sprawling manual. The benchmark also found that a smaller model equipped with Skills can match or outperform a larger model without them. In one reported comparison, Claude Haiku 4.5 with Skills reached 27.7 percent while Claude Opus 4.5 without Skills sat at 22.0 percent, a gap of 5.7 percentage points in the smaller model's favor.^[2] Good procedural knowledge can substitute, at least partly, for raw model scale.

Significance

SkillsBench arrives at a moment when Skills are being shared and sold faster than anyone can vet them, and it supplies the first common yardstick for the format. The practical guidance falls out cleanly from the numbers. Curated Skills are worth the effort, especially outside well-trodden domains like software engineering; Skills should be short and focused rather than encyclopedic; and you cannot expect a model to bootstrap its own Skills and get the same lift. The finding that small models with Skills can rival larger ones without them also has a cost angle, since it suggests procedural context is sometimes a cheaper lever than a bigger model.

There are limits worth keeping in view. The pass-rate figures are tied to a specific slate of late-2025 and early-2026 models and a particular task suite, and the same Skill that helps one generation of model may matter less to the next as base capabilities improve. The presence of negative-delta tasks is a reminder that a Skill can interfere as well as assist. Even so, by insisting on deterministic verifiers, isolated environments, and an audit that Skills guide rather than solve, SkillsBench sets a methodological bar for evaluating a fast-moving and previously unmeasured part of the agent stack.^[1]^[3] The paper has been revised twice since its February 2026 debut, and the task suite remains open for new contributions.^[1]^[3]

References

[1] Li, Xiangyi, et al. "SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks." arXiv:2602.12670, submitted 13 February 2026 (revised March 2026). https://arxiv.org/abs/2602.12670
[2] "Introducing SkillsBench: The First Benchmark for Agent Skills." SkillsBench project blog. https://www.skillsbench.ai/blogs/introducing-skillsbench
[3] benchflow-ai/skillsbench. GitHub repository, BenchFlow. https://github.com/benchflow-ai/skillsbench
[4] SkillsBench, arXiv HTML (v1), domain list and task distribution. https://arxiv.org/html/2602.12670v1
