FeatureBench

Updated Jun 3, 2026

FeatureBench is an execution-based benchmark for measuring how well LLM-powered coding agents handle complex, feature-oriented software development rather than bug fixing. It was introduced in a paper titled "FeatureBench: Benchmarking Agentic Coding for Complex Feature Development," submitted to arXiv on 11 February 2026 by Qixing Zhou, Jiacheng Zhang, Haiyang Wang and colleagues at the Institute of Automation, Chinese Academy of Sciences (CASIA) and Huawei Technologies. The first version contains 200 evaluation tasks and 3,825 executable environments drawn from 24 open-source Python repositories. Its headline finding is stark: Claude Opus 4.5, which scores 74.4% on SWE-bench, resolves only 11.0% of FeatureBench tasks, and the strongest configuration tested, GPT-5.1-Codex, reaches just 12.5%.^[1]^[2]^[3]

What FeatureBench measures

Most agentic coding benchmarks ask a model to fix a single bug described in one GitHub issue, usually scoped to one pull request. FeatureBench asks something harder: implement a whole feature. Each task gives the agent a high-level functional description, an explicit interface definition (function signature, import path, and input and output types), a Dockerfile defining the execution environment, and a blacklist of prohibited URLs to stop the agent from simply downloading the answer. The agent then has to produce a working, directly callable module, either by extending an existing codebase or by building it from scratch.^[1]
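The four inputs described above can be pictured as a small record. The sketch below is illustrative only: the class and field names (`FeatureTask`, `InterfaceSpec`, `blocked_urls`) and the example values are hypothetical and are not the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterfaceSpec:
    """Interface the agent must expose (hypothetical field names)."""
    import_path: str  # where the module must be importable from
    signature: str    # function signature with input and output types


@dataclass(frozen=True)
class FeatureTask:
    """One task as described in the text; not the real dataset schema."""
    description: str                       # high-level functional description
    interfaces: tuple[InterfaceSpec, ...]  # explicit interface definitions
    dockerfile: str                        # defines the execution environment
    blocked_urls: frozenset[str] = frozenset()  # URLs the agent may not fetch

    def is_url_allowed(self, url: str) -> bool:
        # Assumption: a simple prefix match; a real harness may filter differently.
        return not any(url.startswith(blocked) for blocked in self.blocked_urls)


# Made-up example values for illustration.
task = FeatureTask(
    description="Add a sliding-window tokenizer.",
    interfaces=(InterfaceSpec("pkg.tokenize", "def window(text: str, n: int) -> list[str]"),),
    dockerfile="FROM python:3.11-slim",
    blocked_urls=frozenset({"https://github.com/example/repo"}),
)
print(task.is_url_allowed("https://github.com/example/repo/pull/1"))  # False
```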

The authors split tasks into two difficulty levels. Level 1 (L1) asks the agent to add a feature incrementally to an existing repository; Level 2 (L2) requires building the same functionality from scratch. Both are graded the same way, by running tests.

These are not small jobs. An average FeatureBench task touches 790.2 lines of code across 15.7 files and 29.2 functions, with a problem statement averaging 4,818 words. The equivalent SWE-bench figures are roughly 32.8 lines, 1.7 files, 3 functions, and a 195-word statement. FeatureBench tasks are therefore roughly 9 to 25 times larger on every axis, which is most of why scores collapse.^[1]
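As a quick check on the scale gap, the ratios can be computed directly from the averages quoted above. The dictionary keys are just labels chosen for this sketch.

```python
# Average per-task figures quoted in the paragraph above.
featurebench = {"lines": 790.2, "files": 15.7, "functions": 29.2, "prompt_words": 4818}
swe_bench = {"lines": 32.8, "files": 1.7, "functions": 3, "prompt_words": 195}

# Ratio of FeatureBench to SWE-bench on each axis.
ratios = {axis: featurebench[axis] / swe_bench[axis] for axis in featurebench}

for axis, ratio in ratios.items():
    print(f"{axis:>12}: {ratio:.1f}x")
# lines 24.1x, files 9.2x, functions 9.7x, prompt_words 24.7x
```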

How the execution-based evaluation works

FeatureBench borrows its grading protocol from SWE-bench: every task ships with fail-to-pass (F2P) tests and pass-to-pass (P2P) tests. F2P tests fail on the unfinished repository and should pass once the feature is correctly built; P2P tests must pass both before and after the agent's work, which catches solutions that implement the new feature but break something else. A task counts as resolved only when the agent's solution passes all associated tests. Because grading is purely execution-based, there is no LLM-as-a-judge step and no human scoring of partial credit, which the authors argue removes a major source of ambiguity in feature-development evaluation.^[1]
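The all-or-nothing rule is simple to state in code. The following is a minimal sketch of the criterion as described here, not the benchmark's actual harness; the function name and the list-of-booleans input format are assumptions.

```python
def is_resolved(f2p_passed: list[bool], p2p_passed: list[bool]) -> bool:
    """Return True only if every F2P test and every P2P test passed.

    Assumption: a task with no F2P tests is never counted as resolved.
    """
    return bool(f2p_passed) and all(f2p_passed) and all(p2p_passed)


# Feature works and nothing else broke: resolved.
print(is_resolved([True, True], [True, True, True]))   # True
# Feature works but a previously passing test now fails: not resolved.
print(is_resolved([True, True], [True, False, True]))  # False
# Feature only partly works: not resolved, no partial credit.
print(is_resolved([True, False], [True, True, True]))  # False
```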

The benchmark reports three metrics: Resolved Rate (the fraction of tasks fully solved, the headline number, defined exactly as in SWE-bench), Passed Rate (the average fraction of F2P tests passed per task, a softer signal of partial progress), and Token I/O (the average tokens consumed, a proxy for cost). The token figures are large; running the full set with Claude Opus 4.5 averaged about 7.5 million input tokens per task, which is part of why the authors also released a smaller subset. Note that the 3,825 figure refers to executable Docker environments, not evaluation tasks: the 200 headline tasks are the hardest, most reliable instances filtered from that larger pool, and the surplus environments let the benchmark be regenerated and expanded over time.^[1]
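To make the difference between the two rates concrete, here is a small sketch that computes both from per-task results. The `TaskResult` record and the sample numbers are invented for illustration; only the metric definitions come from the text above.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    f2p_passed: int         # number of F2P tests that passed
    f2p_total: int          # number of F2P tests in the task
    all_tests_passed: bool  # every F2P and P2P test passed


def resolved_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks in which all associated tests passed."""
    return sum(r.all_tests_passed for r in results) / len(results)


def passed_rate(results: list[TaskResult]) -> float:
    """Mean per-task fraction of F2P tests passed."""
    return sum(r.f2p_passed / r.f2p_total for r in results) / len(results)


# Invented results: one task fully solved, three only partly working.
results = [
    TaskResult(10, 10, True),
    TaskResult(5, 10, False),
    TaskResult(0, 10, False),
    TaskResult(9, 10, False),
]
print(f"Resolved Rate: {resolved_rate(results):.0%}")  # 25%
print(f"Passed Rate:   {passed_rate(results):.0%}")    # 60%
```

In this toy case the Passed Rate is more than double the Resolved Rate, the same pattern of partial progress without full completion that the metric is meant to surface.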

The automated, test-driven pipeline

The most novel part of FeatureBench is how the tasks are made. SWE-bench and similar benchmarks are pull-request driven: they scrape merged PRs and turn each into a task. The FeatureBench authors argue this approach cannot capture real features, because a single feature is often spread across multiple PRs scattered through a project's history, many PRs are untagged, and the resulting tasks are locked to whatever combinations developers happened to commit.^[1]

Instead, FeatureBench works backward from tests. Given a repository, the pipeline runs the following steps with almost no human input:^[1]

Stage	What happens
Environment setup	A maintainer specifies install commands (about three minutes per repo); scripts then build a Docker image. This is the only manual step, under one hour total for all 24 repositories.
Select tests	Using pytest collection, the system picks and validates F2P and P2P test files. Typically one file is the F2P target; five files are sampled as P2P.
Dynamic tracing	F2P and P2P tests run while Python's built-in tracing facility records every function call, producing an object dependency graph whose nodes carry a source location, dependency list, and a flag for whether the function ran during P2P tests.
Graph traversal	An LLM reads the F2P test file to separate the target feature's functions from test utilities. A breadth-first search then labels each reachable node: those also hit during P2P runs are "remained," those not seen in P2P are "extracted."
Code extraction	The "extracted" code is removed, yielding an undeveloped codebase plus a gold patch that re-implements the missing feature.
Post-verification	The stripped codebase must pass all P2P tests and fail all F2P tests; reapplying the gold patch must make everything pass again.
Problem statement	The interface signature and a functional description (from docstrings, or LLM-generated when absent) are assembled into the final prompt.
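The dynamic tracing stage can be approximated with `sys.settrace`, one of Python's built-in tracing hooks. This is a minimal sketch, not the authors' implementation: it keys nodes by function name only (the text says the real graph also stores source locations and a P2P flag), and the toy functions are made up.

```python
import sys
from collections import defaultdict


def trace_calls(func, *args, **kwargs):
    """Run func and record caller -> callee edges between Python functions."""
    edges = defaultdict(set)

    def tracer(frame, event, arg):
        if event == "call":
            callee = frame.f_code.co_name
            caller = frame.f_back.f_code.co_name if frame.f_back else "<root>"
            edges[caller].add(callee)
        return None  # skip per-line tracing inside the new frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args, **kwargs)
    finally:
        sys.settrace(previous)  # always restore the earlier trace function
    return dict(edges)


# Toy "feature" and test, invented for this example.
def normalize(text):
    return text.strip().lower()


def tokenize(text):
    return normalize(text).split()


def test_tokenize():
    assert tokenize(" A b ") == ["a", "b"]


graph = trace_calls(test_tokenize)
print(graph["test_tokenize"])  # {'tokenize'}
print(graph["tokenize"])       # {'normalize'}
```

Built-in methods such as `str.strip` do not appear in the graph because `sys.settrace` reports "call" events only for Python-level frames.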

The dependency-graph step is what makes the extraction safe. Naively deleting code linked to a feature tends to break unrelated functionality; using P2P tests to mark which functions are load-bearing for the rest of the repository lets the pipeline cut out exactly the target feature and nothing else. The authors position this against prior synthetic-data tools such as SWE-Gym, R2E-Gym, SWE-Smith, and SWE-Flow, noting that SWE-Flow uses F2P tests but ignores P2P tests, so it cannot guarantee the rest of the codebase still works.^[1]
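The labeling rule can be sketched as a breadth-first search over the dependency graph. This is an illustration under stated assumptions, not the paper's code: the graph is a plain dict of adjacency lists, the traversal continues through "remained" nodes, and the node names are hypothetical.

```python
from collections import deque


def label_nodes(graph, roots, p2p_hit):
    """BFS from the feature's entry points, splitting reachable nodes.

    Nodes also exercised by P2P tests are "remained" (kept in the codebase);
    nodes never seen in P2P runs are "extracted" (removed into the gold patch).
    """
    remained, extracted = set(), set()
    seen = set(roots)
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        (remained if node in p2p_hit else extracted).add(node)
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return remained, extracted


# Hypothetical dependency graph for a single feature.
graph = {
    "flash_attn": ["softmax_kernel", "check_shapes"],
    "softmax_kernel": ["tile_loop"],
    "check_shapes": [],
}
remained, extracted = label_nodes(graph, ["flash_attn"], p2p_hit={"check_shapes"})
print(sorted(remained))   # ['check_shapes']
print(sorted(extracted))  # ['flash_attn', 'softmax_kernel', 'tile_loop']
```

Here `check_shapes` stays in place because other tests rely on it, while the three functions used only by the target feature are cut out.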

Why it differs from SWE-bench

SWE-bench has become the default yardstick for AI coding agents, and its verified subset is treated as a standard. The FeatureBench authors make a pointed observation about it: within a year, top SWE-bench scores climbed from under 10% to over 70%, which they read partly as genuine progress and partly as a sign the benchmark no longer challenges frontier agents. They also note that SWE-bench is dominated by bug fixing, with only about 18 to 22% of its instances corresponding to feature requests.^[1]

FeatureBench is built to attack the gap left by that distribution. Bug fixing usually means a small, localized change with a clear failing test pointing at the problem. Feature development means writing substantial new functionality, often from a blank file, against a specification rather than a stack trace. The example tasks make the difference concrete: adapting the Transformers library for compatibility with Qwen3, or engineering FlashAttention from scratch. Because the pipeline is automated and tied to test commit dates (it keeps only tests first committed between May 2022 and September 2025), FeatureBench can be regenerated with newer repositories to fight data contamination, a problem static benchmarks struggle with once their tasks leak into training sets.^[1]

Results and difficulty

The paper evaluates a mix of closed- and open-weight models, each paired with the agent scaffold that suits it, and reports both a Full set (200 tasks) and a Lite set (30 randomly sampled tasks). The headline result is that no configuration resolves more than one in eight tasks on the Full set. The table below gives Resolved Rates on both splits.^[1]

Model and scaffold	Lite (30) resolved	Full (200) resolved
GPT-5.1-Codex (medium reasoning) + Codex	20.0%	12.5%
Claude Opus 4.5 + Claude Code	20.0%	11.0%
Claude Opus 4.5 + OpenHands	20.0%	10.5%
Gemini-3-Pro-Preview + OpenHands	10.0%	4.5%
Gemini-3-Pro-Preview + Gemini-CLI	10.0%	5.0%
DeepSeek-V3.2 + OpenHands	6.7%	5.5%
Qwen3-Coder-480B-A35B-Instruct + OpenHands	6.7%	3.5%

A few things stand out. The abstract's widely quoted comparison, Claude Opus 4.5 going from 74.4% on SWE-bench to 11.0% here, is real, but 11.0% is not the single best score; GPT-5.1-Codex edges it at 12.5% on the full set. (The abstract writes the name as "Claude 4.5 Opus," the tables as "Claude Opus 4.5"; both are the same model.) Open-weight models trail the two leading closed models, with DeepSeek-V3.2 at 5.5% and Qwen3-Coder at 3.5%, although DeepSeek-V3.2 edges past both Gemini-3-Pro-Preview configurations (4.5% and 5.0%). Passed Rates are far higher than Resolved Rates (Claude Opus 4.5 reaches a 43.29% Passed Rate against an 11.0% Resolved Rate), which tells you agents routinely get parts of a feature working without passing the complete test suite. Scores on the 30-task Lite set are higher than on the Full set for every configuration, from about 1.2 times (DeepSeek-V3.2) to about 2.2 times (Gemini-3-Pro-Preview + OpenHands), a reminder that small evaluation subsets can paint a more flattering picture.^[1]
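The percentages in the table are easier to read as task counts. The sketch below converts them, assuming each reported rate is a rounded fraction of 30 or 200 tasks; the counts are inferred from the table rather than taken from the paper.

```python
FULL_TASKS, LITE_TASKS = 200, 30

# (Lite %, Full %) resolved rates copied from the table above.
rates = {
    "GPT-5.1-Codex + Codex": (20.0, 12.5),
    "Claude Opus 4.5 + Claude Code": (20.0, 11.0),
    "Claude Opus 4.5 + OpenHands": (20.0, 10.5),
    "Gemini-3-Pro-Preview + OpenHands": (10.0, 4.5),
    "Gemini-3-Pro-Preview + Gemini-CLI": (10.0, 5.0),
    "DeepSeek-V3.2 + OpenHands": (6.7, 5.5),
    "Qwen3-Coder-480B-A35B-Instruct + OpenHands": (6.7, 3.5),
}


def implied_count(percent: float, n_tasks: int) -> int:
    """Nearest whole number of tasks matching a reported percentage."""
    return round(percent / 100 * n_tasks)


for name, (lite, full) in rates.items():
    print(
        f"{name}: Lite {implied_count(lite, LITE_TASKS)}/{LITE_TASKS}, "
        f"Full {implied_count(full, FULL_TASKS)}/{FULL_TASKS}, "
        f"Lite/Full ratio {lite / full:.2f}"
    )
```

On the Lite set a single task is worth about 3.3 percentage points, so the three leading configurations differ from the Gemini configurations by only three implied tasks there, which is one reason the Lite numbers are coarse.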

To check that the automated pipeline produces sane tasks, a senior engineer with five years of industry experience manually revised the prompts in the Lite set. The paper reports that model performance on the manually revised subset is highly consistent with the original, which the authors offer as evidence that the auto-generated problem statements are faithful.^[1]

Significance

FeatureBench lands at a moment when the standard agentic coding benchmark is close to saturated, and it reopens a lot of headroom. Resolved rates of 11 to 12.5% for the strongest agents tested reset expectations: strong performance on issue-level bug fixing does not transfer to building features end to end, where an agent has to hold a large specification in mind, write hundreds of lines across many files, and avoid breaking everything else in the repository. Whether scores this low reflect a real capability gap or partly an artifact of very long, interface-constrained prompts is the obvious open question, and the Passed Rate numbers suggest agents are closer than the Resolved Rate alone implies.^[1]

Beyond the leaderboard, the bigger contribution may be the toolkit. Because every task ships with a verified, executable environment and the pipeline is automated, the same machinery that grades agents can generate training data for them. The authors note that the inherent verifiability of the constructed environments makes the method potentially valuable for agent training, positioning FeatureBench as both an evaluation benchmark and a data engine. The code, dataset, and project page are released publicly under the LiberCoders organization.^[1]^[3]^[4]

References

1. Zhou, Qixing; Zhang, Jiacheng; Wang, Haiyang; Hao, Rui; Wang, Jiahe; Han, Minghao; Yang, Yuxue; Wu, Shuzhe; Pan, Feiyang; Fan, Lue; Tu, Dandan; Zhang, Zhaoxiang. "FeatureBench: Benchmarking Agentic Coding for Complex Feature Development." arXiv:2602.10975, 11 February 2026. Full text: arxiv.org/html/2602.10975v1.
2. "FeatureBench: Benchmarking Agentic Coding for Complex Feature Development." Hugging Face Papers. Accessed June 2026.
3. LiberCoders. "FeatureBench project page: Beyond bug fixing and ship real features." Accessed June 2026.
4. LiberCoders. "FeatureBench (GitHub repository)." Accessed June 2026.
