Vibe Code Bench
Last reviewed
Jun 3, 2026
Sources
5 citations
Review status
Source-backed
Revision
v1 · 1,701 words
Vibe Code Bench (VCB) is a benchmark that measures whether AI models can build a complete, deployable web application from nothing but a written specification. It was introduced in a March 2026 paper by researchers at Vals AI together with Alex Gu of MIT, and it consists of 100 application specifications paired with 964 browser-based workflows that an autonomous browser agent runs against each app the model produces. Across 16 frontier models, the best, GPT-5.3-Codex, reached only 61.8% accuracy on the held-out test split, which the authors take as evidence that reliable end-to-end app development is still an unsolved problem.[1][2]
The benchmark is named for vibe coding, the practice of describing software in natural language and letting an AI coding assistant or agent generate and wire up the code. Where most coding benchmarks ask a model to patch one bug in an existing repository, VCB asks it to start from an empty workspace and ship something a person could log into and use.[1][3]
Each task in Vibe Code Bench gives a model a single product specification, the kind a founder or product manager might hand to an engineer, and asks it to deliver a working application. The 100 specifications are split into a 50-app public validation set and a 50-app held-out test set, and they are grouped into three domains: 24 personal or "Individual" apps, 45 "Solo Founder" products, and 31 "Enterprise Tool" apps. Roughly a quarter of the apps (28 of 100) require integrating an external service such as payments or email, which raises the difficulty considerably.[1]
The model does not just write code into a vacuum. It works inside an isolated container built on a modified version of the OpenHands agent framework, using a Docker-in-Docker setup that gives it a full workspace. It gets a terminal for installing dependencies and running builds, a web browser for reading documentation and testing its own app, and access to production-style services: a Supabase backend for the database, authentication, and object storage, MailHog for SMTP email, and Stripe in sandbox mode for payments. In total the agent is handed 22 tools, including bash execution, a file editor, a browser, a task tracker, eleven Supabase calls, and the Tavily web-search and crawl tools. The required stack is fixed: a React frontend built with Vite and Tailwind CSS, a Supabase backend, and a Docker Compose deployment with a standard entry point.[1]
Crucially, the model is given a long leash. Each application has a five-hour wall-clock budget, so VCB is explicitly testing multi-hour agent sessions rather than single-shot generation. This matters because building a real app is not one decision but hundreds of them strung together: scaffolding the project, designing a schema, writing routes and components, debugging the build, then checking that the login flow and the checkout actually work.[1]
Once a model finishes, the app is evaluated the way a user would experience it, through the browser, not by reading its source. The harness first verifies that the app deploys cleanly via Docker Compose. If deployment fails, every workflow for that app is marked failed. For apps that come up, an autonomous browser agent (Browser Use with vision enabled, driven by Claude Sonnet 4.5 as the canonical evaluator) opens a fresh 1920x1200 headless session for each workflow and performs the user actions, up to 100 agent steps per workflow.[1]
A workflow is a scripted user journey such as "sign up, create a project, invite a teammate." Each workflow is broken into substeps, and the 964 workflows across the benchmark contain 10,131 substeps in total (the validation split has 491 workflows and 4,995 substeps; the test split has 473 workflows and 5,136 substeps, at 6 to 23 workflows per app). An LLM judge issues a pass or fail on each substep, a workflow counts as passing only when at least 90% of its substeps succeed, and an app's final accuracy is the percentage of its workflows that pass. Every workflow runs in its own browser session with unique account and data values so that one workflow cannot contaminate another.[1]
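The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: the `Workflow` class, the function names, and the sample verdicts are all hypothetical, and treating a workflow with no substeps as failed is an assumption. Only the 90% threshold, the per-app percentage, and the rule that a failed deployment fails every workflow come from the description here.

```python
from dataclasses import dataclass
from typing import List

# From the article: a workflow passes when at least 90% of its substeps succeed.
PASS_THRESHOLD_PERCENT = 90


@dataclass
class Workflow:
    """Hypothetical container for one scripted user journey."""
    name: str
    substep_results: List[bool]  # one pass/fail judge verdict per substep


def workflow_passes(wf: Workflow) -> bool:
    """Return True when at least 90% of the substeps passed."""
    total = len(wf.substep_results)
    if total == 0:
        return False  # assumption: a workflow with no substeps counts as failed
    passed = sum(wf.substep_results)
    # Integer comparison avoids floating-point edge cases at exactly 90%.
    return passed * 100 >= PASS_THRESHOLD_PERCENT * total


def app_accuracy(workflows: List[Workflow], deployed: bool = True) -> float:
    """Percentage of workflows that pass; 0.0 if the app failed to deploy."""
    if not deployed or not workflows:
        return 0.0
    passing = sum(workflow_passes(wf) for wf in workflows)
    return 100.0 * passing / len(workflows)


# Made-up verdicts for a three-workflow app.
demo = [
    Workflow("sign up", [True] * 9 + [False]),       # 9/10 = 90%, passes
    Workflow("checkout", [True] * 8 + [False] * 2),  # 8/10 = 80%, fails
    Workflow("invite teammate", [True] * 20),        # 20/20, passes
]
print(f"{app_accuracy(demo):.1f}")  # prints 66.7
```

Note how strict the threshold is in practice: a ten-substep workflow tolerates one failed substep, and anything shorter tolerates none.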
This "does the feature actually work in a browser" approach is what makes VCB hard to game. A model cannot earn credit for code that merely looks structurally correct, because the grade comes from clicking through the running product.[1][2]
The motivation behind Vibe Code Bench is that the standard agentic-coding benchmark, SWE-bench and its Verified variant, has largely saturated. By early 2026 the top models were clustered in a narrow band near 80% on SWE-bench Verified, with the leaders separated by fractions of a point, and OpenAI's own evaluations team had flagged that models could sometimes reproduce gold patches verbatim from a task ID, a sign of contamination. When everyone scores about the same, the benchmark stops telling you who is actually better.[4][5]
VCB spreads models out far more. The paper reports that the gap between MiniMax M2.5 and Claude Opus 4.6 is just 2.8 percentage points on SWE-bench but 42.7 points on Vibe Code Bench. The authors attribute this wider spread to three properties of the task. It is "zero to one": building a whole app from scratch surfaces capability gaps that a one-file patch hides. It is multi-file and multi-service: configuration, deployment, and live integrations with Stripe and email expose skills that isolated code edits never touch. And it is implementation-agnostic: because grading happens through the browser, syntactic compliance is not enough. Difficulty inside the benchmark also varies sharply; GPT-5.3-Codex scores 81.9% on the easiest tasks but only 13.1% on the hardest, and apps requiring both payments and email drop it from about 71% to 30%.[1][2]
The 16 models were evaluated on the held-out test split. The table below lists the reported accuracy with standard error, along with median cost and latency per app.[1]
| Rank | Model | Accuracy ± SE (%) | Median cost per app | Median latency per app |
|---|---|---|---|---|
| 1 | GPT-5.3-Codex | 61.77 ± 4.71 | $11.91 | 75.8m |
| 2 | Claude Opus 4.6 | 57.57 ± 4.37 | $8.69 | 21.3m |
| 3 | GPT-5.2 | 53.50 ± 5.07 | $17.75 | 82.9m |
| 4 | Claude Opus 4.6 Thinking | 53.50 ± 4.68 | $8.28 | 23.1m |
| 5 | Claude Sonnet 4.6 | 51.48 ± 4.64 | $5.91 | 26.2m |
| 6 | GPT-5.2-Codex | 37.91 ± 4.58 | $4.15 | 32.2m |
| 7 | Gemini 3.1 Pro | 32.03 ± 4.34 | $3.83 | 20.2m |
| 8 | GLM-5 | 23.36 ± 4.03 | $40.27 | 224.3m |
| 9 | Gemini 3 Flash | 20.20 ± 3.95 | $0.94 | 13.4m |
| 10 | Kimi-K2.5 Thinking | 17.54 ± 3.26 | $0.88 | 42.8m |
| 11 | Qwen 3.5 Plus Thinking | 15.74 ± 3.18 | $3.80 | 50.3m |
| 12 | MiniMax M2.5 | 14.85 ± 2.95 | $2.20 | 51.1m |
| 13 | GPT-5 Mini | 14.17 ± 3.54 | $0.25 | 11.6m |
| 14 | Claude Haiku 4.5 Thinking | 11.39 ± 3.13 | $1.31 | 12.9m |
| 15 | DeepSeek V3.2 Thinking | 5.11 ± 2.13 | $2.47 | 56.1m |
| 16 | Grok 4.1 Fast Reasoning | 1.20 ± 1.20 | $0.21 | 8.8m |
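As a reading aid for the ± SE column, the sketch below reproduces the 42.7-point gap between Claude Opus 4.6 and MiniMax M2.5 from the table and expresses the gap between the top two models in units of their combined standard error. The combination rule (root sum of squares) assumes the two errors are independent, which is an assumption of this sketch and not something the paper states; because every model was run on the same apps, a paired comparison could give a different answer. The function names are illustrative.

```python
import math

# (accuracy %, standard error) copied from the table above.
RESULTS = {
    "GPT-5.3-Codex": (61.77, 4.71),
    "Claude Opus 4.6": (57.57, 4.37),
    "MiniMax M2.5": (14.85, 2.95),
}


def gap(model_a: str, model_b: str) -> float:
    """Accuracy difference in percentage points (a minus b)."""
    return RESULTS[model_a][0] - RESULTS[model_b][0]


def gap_in_standard_errors(model_a: str, model_b: str) -> float:
    """Gap divided by the combined SE, assuming independent errors."""
    combined_se = math.hypot(RESULTS[model_a][1], RESULTS[model_b][1])
    return gap(model_a, model_b) / combined_se


print(f"{gap('Claude Opus 4.6', 'MiniMax M2.5'):.1f}")  # prints 42.7
print(f"{gap_in_standard_errors('GPT-5.3-Codex', 'Claude Opus 4.6'):.2f}")
```

Under that independence assumption, the lead of GPT-5.3-Codex over Claude Opus 4.6 is smaller than one combined standard error, so the ordering of the top two should be read with some caution.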
The headline number, 61.8% for the leader, looks less like a victory once you read it as the paper intends: even the best model fails nearly two in five workflows on average. A live leaderboard maintained by Vals AI has since tracked newer releases, with Claude Opus 4.7 reported at 71.00%, GPT-5.5 at 69.85%, and GPT-5.4 at 67.42% on a public snapshot, but even at those scores roughly three in ten workflows still fail.[2]
The strongest behavioral finding is about self-testing. Models that used their browser to check their own work as they built it scored substantially higher, with a Pearson correlation of r=0.72 between self-testing and accuracy, a relationship that held even after controlling for how long generation took. By contrast, sheer edit volume barely correlated with success (r=0.09). In other words, the models that win are not the ones that type the most code; they are the ones that stop and verify whether the thing they just built actually works.[1]
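For readers who want to see what these statistics involve, here is a minimal sketch of a Pearson correlation plus a first-order partial correlation, which is one common way to control for a third variable such as generation time. The source does not say which method the authors used for that control, so the partial-correlation step is an assumption, and the per-model numbers below are invented purely for illustration.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x: Sequence[float], y: Sequence[float],
                        z: Sequence[float]) -> float:
    """Correlation of x and y after removing the linear effect of z.

    Undefined (division by zero) if z is perfectly correlated with x or y.
    """
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Invented per-model values, NOT data from the paper.
self_test_actions = [2, 5, 9, 14, 20, 31]      # browser checks per app
accuracy = [8.0, 15.0, 21.0, 35.0, 50.0, 58.0]  # accuracy in percent
generation_minutes = [12, 40, 22, 55, 25, 70]   # wall-clock time per app

print(round(pearson(self_test_actions, accuracy), 2))
print(round(partial_correlation(self_test_actions, accuracy,
                                generation_minutes), 2))
```

If the partial correlation stays close to the raw correlation, the relationship is not simply an artifact of longer-running sessions having more time to do everything, testing included.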
Because so much of the grading is done by an LLM judge, the authors ran a human alignment study, collecting 1,401 unique substep judgments across 18 applications. Human annotators agreed with each other 88.6% to 93.6% of the time. Model judges varied far more. Claude Sonnet 4.5 reached 86.4% agreement with humans, whereas GPT-5.2 as a judge agreed with humans only 36.1% of the time, and overall pairwise step-level agreement ranged from 31.8% to 93.6% depending on the evaluator pair. The choice of judge materially changes the scores, which is why VCB fixes Claude Sonnet 4.5 as its canonical evaluator and ships an alignment protocol alongside the benchmark.[1][2]
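Step-level agreement of this kind can be computed as the share of substeps on which two evaluators give the same verdict. The sketch below assumes plain percent agreement (no chance correction such as Cohen's kappa), since the source reports percentages without naming the metric; the evaluator labels and verdicts are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def agreement_rate(a: List[bool], b: List[bool]) -> float:
    """Percentage of substeps on which two evaluators give the same verdict."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty verdict lists of equal length")
    matches = sum(1 for va, vb in zip(a, b) if va == vb)
    return 100.0 * matches / len(a)


def pairwise_agreement(
    verdicts: Dict[str, List[bool]]
) -> Dict[Tuple[str, str], float]:
    """Agreement rate for every unordered pair of evaluators."""
    return {
        (name_a, name_b): agreement_rate(verdicts[name_a], verdicts[name_b])
        for name_a, name_b in combinations(sorted(verdicts), 2)
    }


# Hypothetical pass/fail verdicts on the same eight substeps.
verdicts = {
    "human_1": [True, True, False, True, False, True, True, False],
    "human_2": [True, True, False, True, True, True, True, False],
    "judge_x": [True, False, False, False, False, True, False, False],
}

for pair, rate in pairwise_agreement(verdicts).items():
    print(pair, f"{rate:.1f}%")
```

Comparing each model judge against the human-human range gives a practical ceiling: a judge cannot meaningfully be expected to agree with people more often than people agree with each other.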
Vibe Code Bench arrived as the field was looking for harder, less saturated tests of agentic coding. Its contribution is to evaluate the full loop a vibe coder cares about, spec to deployed app to a real user clicking through it, rather than the narrow issue-to-patch task that defined the previous generation of benchmarks. The results push back on the impression, easy to form from saturated leaderboards, that AI has nearly solved software engineering. Building a complete, multi-feature web app that works end to end is something even the best 2026 models get right only about three times out of five, and the apps that involve money and email are harder still.[1][2][3]