ZeroBench
Last reviewed
May 31, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 2,071 words
ZeroBench is a visual reasoning benchmark built to be effectively impossible for the frontier large multimodal models available at its release, all of which scored 0.0% on its main questions under greedy decoding. It pairs a small main set of 100 manually curated, very hard visual questions with 334 easier subquestions that break those problems into intermediate steps. The benchmark was introduced in 2025 by Jonathan Roberts and a large group of collaborators in the paper "ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models" (arXiv:2502.09696) [1]. Roberts is a PhD student at the University of Cambridge, and the project grew out of a frustration that mainstream multimodal benchmarks were saturating faster than they could be replaced [1][7].
The name captures the design goal. Where most benchmarks report how far ahead the leading model is, ZeroBench reports zero: at release, none of the 20 models the authors tested could answer a single main question reliably [1][2]. That makes it less a leaderboard and more a stake in the ground, a deliberately overshot target meant to stay informative while computer vision systems keep improving.
The motivation is a familiar problem with a sharp new edge. As multimodal models have improved, the popular tests used to measure them have been solved almost as quickly as they appear. MMMU, the college-level multi-discipline benchmark released in 2023, was meant to probe expert reasoning across art, medicine, science, and engineering. Within roughly a year, GPT-4o reached about 69% on it, and by 2026 the leading frontier models were clearing 80% on the harder MMMU-Pro variant, with the field separated by only a few points [3][8]. When the gap between the best models shrinks to noise, the benchmark stops telling you which model is actually better.
This is the headroom problem. A benchmark is useful only while there is room left to climb. Once scores cluster near the ceiling, small differences get drowned out by prompt formatting, answer parsing, and luck, and the test no longer drives progress. Each new model release shortens the useful life of the benchmarks built for the previous generation, so researchers keep paying the cost of building fresh ones [1].
Roberts and colleagues argue there is a standing need for tests that stay relevant for longer, which in practice means tests with a lot of headroom from the start. Their bet is blunt: rather than aim for a benchmark that frontier models find merely hard, aim for one they cannot do at all, then watch how the floor rises over time. A benchmark on which every model starts at zero, they note, cannot be saturated on day one, which buys it a longer shelf life [1][2].
ZeroBench is small on purpose. The main set has exactly 100 questions, each one written and checked by hand. More than 20 human question creators contributed candidates, and the authors started from a larger pool of about 140 before filtering down to the final 100 [1][2]. Keeping the set tiny is a feature: a 100-item benchmark is cheap to run, even against slow and expensive reasoning models, which matters when each evaluation may involve multiple samples per question.
The questions span many domains, reasoning types, and image styles. The released set breaks down into 93 single-image and 7 multi-image questions, with 31 built from synthetic images and 69 from natural photographs [1]. The problems lean heavily on careful looking rather than outside knowledge: reading cluttered scenes, counting objects, tracing paths, comparing sizes, and working out spatial relationships that a person can resolve with patience but a model tends to fumble.
Making a question genuinely impossible turned out to be harder than it sounds. The authors calibrated candidates against strong reasoning models during review, using o1 pro and QVQ as adversarial baselines, and any question a baseline answered correctly was either cut or made harder [1]. Many items went through several rounds of sharpening before they reliably defeated the models. The result is a set tuned to sit just past the current frontier, not absurdly difficult for its own sake.
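The review loop described above can be pictured as a simple filter. The sketch below is illustrative only: the question format, the `normalize` helper, and the stub baseline are all hypothetical, and the authors' actual tooling and answer matching are not described here.

```python
from typing import Callable, Iterable

# Hypothetical types: a question is a dict with "prompt" and "answer";
# a baseline is any callable that maps a prompt to a predicted answer.
Question = dict
Baseline = Callable[[str], str]


def normalize(text: str) -> str:
    """Crude answer normalization (an assumption, not the paper's parser)."""
    return text.strip().lower()


def filter_candidates(
    candidates: Iterable[Question], baselines: Iterable[Baseline]
) -> tuple[list[Question], list[Question]]:
    """Split candidates into (kept, flagged).

    A candidate is flagged for removal or revision as soon as any
    baseline reproduces its reference answer.
    """
    baselines = list(baselines)
    kept, flagged = [], []
    for question in candidates:
        solved = any(
            normalize(model(question["prompt"])) == normalize(question["answer"])
            for model in baselines
        )
        (flagged if solved else kept).append(question)
    return kept, flagged


# Toy usage with a stub baseline that always answers "4".
candidates = [
    {"prompt": "How many red mugs are on the shelf?", "answer": "7"},
    {"prompt": "What is 2 + 2?", "answer": "4"},
]
kept, flagged = filter_candidates(candidates, [lambda prompt: "4"])
```

In this toy run the arithmetic question is flagged because the stub baseline answers it, while the counting question survives. A real pass would send flagged items back for another round of sharpening instead of simply discarding them.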
The 334 subquestions are the other half of the design. Each main question is decomposed into the intermediate steps needed to solve it, an average of roughly 3.3 subquestions per main question [1][2]. A main question might ask for a final count or a derived measurement; its subquestions ask for the smaller observations along the way, such as identifying an object or reading one value off a chart. These pieces are easier, and because models can sometimes answer them, the subquestions give the benchmark a continuous signal even when the main set returns a flat zero.
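One way to picture the decomposition is as a main question that carries its own list of subquestions, scored separately. The sketch below assumes exact-match grading and pools all subquestions into one accuracy figure. Both choices, along with every class and field name, are illustrative assumptions and not the benchmark's released schema or scoring code.

```python
from dataclasses import dataclass, field


@dataclass
class SubQuestion:
    prompt: str
    answer: str


@dataclass
class MainQuestion:
    prompt: str
    answer: str
    subquestions: list[SubQuestion] = field(default_factory=list)


def score(
    questions: list[MainQuestion],
    main_preds: list[str],
    sub_preds: list[list[str]],
) -> dict[str, float]:
    """Return main-set and pooled subquestion accuracy (exact match)."""
    main_hits = [p.strip() == q.answer for q, p in zip(questions, main_preds)]
    sub_hits = [
        p.strip() == s.answer
        for q, preds in zip(questions, sub_preds)
        for s, p in zip(q.subquestions, preds)
    ]
    return {
        "main_accuracy": sum(main_hits) / max(len(main_hits), 1),
        "sub_accuracy": sum(sub_hits) / max(len(sub_hits), 1),
    }


# Toy example: the model gets two of three steps right but misses the total.
question = MainQuestion(
    prompt="What is the total price of the items on the top shelf?",
    answer="42",
    subquestions=[
        SubQuestion("How many items are on the top shelf?", "3"),
        SubQuestion("What is the price on the leftmost tag?", "12"),
        SubQuestion("What is the price on the rightmost tag?", "20"),
    ],
)
result = score([question], main_preds=["40"], sub_preds=[["3", "12", "18"]])
```

Here the main accuracy is zero while the subquestion accuracy is two thirds, which mirrors the pattern the benchmark is designed to expose: partial credit on the steps, none on the whole.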
The central finding is stark. The authors evaluated 20 large multimodal models, spanning proprietary and open-weight systems as well as dedicated reasoning models such as o1, o1 pro, Gemini 2 Flash Thinking, and QVQ. Under greedy decoding, every one of them scored 0.0% pass@1 on the 100 main questions [1][2]. Allowing several stochastic samples per question barely moved the needle: the best model managed only a handful of correct answers across all five tries, and on the strict 5/5 reliability metric, where a model must get the same question right in every one of five samplings, no model answered a single main question consistently [1][2].
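The three metrics in this paragraph differ only in how they aggregate per-sample correctness. The following is a minimal sketch, assuming each response has already been graded as right or wrong. The function names and the toy data are hypothetical and are not taken from the released evaluation code.

```python
def pass_at_1(greedy: list[bool]) -> float:
    """Percentage of questions answered correctly in one greedy run."""
    return 100.0 * sum(greedy) / len(greedy)


def pass_at_k(samples: list[list[bool]]) -> float:
    """Percentage of questions with at least one correct sample out of k."""
    return 100.0 * sum(any(row) for row in samples) / len(samples)


def k_of_k_reliability(samples: list[list[bool]]) -> float:
    """Percentage of questions answered correctly in every one of k samples."""
    return 100.0 * sum(all(row) for row in samples) / len(samples)


# Toy data: 4 questions, 5 stochastic samples each (rows must be non-empty).
samples = [
    [False, False, False, False, False],
    [True, False, False, False, False],
    [False, False, False, False, False],
    [True, True, True, True, False],
]
greedy = [False, False, False, False]

p1 = pass_at_1(greedy)
p5 = pass_at_k(samples)
r5 = k_of_k_reliability(samples)
```

With this toy data, pass@1 is 0.0, pass@5 is 50.0, and 5/5 reliability is 0.0: two questions are solved at least once, but none is solved every time. That is the distinction the reliability metric is meant to draw between an occasional hit and a dependable answer.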
The subquestions tell a more graded story, which is exactly what they were added for. The strongest model on the subquestions was Claude 3.5 Sonnet v2, which reached 24.30% pass@1 by answering 81 of the 334 correctly. Gemini 2 Flash and o1 pro followed closely at roughly 22%, with GPT-4o at roughly 21% [1]. So the leading systems can handle roughly a quarter of the easier intermediate steps yet still fail to chain them into a single complete answer on the main set. The gap between solving the parts and solving the whole is a large part of what ZeroBench is measuring.
| Metric or property | Best result or value | Detail |
|---|---|---|
| Main questions, pass@1 (greedy) | 0.0% | All 20 evaluated models |
| Main questions, 5/5 reliability | 0.0% | No question answered correctly in all 5 samples by any model |
| Subquestions, pass@1 | 24.30% (81/334) | Claude 3.5 Sonnet v2 |
| Subquestions, pass@1 (runners-up) | ~22%, ~22%, ~21% | Gemini 2 Flash, o1 pro, GPT-4o |
| Main set size | 100 questions | Manually curated |
| Subquestion set size | 334 questions | Avg ~3.3 per main question |
A flat zero would be uninformative on its own, so the authors looked closely at how the models fail. The errors are heavily skewed toward visual interpretation rather than logic or missing knowledge [1]. The models are not, for the most part, reasoning badly about what they see; they are seeing it wrong in the first place. Common failure modes include miscounting objects, an inability to resolve fine-grained detail, and trouble with spatial relations [1]. A model will confidently describe a coherent line of reasoning built on an initial reading of the image that is simply mistaken.
That diagnosis matters because it points at the bottleneck. If the failures were mostly logical, better reasoning techniques or longer chains of thought might close the gap. Because they are mostly perceptual, the harder limit appears to be how well these systems actually parse an image, especially when the answer depends on fine detail that is visible only at high resolution. The authors note that a breakthrough allowing significantly higher-resolution visual input could be one of the larger levers for raising scores on ZeroBench [1]. In other words, the benchmark is less a test of thinking and more a test of looking, and current models are weaker at the looking.
The pairing of "lightweight" and "impossible" is the benchmark's signature, and the two properties reinforce each other. Lightweight means the whole main set is 100 questions, so even costly reasoning models can be run end to end without a large compute bill, and the subquestions add only a few hundred more items [1][2]. Impossible means the main set returns 0.0% from every frontier model at release, giving the benchmark the maximum possible headroom and protecting it from immediate saturation [2].
The authors also describe the set as diverse and high-quality, since each question was hand-written and checked rather than scraped or templated [1]. Quality control continued after release. In the first week the team ran a public red-teaming effort to surface mistakes in the questions and answers, acknowledging that some errors likely remain in a hand-built set of this kind [2][4]. The data, code, and red-teaming materials were released openly so others can probe and extend the benchmark [4][6].
The asterisk in the paper's own title, "An Impossible* Visual Benchmark," is a wink at the obvious caveat: impossible today does not mean impossible forever. The claim is about contemporary models at the moment of release, not a permanent ceiling [1][6].
ZeroBench is best understood as a reaction to the benchmarks it sits beside. MMMU and its tougher follow-up MMMU-Pro test broad, expert-level knowledge across many subjects, with thousands of questions drawn from exams and textbooks [3]. They are large and comprehensive, and that breadth is exactly why they saturate: once a model has absorbed enough of the world's text and diagrams, much of the set becomes answerable, and scores pile up near the top [8]. Other visual benchmarks that came before, covering chart reading, document understanding, and general multimodal question answering, followed the same arc.
ZeroBench inverts several of those choices. It is deliberately small rather than comprehensive, it targets perception and multi-step visual reasoning rather than encyclopedic knowledge, and it is calibrated to defeat the best models rather than to be answerable by good ones. It does not try to replace MMMU as a measure of broad competence. It serves a narrower purpose: a high-headroom stress test that stays discriminative precisely in the regime where the big knowledge benchmarks have gone quiet. The two are complementary, one mapping general breadth and the other marking the current frontier of hard visual reasoning [1][3].
The authors are candid about the trade-offs. The most obvious is size. With only 100 main questions, the benchmark gives a coarse signal, and the subquestions exist largely to recover finer resolution; a model that improves by a few questions moves the headline number by whole percentage points [1][2]. A small hand-curated set is also more exposed to residual errors in individual questions, which is why the red-teaming pass was part of the release [4].
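The coarseness can be made concrete with a little arithmetic. The sketch below is an illustration and not an analysis from the paper: it computes the step size of the headline score and uses a standard Wilson score interval, chosen here as an assumption, to show how much uncertainty surrounds a result on 100 questions.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion correct/total."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


MAIN_SET_SIZE = 100

# Each additional correct main question moves the headline score by this much.
step_points = 100 / MAIN_SET_SIZE

# Uncertainty around an observed 0 out of 100.
low, high = wilson_interval(0, MAIN_SET_SIZE)
```

Under these assumptions, one question is worth a full percentage point, and an observed score of 0 out of 100 is statistically compatible with an underlying success rate of up to about 3.7%. Small movements near zero should therefore be read with some caution.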
The deeper limitation is built into the premise. A benchmark calibrated to be impossible for today's models is, by construction, a moving target, and the authors expect scores to climb as models gain better visual resolution and stronger reasoning [1]. The 0.0% headline is a snapshot, not a property of the task, and it will erode the moment a new model learns to see a little more carefully. There is also a practical worry that publishing the questions and answers openly can shorten a benchmark's life if the data leaks into training sets, a hazard ZeroBench shares with every public benchmark [1]. Its makers seem to accept all of this. The point was never a permanent wall but a temporary one set high enough to be useful while it lasts, and a clear way to watch the floor rise underneath it.