ZeroBench

Updated May 31, 2026

ZeroBench is a visual reasoning benchmark built to be effectively impossible for the frontier large multimodal models available at its release, all of which scored 0.0% on its main questions. It pairs a small main set of 100 manually curated, very hard visual questions with 334 easier subquestions that break those problems into intermediate steps. The benchmark was introduced in 2025 by Jonathan Roberts and a large group of collaborators in the paper "ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models" (arXiv:2502.09696) ^[1]. Roberts is a PhD student at the University of Cambridge, and the project grew out of frustration that mainstream multimodal benchmarks were saturating faster than they could be replaced ^[1]^[7].

The name captures the design goal. Where most benchmarks report how far ahead the leading model is, ZeroBench reports zero: at release, none of the 20 models the authors tested could answer a single main question reliably ^[1]^[2]. That makes it less a leaderboard and more a stake in the ground, a deliberately overshot target meant to stay informative while computer vision systems keep improving.

Why another visual benchmark

The motivation is a familiar problem with a sharp new edge. As multimodal models have improved, the popular tests used to measure them have been solved almost as quickly as they appear. MMMU, the college-level multi-discipline benchmark released in 2023, was meant to probe expert reasoning across art, medicine, science, and engineering. Within roughly a year, GPT-4o reached about 69% on it, and by 2026 the leading frontier models were clearing 80% on the harder MMMU-Pro variant, with the field separated by only a few points ^[3]^[8]. When the gap between the best models shrinks to noise, the benchmark stops telling you which model is actually better.

This is the headroom problem. A benchmark is useful only while there is room left to climb. Once scores cluster near the ceiling, small differences get drowned out by prompt formatting, answer parsing, and luck, and the test no longer drives progress. Each new model release shortens the useful life of the benchmarks built for the previous generation, so researchers keep paying the cost of building fresh ones ^[1].

Roberts and colleagues argue there is a standing need for tests that stay relevant for longer, which in practice means tests with a lot of headroom from the start. Their bet is blunt: rather than aim for a benchmark that frontier models find merely hard, aim for one they cannot do at all, then watch how the floor rises over time. A benchmark on which every model starts at zero, they note, cannot be saturated on day one, which buys it a longer shelf life ^[1]^[2].

How the questions were built

ZeroBench is small on purpose. The main set has exactly 100 questions, each one written and checked by hand. More than 20 human question creators contributed candidates, and the authors started from a larger pool of about 140 before filtering down to the final 100 ^[1]^[2]. Keeping the set tiny is a feature: a 100-item benchmark is cheap to run, even against slow and expensive reasoning models, which matters when each evaluation may involve multiple samples per question.

The questions span many domains, reasoning types, and image styles. The released set breaks down into 93 single-image and 7 multi-image questions, with 31 built from synthetic images and 69 from natural photographs ^[1]. The problems lean heavily on careful looking rather than outside knowledge: reading cluttered scenes, counting objects, tracing paths, comparing sizes, and working out spatial relationships that a person can resolve with patience but a model tends to fumble.

Making a question genuinely impossible turned out to be harder than it sounds. The authors calibrated candidates against strong reasoning models during review, using o1 pro and QVQ as adversarial baselines, and any question a baseline answered correctly was either cut or made harder ^[1]. Many items went through several rounds of sharpening before they reliably defeated the models. The result is a set tuned to sit just past the current frontier, not absurdly difficult for its own sake.
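The filtering step described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' tooling: `Candidate`, `Baseline`, and `filter_candidates` are hypothetical names, the baselines are stand-in functions rather than real model calls, and answers are compared by exact string match. The option of revising a question to make it harder is reduced here to flagging it.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


# Hypothetical types for illustration; not part of the ZeroBench release.
@dataclass
class Candidate:
    question: str
    answer: str


# A baseline maps a question to a predicted answer (stand-in for a model call).
Baseline = Callable[[str], str]


def filter_candidates(
    candidates: List[Candidate], baselines: List[Baseline]
) -> Tuple[List[Candidate], List[Candidate]]:
    """Split candidates into those no baseline solves (kept) and those
    at least one baseline answers correctly (flagged to cut or revise)."""
    kept: List[Candidate] = []
    flagged: List[Candidate] = []
    for cand in candidates:
        solved = any(b(cand.question).strip() == cand.answer for b in baselines)
        (flagged if solved else kept).append(cand)
    return kept, flagged


# Toy baselines: one always answers "7", the other always answers "12".
baselines: List[Baseline] = [lambda q: "7", lambda q: "12"]
pool = [
    Candidate("How many red cars are visible?", "7"),
    Candidate("How many steps are on the staircase?", "23"),
]
kept, flagged = filter_candidates(pool, baselines)
print(f"kept={len(kept)} flagged={len(flagged)}")
```

In a real review loop the flagged items would go back to their authors for another round of sharpening and then be re-tested.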

The 334 subquestions are the other half of the design. Each main question is decomposed into the intermediate steps needed to solve it, an average of roughly 3.3 subquestions per main question ^[1]^[2]. A main question might ask for a final count or a derived measurement; its subquestions ask for the smaller observations along the way, such as identifying an object or reading one value off a chart. These pieces are easier, and because models can sometimes answer them, the subquestions give the benchmark a continuous signal even when the main set returns a flat zero.
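A small sketch can show why the two-level layout still yields a signal when the main score is zero. The class and function names below are hypothetical, the data is invented, and pooling all subquestions into one accuracy figure is an assumption; the paper may aggregate differently.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical record types; field names are illustrative only.
@dataclass
class SubQuestion:
    text: str
    correct: bool  # did the model's answer match the reference?


@dataclass
class MainQuestion:
    text: str
    correct: bool
    subquestions: List[SubQuestion] = field(default_factory=list)


def main_accuracy(items: List[MainQuestion]) -> float:
    """Fraction of main questions answered correctly."""
    return sum(q.correct for q in items) / len(items)


def sub_accuracy(items: List[MainQuestion]) -> float:
    """Fraction of all subquestions answered correctly, pooled across
    main questions (an assumed aggregation)."""
    subs = [s for q in items for s in q.subquestions]
    return sum(s.correct for s in subs) / len(subs)


# Invented results: both main questions fail, one intermediate step succeeds.
results = [
    MainQuestion(
        "Total count of items on all shelves?",
        correct=False,
        subquestions=[
            SubQuestion("How many shelves are there?", True),
            SubQuestion("How many items on the top shelf?", False),
            SubQuestion("How many items on the bottom shelf?", False),
        ],
    ),
    MainQuestion(
        "What is the sum of the values shown on the dial?",
        correct=False,
        subquestions=[SubQuestion("What value does the needle point to?", False)],
    ),
]

main_score = main_accuracy(results)
sub_score = sub_accuracy(results)
print(f"main={main_score:.2f} sub={sub_score:.2f}")
```

Here the main score is flat at zero while the subquestion score is not, which is the graded signal the decomposition is meant to provide.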

The near-zero results

The central finding is stark. The authors evaluated 20 large multimodal models, spanning proprietary and open-weight systems as well as dedicated reasoning models such as o1, o1 pro, Gemini 2 Flash Thinking, and QVQ. Under greedy decoding, every one of them scored 0.0% pass@1 on the 100 main questions ^[1]^[2]. Allowing several stochastic samples per question barely moved the needle. Under pass@5, where a question counts as solved if any of five samples is correct, the best model solved only a handful of questions. On the strict 5/5 reliability metric, where a model must answer the same question correctly in all five samples, no model answered a single main question consistently ^[1]^[2].
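The three metrics in this section differ only in how per-question samples are combined. The sketch below assumes each question maps to a list of booleans recording whether each sample was correct. The function names and the toy data are illustrative, and `pass_at_k` is the plain "any of k samples" reading, not the unbiased estimator used in some code-generation work.

```python
from typing import Dict, List


def pass_at_1(greedy: Dict[str, bool]) -> float:
    """Fraction of questions answered correctly by the single greedy response."""
    return sum(greedy.values()) / len(greedy)


def pass_at_k(samples: Dict[str, List[bool]]) -> float:
    """Fraction of questions with at least one correct sample.
    Assumes every question has the same non-zero number of samples."""
    return sum(any(s) for s in samples.values()) / len(samples)


def k_of_k(samples: Dict[str, List[bool]]) -> float:
    """Fraction of questions answered correctly in every sample
    (the 5/5 reliability metric when there are five samples each)."""
    return sum(all(s) for s in samples.values()) / len(samples)


# Invented data: four questions, five samples each.
samples = {
    "q1": [False, False, True, False, False],  # solved once: counts for pass@5 only
    "q2": [False] * 5,
    "q3": [True] * 5,                          # solved every time: counts for both
    "q4": [False] * 5,
}
greedy = {"q1": False, "q2": False, "q3": True, "q4": False}

p1 = pass_at_1(greedy)
p5 = pass_at_k(samples)
r5 = k_of_k(samples)
print(f"pass@1={p1:.2f} pass@5={p5:.2f} 5/5={r5:.2f}")
```

Because `all` is stricter than `any`, the k/k score can never exceed pass@k on the same samples, which is why 5/5 is the harder bar to clear.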

The subquestions tell a more graded story, which is exactly what they were added for. The strongest model on the subquestions was Claude 3.5 Sonnet v2, which reached 24.30% pass@1 by answering 81 of the 334 correctly. Gemini 2 Flash, o1 pro, and GPT-4o followed closely at roughly 22%, 22%, and 21% respectively ^[1]. So the leading systems can handle roughly a quarter of the easier intermediate steps yet still fail to chain them into a single complete answer on the main set. The gap between solving the parts and solving the whole is a large part of what ZeroBench is measuring.

Metric	Best result	Detail
Main questions, pass@1 (greedy)	0.0%	All 20 evaluated models
Main questions, 5/5 reliability	0%	No question answered correctly in all 5 samples by any model
Subquestions, pass@1	24.30% (81/334)	Claude 3.5 Sonnet v2
Subquestions, pass@1 (runners-up)	~22%, ~22%, ~21%	Gemini 2 Flash, o1 pro, GPT-4o
Main set size	100 questions	Manually curated
Subquestion set size	334 questions	Avg ~3.3 per main question

What the errors reveal

A flat zero would be uninformative on its own, so the authors looked closely at how the models fail. The errors are heavily skewed toward visual interpretation rather than logic or missing knowledge ^[1]. The models are not, for the most part, reasoning badly about what they see; they are seeing it wrong in the first place. Common failure modes include miscounting objects, an inability to resolve fine-grained detail, and trouble with spatial relations ^[1]. A model will confidently describe a coherent line of reasoning built on an initial reading of the image that is simply mistaken.

That diagnosis matters because it points at the bottleneck. If the failures were mostly logical, better reasoning techniques or longer chains of thought might close the gap. Because they are mostly perceptual, the harder limit appears to be how well these systems actually parse an image, especially at high resolution and fine detail. The authors note that a breakthrough allowing significantly higher-resolution visual input could be one of the larger levers for raising scores on ZeroBench ^[1]. In other words, the benchmark is less a test of thinking and more a test of looking, and current models are weaker at the looking.

Lightweight but impossible

The pairing of "lightweight" and "impossible" is the benchmark's signature, and the two properties reinforce each other. Lightweight means the whole main set is 100 questions, so even costly reasoning models can be run end to end without a large compute bill, and the subquestions add only a few hundred more items ^[1]^[2]. Impossible means the main set returns 0.0% from every frontier model at release, giving the benchmark the maximum possible headroom and protecting it from immediate saturation ^[2].

The authors also describe the set as diverse and high-quality, since each question was hand-written and checked rather than scraped or templated ^[1]. Quality control continued after release. In the first week the team ran a public red-teaming effort to surface mistakes in the questions and answers, acknowledging that some errors likely remain in a hand-built set of this kind ^[2]^[4]. The data, code, and red-teaming materials were released openly so others can probe and extend the benchmark ^[4]^[6].

The asterisk the authors attach to "Impossible" in the title, written "An Impossible* Visual Benchmark," is a wink at the obvious caveat: impossible today does not mean impossible forever. The claim is about contemporary models at the moment of release, not a permanent ceiling ^[1]^[6].

Relation to MMMU and other multimodal benchmarks

ZeroBench is best understood as a reaction to the benchmarks it sits beside. MMMU and its tougher follow-up MMMU-Pro test broad, expert-level knowledge across many subjects, with thousands of questions drawn from exams and textbooks ^[3]. They are large and comprehensive, and that breadth is exactly why they saturate: once a model has absorbed enough of the world's text and diagrams, much of the set becomes answerable, and scores pile up near the top ^[8]. Other visual benchmarks that came before, covering chart reading, document understanding, and general multimodal question answering, followed the same arc.

ZeroBench inverts several of those choices. It is deliberately small rather than comprehensive, it targets perception and multi-step visual reasoning rather than encyclopedic knowledge, and it is calibrated to defeat the best models rather than to be answerable by good ones. It does not try to replace MMMU as a measure of broad competence. It serves a narrower purpose: a high-headroom stress test that stays discriminative precisely in the regime where the big knowledge benchmarks have gone quiet. The two are complementary, one mapping general breadth and the other marking the current frontier of hard visual reasoning ^[1]^[3].

Limitations

The authors are candid about the trade-offs. The most obvious is size. With only 100 main questions, the benchmark gives a coarse signal, and the subquestions exist largely to recover finer resolution; a model that improves by a few questions moves the headline number by whole percentage points ^[1]^[2]. A small hand-curated set is also more exposed to residual errors in individual questions, which is why the red-teaming pass was part of the release ^[4].
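The coarseness of a 100-question set can be made concrete with a little arithmetic. The sketch below is not from the paper: it treats each question as an independent pass/fail trial, which is a simplifying assumption, and uses the standard Wilson score interval to show how much uncertainty surrounds an observed score.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


n_main = 100
step = 100 / n_main  # percentage points gained per additional correct question

low, high = wilson_interval(0, n_main)
print(f"one question = {step:.1f} percentage point(s)")
print(f"0/{n_main} correct: 95% interval {low:.3%} to {high:.3%}")
```

Under these assumptions, an observed 0 out of 100 is still compatible with a true success rate of up to roughly 3.7%, so a model that later scores 2% or 3% has not necessarily changed much.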

The deeper limitation is built into the premise. A benchmark calibrated to be impossible for today's models is, by construction, a moving target, and the authors expect scores to climb as models gain better visual resolution and stronger reasoning ^[1]. The 0.0% headline is a snapshot, not a property of the task, and it will erode the moment a new model learns to see a little more carefully. There is also a practical worry that publishing the questions and answers openly can shorten a benchmark's life if the data leaks into training sets, a hazard ZeroBench shares with every public benchmark ^[1]. Its makers seem to accept all of this. The point was never a permanent wall but a temporary one set high enough to be useful while it lasts, and a clear way to watch the floor rise underneath it.

References

1. Roberts, J., Taesiri, M. R., et al. "ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models." arXiv preprint arXiv:2502.09696, 2025. https://arxiv.org/abs/2502.09696
2. Roberts, J., et al. "ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models" (HTML version). arXiv, 2025. https://arxiv.org/html/2502.09696v1
3. Yue, X., et al. "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI." CVPR, 2024. https://mmmu-benchmark.github.io/
4. Roberts, J. "zerobench: Code, Data and Red Teaming for ZeroBench." GitHub repository, 2025. https://github.com/jonathan-roberts1/zerobench
5. "ZeroBench dataset." Hugging Face Datasets, 2025. https://huggingface.co/datasets/jonathan-roberts1/zerobench
6. "ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models." Project website. https://zerobench.github.io/
7. Roberts, J. "Jonathan Roberts, Machine Intelligence Laboratory, University of Cambridge." Personal homepage. https://jonathanroberts42.github.io/
8. "MMMU: Massive Multi-discipline Multimodal Understanding Benchmark." LLMIndex benchmark page, 2024. https://llmindex.net/benchmarks/mmmu
9. Roberts, J., et al. "ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models." Semantic Scholar entry, 2025. https://www.semanticscholar.org/paper/7deaa55c18b1cbb315df8ac7a05441c0ec6b38a0
10. "Paper page: ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models." Hugging Face Papers, 2025. https://huggingface.co/papers/2502.09696
