BLINK
Last reviewed
Sources
4 citations
Review status
Source-backed
Revision
v1 · 1,687 words
BLINK is an AI benchmark that evaluates the core visual perception abilities of multimodal large language models (MLLMs). It reformats 14 classic computer vision tasks, such as relative depth estimation, visual correspondence, and jigsaw-puzzle solving, into multiple-choice visual questions. The benchmark is built around tasks that people can solve almost instantly, "within a blink" of an eye, but that current multimodal models handle poorly. Its central finding is that while humans answer these questions with about 95.7 percent accuracy, the strongest multimodal models of early 2024, including GPT-4V and Gemini Pro, scored only about 45 to 51 percent, barely above random guessing and far below specialized vision models [1][2].
BLINK was introduced in the April 2024 paper "BLINK: Multimodal Large Language Models Can See but Not Perceive" by Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, and Ranjay Krishna [1]. The authors are affiliated with the University of Pennsylvania, the University of Washington, the Allen Institute for AI (AI2), the University of California, Davis, Columbia University, and Cornell University [2]. The work was accepted to the European Conference on Computer Vision (ECCV) 2024 [3]. The dataset and evaluation code are publicly released, with the data hosted on Hugging Face and the code on GitHub [2][4].
The benchmark's title captures its thesis: multimodal models can "see," in the sense of processing images and answering high-level questions about them, but they cannot reliably "perceive," meaning they fail at the low-level geometric and visual judgments that underpin everyday human vision [1].
By early 2024, multimodal models had become strong at high-level visual question answering. They could describe scenes, read text in images, reason about charts, and answer open-ended questions about photographs. Many existing benchmarks at the time rewarded exactly this kind of language-mediated reasoning, where a model can often succeed by combining broad world knowledge with a rough read of the image [1].
The BLINK authors argue that this success masked a gap. A large body of classic computer vision concerns perception tasks that are visual at their core and resist being "talked through" in natural language. Judging which of two points in a photo is closer to the camera, matching the same physical point across two viewpoints, or deciding whether a face has been digitally manipulated are problems that humans solve effortlessly and that dedicated vision systems handle well, yet they are awkward to express as a chain of words. The paper frames these as perception-demanding tasks that "resist mediation through natural language" [1].
To test whether modern MLLMs had acquired such perception, the authors deliberately assembled tasks drawn from traditional computer vision rather than from the kind of semantic, knowledge-heavy questions that dominated other suites. The goal was to isolate raw visual ability from the linguistic and encyclopedic knowledge that lets a model paper over weak perception. The result was a benchmark on which humans are near-perfect but leading models stumble, exposing that these perceptual abilities have not yet "emerged" in multimodal LLMs the way many higher-level capabilities have [1].
BLINK consists of 3,807 multiple-choice questions built over 7,358 images [1][2]. Every question is posed as a multiple-choice problem, which makes scoring objective and lets the benchmark report a single accuracy number per task and overall [1].
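Because every item has exactly one correct option, scoring reduces to exact matching of choice letters. The following is a minimal sketch of that idea, not the official evaluation code: the function name, the dictionary layout, and the toy question ids are all hypothetical. It reports both a per-task mean (macro) and a per-question mean (micro), since this article does not specify which aggregation the official leaderboard uses.

```python
from collections import defaultdict


def score(predictions, answers):
    """Score multiple-choice predictions by exact match on the choice letter.

    predictions: dict mapping question id -> predicted choice letter
    answers: dict mapping question id -> (task name, correct choice letter)
    Returns (per_task, macro, micro), all as fractions in [0, 1].
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for qid, (task, gold) in answers.items():
        total[task] += 1
        # A missing prediction is treated as a wrong answer.
        if predictions.get(qid, "").strip().upper() == gold.upper():
            correct[task] += 1
    per_task = {task: correct[task] / total[task] for task in total}
    macro = sum(per_task.values()) / len(per_task)        # mean over tasks
    micro = sum(correct.values()) / sum(total.values())   # mean over questions
    return per_task, macro, micro


# Toy data for illustration only; these are not real BLINK items.
answers = {
    "q1": ("Relative_Depth", "A"),
    "q2": ("Relative_Depth", "B"),
    "q3": ("Counting", "C"),
    "q4": ("Counting", "D"),
    "q5": ("Counting", "A"),
}
predictions = {"q1": "A", "q2": "A", "q3": "C", "q4": "D"}  # q5 unanswered

per_task, macro, micro = score(predictions, answers)
print(per_task, macro, micro)
```

The two aggregates differ whenever tasks have different numbers of questions, which is why a report should state which one it uses.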
Two design features distinguish BLINK from earlier visual question-answering benchmarks. First, many questions require reasoning over more than one image: 8 of the 14 tasks take multiple images as input, and 2,218 of the questions involve multiple images. Second, BLINK incorporates visual prompts drawn directly onto the images, such as colored circles, bounding boxes, and image masks, rather than relying only on text questions and answers. In total, 1,946 questions include such visual prompts. These choices reflect the benchmark's focus on perception: pointing at a location in an image with a circle, or comparing two viewpoints, is often more natural visually than describing it in words [1][2].
The dataset is split into a validation set and a test set of roughly equal size, with about 1,901 validation questions and 1,906 test questions, which together account for the 3,807 total [4]. Answers for the validation set are released publicly, while the test set is used for the official leaderboard [2][4].
BLINK reformats 14 classic computer vision problems into its multiple-choice format. The table below lists each task and what it asks a model to judge [1][2].
| Task | What it tests |
|---|---|
| Relative Depth | Which of two marked points is closer to the camera |
| Spatial Relation | The spatial arrangement of objects in a scene |
| Object Localization | Where an object is located, often relative to marked regions |
| Counting | How many instances of an object appear |
| Visual Correspondence | Matching the same physical point across two images |
| Semantic Correspondence | Matching semantically equivalent parts across different objects |
| Functional Correspondence | Matching points that serve the same function across objects |
| Multi-view Reasoning | Reasoning about a scene seen from different viewpoints |
| Relative Reflectance | Judging the intrinsic reflectance (albedo) of surfaces under shading |
| Forensic Detection | Detecting whether an image has been manipulated or is synthetic |
| Art Style | Deciding whether images share the same artistic style |
| Visual Similarity | Choosing which candidate image is most similar to a reference |
| IQ Test | Solving IQ-test-style visual pattern and analogy puzzles |
| Jigsaw | Selecting the correct piece to complete a jigsaw of an image |
Several of these tasks are reformulations of long-standing computer vision problems with established specialist models. Relative depth relates to monocular depth estimation, visual and semantic correspondence to keypoint matching, relative reflectance to intrinsic image decomposition, and forensic detection to image-manipulation and deepfake detection. By casting them as multiple-choice questions, BLINK makes it possible to evaluate a general-purpose language model on the same underlying capability that a dedicated vision network targets [1].
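To illustrate how the output of a specialist model can be mapped onto the same multiple-choice answer space, the sketch below turns a predicted depth map into an answer to a two-point relative-depth question. It is an illustration under stated assumptions, not the procedure used in the paper: the function name is hypothetical, the depth map is a plain nested list, and smaller values are assumed to mean closer to the camera (some depth models use the opposite convention).

```python
def closer_point_choice(depth_map, point_a, point_b):
    """Return "A" if point_a is closer to the camera than point_b, else "B".

    depth_map is a 2D list indexed as depth_map[row][col]. This sketch assumes
    smaller values mean closer to the camera. Points are (row, col) tuples.
    Ties are resolved in favour of "A", which is an arbitrary choice.
    """
    depth_a = depth_map[point_a[0]][point_a[1]]
    depth_b = depth_map[point_b[0]][point_b[1]]
    return "A" if depth_a <= depth_b else "B"


# Toy 2x3 depth map for illustration only.
toy_depth = [
    [5.0, 5.0, 4.0],
    [3.0, 2.0, 1.0],
]

# Point A at (0, 0) has depth 5.0 and point B at (1, 2) has depth 1.0,
# so B is the closer point.
choice = closer_point_choice(toy_depth, (0, 0), (1, 2))
print(choice)
```

Once the specialist's prediction is reduced to a choice letter, it can be scored with exactly the same accuracy metric as a multimodal language model.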
On BLINK, human annotators reach an average accuracy of 95.7 percent across the 14 tasks, confirming that the questions are easy for people [1]. The leading multimodal models evaluated in the original paper fell far short. GPT-4V (the GPT-4 vision-preview model) scored 51.26 percent, and Gemini Pro scored 45.72 percent. Because the questions vary in the number of answer choices, the expected score from random guessing is 38.09 percent, so GPT-4V was only 13.17 percentage points above chance and Gemini Pro only 7.63 percentage points above chance [1][2]. Other systems tested, including Claude 3 Opus and the open-source LLaVA-1.6-34B model, landed in a similar range in the mid-40s, while smaller open-source 7B and 13B models performed closer to the random baseline [2].
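When questions have different numbers of options, the chance baseline is not a single 1/k value but the average of 1/k over all questions. The sketch below shows that calculation on a made-up mix of question types. The function name and the mix are hypothetical, and the example is not meant to reproduce the 38.09 percent figure, which depends on BLINK's actual distribution of answer choices.

```python
def chance_accuracy(num_choices_per_question):
    """Expected accuracy of uniform random guessing.

    Each question with k options is answered correctly with probability 1/k,
    so the expected accuracy is the mean of 1/k over all questions.
    """
    return sum(1.0 / k for k in num_choices_per_question) / len(num_choices_per_question)


# Hypothetical mix: six 2-choice, two 3-choice, and four 4-choice questions.
mix = [2] * 6 + [3] * 2 + [4] * 4
baseline = chance_accuracy(mix)
print(f"{baseline:.2%}")
```

A mix dominated by two-choice questions pulls the baseline toward 50 percent, while more four-choice questions pull it toward 25 percent.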
The headline comparison from the paper is summarized below [1][2].
| System | BLINK accuracy | Margin over random guessing (percentage points) |
|---|---|---|
| Human | 95.70% | +57.61 |
| GPT-4V (vision preview) | 51.26% | +13.17 |
| Gemini Pro 1.0 | 45.72% | +7.63 |
| Random guessing | 38.09% | 0.00 |
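The margins in the table are absolute differences in accuracy, which is a different quantity from a relative increase over the baseline. The short sketch below recomputes the margins from the reported accuracies and shows the relative figure alongside them for contrast. The accuracy values come from the table above; the function and variable names are illustrative.

```python
RANDOM_BASELINE = 38.09  # chance accuracy reported for BLINK, in percent

reported_accuracy = {
    "Human": 95.70,
    "GPT-4V": 51.26,
    "Gemini Pro": 45.72,
}


def margin_points(accuracy, baseline=RANDOM_BASELINE):
    """Absolute gap over the baseline, in percentage points."""
    return round(accuracy - baseline, 2)


def relative_gain(accuracy, baseline=RANDOM_BASELINE):
    """Gap over the baseline expressed as a fraction of the baseline itself."""
    return (accuracy - baseline) / baseline


margins = {name: margin_points(acc) for name, acc in reported_accuracy.items()}
for name, acc in reported_accuracy.items():
    print(f"{name}: +{margins[name]:.2f} points, "
          f"{relative_gain(acc):.1%} relative to chance")
```

For GPT-4V the absolute margin is 13.17 points, while the relative gain over chance is roughly a third, so the unit matters when quoting either number.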
A second comparison drives the point home. On the individual perception tasks, dedicated specialist vision models substantially outperform the multimodal LLMs, beating the best MLLM by roughly 18 to 57 percentage points depending on the task. For example, the DepthAnything model reaches 97.58 percent on the relative-depth task, compared with GPT-4V's 59.68 percent, a gap of 37.90 percentage points [1]. The gap shows that the capability itself is well within reach of computer vision systems; it is the general-purpose multimodal models that have not internalized it.
The benchmark's leaderboard has continued to track newer models since publication. GPT-4o, released in May 2024, scored about 59.0 percent on the BLINK test set, and GPT-4 Turbo about 53.9 percent, improvements over the original GPT-4V figure but still far below the human level of 95.7 percent [2]. The persistence of this gap across model generations is part of what has kept BLINK in use as a perception probe.
BLINK has been widely cited as evidence of a specific blind spot in multimodal models: strong language-grounded reasoning paired with weak low-level visual perception. By choosing tasks that are trivial for humans and well-solved by specialist vision networks, the benchmark isolates perception from the world knowledge and language skills that can otherwise mask a model's visual weaknesses. The large human-to-model gap, and the even larger gap relative to dedicated vision models, gave researchers a concrete target for improving the visual grounding of MLLMs rather than only their language behavior [1].
The benchmark also helped popularize two methodological ideas in multimodal evaluation. The first is the systematic use of visual prompts, such as circles, boxes, and masks drawn onto images, as part of the question itself, which lets a benchmark ask precise spatial questions that are hard to phrase in text alone. The second is the routine use of multi-image questions, including multi-view and correspondence problems, at a time when many models and benchmarks still assumed a single input image [1].
Within the broader landscape of multimodal benchmarks, BLINK occupies a complementary niche to high-level visual question-answering suites such as MMMU and MME, which emphasize knowledge, reasoning, and discipline-specific understanding. BLINK instead stresses the perceptual substrate beneath those abilities. It has since been used as a standard visual-perception evaluation in model reports and in follow-on research on spatial reasoning and visual grounding, and it remains a reference point for the argument that multimodal models can see but do not yet perceive [1][2].