MM-Vet
Last reviewed
Jun 8, 2026
Sources
5 citations
Review status
Source-backed
Revision
v1 · 1,502 words
MM-Vet is an AI benchmark for evaluating large multimodal models (LMMs, also called multimodal large language models or MLLMs) on tasks that require combining several core vision-language skills at once. It was introduced in the 2023 paper "MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities" by Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang, a collaboration between the National University of Singapore and Microsoft [1]. The benchmark was posted to arXiv in August 2023 and the paper was later published at the International Conference on Machine Learning (ICML) in 2024 [1][3].
The name "MM-Vet" stands for "Multimodal Veterinarian," reflecting the goal of diagnosing the integrated capabilities of multimodal systems [1]. Rather than testing a single skill in isolation, MM-Vet defines six core vision-language capabilities and 16 combinations of them, poses open-ended questions about images, and grades the free-form answers with a large language model acting as a judge. The result is a single 0 to 100 score that is comparable across very different answer styles and question types [1]. A follow-up version, MM-Vet v2, was released in August 2024 and extended the benchmark to interleaved image-and-text sequences [2].
MM-Vet became one of the standard reference benchmarks for LMM evaluation alongside MMMU, MMBench, MME, and SEED-Bench, and its single-number score is widely cited in model release reports and technical papers [1][2].
The authors observed that contemporary multimodal models could already perform striking feats, such as solving a math problem written on a blackboard, reasoning about events and people in a news photograph, or explaining why a meme is funny [1]. These behaviors are interesting precisely because they require the model to fuse multiple distinct competencies in a single response. Explaining a visual joke, for example, demands object recognition, world knowledge, and fluent language generation together.
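The paper frames three open problems that MM-Vet is designed to address [1]:
1. How to systematically structure and evaluate complicated multimodal tasks.
2. How to design evaluation metrics that work well across question and answer types.
3. How to give insight into a model beyond a simple performance ranking.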
MM-Vet answers the first problem by decomposing complex tasks into a small set of core capabilities and their integrations, the second by using an LLM-based open-ended grader instead of exact-match or multiple-choice scoring, and the third by reporting scores per capability and per integration rather than only an overall ranking. Many earlier multimodal benchmarks relied on closed-form answers (for example, multiple choice or yes/no) that do not reflect how generative assistants actually respond in open-ended conversation [1].
MM-Vet defines six core vision-language (VL) capabilities [1]:
| Capability | What it tests |
|---|---|
| Recognition | General visual recognition: objects, attributes, scenes, counting, and identifying people or things |
| OCR | Optical character recognition and scene-text reading, including text in natural images and documents |
| Knowledge | Social and visual commonsense, encyclopedic knowledge, and time-sensitive knowledge about the world |
| Language generation | Producing clear, fluent, and informative free-form text responses |
| Spatial awareness | Understanding spatial relationships among objects and among text regions in an image |
| Math | Arithmetic and quantitative reasoning over equations and written problems |
From these six capabilities, the benchmark derives 16 combinations, or integrations, that appear in its questions [1]. Most questions require more than one capability simultaneously. For instance, explaining a visual joke combines recognition, knowledge, and language generation; reading a floor plan and computing a quantity combines OCR, spatial awareness, and math; and answering an exam question that includes a diagram combines recognition, OCR, and spatial awareness [1]. Reporting scores per capability and per integration lets the benchmark show where a model's strengths and weaknesses lie, not just its overall standing.
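To make the idea of an integration concrete, the sketch below represents each one as a set of capability labels and counts how many combinations are possible in principle. The short labels (`rec`, `ocr`, and so on) and the task names are assumptions chosen for illustration; only the three example combinations come from the text above.

```python
from itertools import combinations

# Short labels for the six core capabilities (label names are assumed here).
CAPABILITIES = ("rec", "ocr", "know", "gen", "spat", "math")

# Every non-empty subset of the six capabilities.
all_subsets = [
    frozenset(combo)
    for size in range(1, len(CAPABILITIES) + 1)
    for combo in combinations(CAPABILITIES, size)
]

# The three example integrations described in the text (task names are hypothetical).
examples = {
    "explain a visual joke": frozenset({"rec", "know", "gen"}),
    "compute a quantity from a floor plan": frozenset({"ocr", "spat", "math"}),
    "exam question with a diagram": frozenset({"rec", "ocr", "spat"}),
}

print(len(all_subsets))  # 2**6 - 1 = 63 possible combinations
for task, integration in examples.items():
    print(task, sorted(integration))
```

Six capabilities allow 63 non-empty combinations, so the 16 integrations in the benchmark are a selected subset rather than an exhaustive enumeration.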
The version 1 dataset is compact by design. It contains 200 images paired with 218 questions, where 155 of the questions were carefully written and annotated by the authors and the remainder were drawn from online sources [1]. Each question has a ground-truth answer used by the grader.
The defining feature of MM-Vet is its use of an LLM-as-a-judge grader to score open-ended responses. Because answers can vary widely in length, phrasing, and degree of completeness, exact string matching is inadequate. Instead, MM-Vet prompts GPT-4 with the question, the ground-truth answer, and the model's prediction, and asks it to return a correctness score [1].
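A minimal sketch of one grading request is shown below. The prompt wording, the function names, and the parsing rule are all assumptions for illustration; this is not the prompt shipped with MM-Vet, and the call to the judge model itself is deliberately left out.

```python
import re

def build_grader_prompt(question, ground_truth, prediction):
    """Assemble one grading request (placeholder wording, not MM-Vet's actual prompt)."""
    return (
        "Compare the ground truth and the prediction, and give a correctness "
        "score between 0.0 and 1.0.\n"
        f"Question: {question}\n"
        f"Ground truth: {ground_truth}\n"
        f"Prediction: {prediction}\n"
        "Score:"
    )

def parse_score(reply):
    """Take the first number in the judge's reply and clamp it to [0, 1]."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"no score found in reply: {reply!r}")
    return min(1.0, max(0.0, float(match.group())))

prompt = build_grader_prompt(
    question="What is the total on the receipt?",  # hypothetical sample
    ground_truth="12.50",
    prediction="The receipt total is $12.50.",
)
```

Clamping the parsed value guards against a judge reply that falls outside the expected range; a stricter harness might instead reject such replies and retry.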
The grader uses few-shot prompting, with a fixed set of in-context examples (seven in the original release) that span short and long answers and partial as well as complete correctness, so that the judge assigns graded, fractional credit rather than a strict pass or fail [1]. Each sample receives a soft score between 0 and 1. The benchmark then averages these per-sample scores and reports the result on a 0 to 100 scale: S = (Σ s_i / N) × 100, where s_i is the score of sample i and N is the number of samples [1]. This produces one unified number that is comparable across answer styles, question types, and models, while still allowing per-capability breakdowns. The approach is part of a broader shift in language and multimodal evaluation toward model-graded benchmarks, related in spirit to text-only judging setups such as MT-Bench.
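The averaging step can be sketched as follows. The sample tags and scores are invented, the function names are hypothetical, and the per-capability rule (average over every sample whose tag includes that capability) is an assumption about how the breakdown is computed rather than a quotation of the official scoring script.

```python
from collections import defaultdict

def to_percent(scores):
    """Average soft scores in [0, 1] and rescale to the 0-100 range."""
    if not scores:
        raise ValueError("need at least one graded sample")
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("each score must lie in [0, 1]")
    return 100.0 * sum(scores) / len(scores)

def score_report(graded):
    """graded: list of (set of capability labels, soft score) pairs."""
    by_capability = defaultdict(list)
    for capabilities, score in graded:
        for capability in capabilities:
            by_capability[capability].append(score)
    total = to_percent([score for _, score in graded])
    per_capability = {c: to_percent(s) for c, s in by_capability.items()}
    return total, per_capability

# Invented example: four graded samples with their required capabilities.
graded = [
    ({"rec", "know", "gen"}, 0.5),
    ({"ocr", "spat", "math"}, 1.0),
    ({"rec", "ocr", "spat"}, 0.0),
    ({"ocr", "spat", "math"}, 0.5),
]
total, per_capability = score_report(graded)
print(total)                    # 50.0
print(per_capability["math"])   # 75.0
```

Because a sample that needs several capabilities counts toward each of them, the per-capability numbers overlap and do not sum or average back to the total.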
In the original paper, GPT-4V (the vision-enabled version of GPT-4) achieved the top overall score of 67.7 percent, well ahead of other systems of the time [1]. The tool-using pipeline MM-ReAct-GPT-4 scored about 44.6 percent, and open-source LLaVA-13B variants scored roughly 32 to 33 percent [1]. The wide gap underscored how challenging integrated, open-ended multimodal reasoning was for the open models of 2023.
In August 2024, the same core team (with additional authors Lingfeng Ren and Chung-Ching Lin) released "MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities" [2]. A key limitation of the original benchmark was that every question concerned a single image-text pair. MM-Vet v2 adds a seventh core capability, image-text sequence understanding, which evaluates a model's ability to process interleaved sequences of images and text, closer to multi-image and document-style inputs encountered in real use [2].
MM-Vet v2 also substantially expands the evaluation set while preserving sample quality. It contains 517 questions, of which the 218 questions from version 1 are carried over, covering scenarios that range from everyday life to expert and industry applications [2]. The LLM-graded, open-ended scoring methodology is retained.
Reported results on MM-Vet v2 reflect the stronger frontier models of 2024 [2]:
| Model | MM-Vet v2 score |
|---|---|
| Claude 3.5 Sonnet | 71.8 |
| GPT-4o | 71.0 |
| InternVL2-Llama3-76B | 68.4 |
| Gemini 1.5 Pro | 66.9 |
| GPT-4V | 66.3 |
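In the paper's evaluation, Claude 3.5 Sonnet was the best model at 71.8, narrowly ahead of GPT-4o at 71.0, while InternVL2-Llama3-76B led the open-weight models at 68.4 [2]. The top v2 scores are higher than version 1's headline number of 67.7 because they come from newer models; GPT-4V, the one system that appears in both sets of results above, scores 66.3 on v2 against 67.7 in the original paper, which is consistent with v2 being a deliberately harder and broader test set.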
MM-Vet helped popularize two ideas that became common in multimodal evaluation: defining tasks by the combination of core capabilities they require, and grading open-ended generative answers with an LLM judge rather than forcing closed-form responses [1]. This made the benchmark a better proxy for how conversational multimodal assistants behave in practice, where users ask varied, unconstrained questions about images.
Because MM-Vet outputs a single, easily quoted number together with a per-capability profile, it was adopted as a routine yardstick in LMM development and frequently appears in model cards and research papers next to complementary benchmarks. It is often contrasted with MMMU, a much larger college-exam-style benchmark of about 11,500 questions, most of them multiple-choice, across many academic disciplines, and with MMBench, MME, and SEED-Bench, which use different formats and emphases [1][4]. Where MMMU stresses breadth of expert knowledge under a structured answer format, MM-Vet emphasizes the integration of everyday vision-language skills under open-ended scoring. The two are often reported together because they probe different aspects of multimodal competence. The benchmark's code, data, and an online leaderboard are publicly available, which contributed to its broad uptake [1].