MM-Vet

Updated Jun 8, 2026

Overview

MM-Vet is an AI benchmark for evaluating large multimodal models (LMMs, also called multimodal large language models or MLLMs) on tasks that require combining several core vision-language skills at once. It was introduced in the 2023 paper "MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities" by Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang, a collaboration between the National University of Singapore and Microsoft ^[1]. The paper was posted to arXiv in August 2023 and later published at the International Conference on Machine Learning (ICML) in 2024 ^[1]^[3].

The name "MM-Vet" stands for "Multimodal Veterinarian," reflecting the goal of diagnosing the integrated capabilities of multimodal systems ^[1]. Rather than testing a single skill in isolation, MM-Vet defines six core vision-language capabilities and 16 combinations of them, poses open-ended questions about images, and grades the free-form answers with a large language model acting as a judge. The result is a single 0 to 100 score that is comparable across very different answer styles and question types ^[1]. A follow-up version, MM-Vet v2, was released in August 2024 and extended the benchmark to interleaved image-and-text sequences ^[2].

MM-Vet became one of the standard reference benchmarks for LMM evaluation alongside MMMU, MMBench, MME, and SEED-Bench, and its single-number score is widely cited in model release reports and technical papers ^[1]^[2].

Motivation: integrated capabilities

The authors observed that contemporary multimodal models could already perform striking feats, such as solving a math problem written on a blackboard, reasoning about events and people in a news photograph, or explaining why a meme is funny ^[1]. These behaviors are interesting precisely because they require the model to fuse multiple distinct competencies in a single response. Explaining a visual joke, for example, demands object recognition, world knowledge, and fluent language generation together.

The paper frames three open problems that MM-Vet is designed to address ^[1]:

1. How to systematically structure and define the wide range of complicated multimodal tasks a model might face.
2. How to design an evaluation metric that works across many different question and answer formats, since free-form answers can be short or long, partially or fully correct.
3. How to produce insights that go beyond a single performance ranking, so that developers can see which specific capabilities a model is strong or weak in.

MM-Vet answers the first problem by decomposing complex tasks into a small set of core capabilities and their integrations, the second by using an LLM-based open-ended grader instead of exact-match or multiple-choice scoring, and the third by reporting scores per capability and per integration in addition to the overall total. Many earlier multimodal benchmarks relied on closed-form answers (for example, multiple choice or yes/no) that do not reflect how generative assistants actually respond in open-ended conversation ^[1].

The six capabilities and their integrations

MM-Vet defines six core vision-language (VL) capabilities ^[1]:

Capability	What it tests
Recognition	General visual recognition: objects, attributes, scenes, counting, and identifying people or things
OCR	Optical character recognition and scene-text reading, including text in natural images and documents
Knowledge	Social and visual commonsense, encyclopedic knowledge, and time-sensitive knowledge about the world
Language generation	Producing clear, fluent, and informative free-form text responses
Spatial awareness	Understanding spatial relationships among objects and among text regions in an image
Math	Arithmetic and quantitative reasoning over equations and written problems

From these six capabilities, the benchmark derives 16 combinations, or integrations, that appear in its questions ^[1]. Most questions require more than one capability simultaneously. For instance, explaining a visual joke combines recognition, knowledge, and language generation; reading a floor plan and computing a quantity combines OCR, spatial awareness, and math; and answering an exam question that includes a diagram combines recognition, OCR, and spatial awareness ^[1]. Reporting scores per capability and per integration lets the benchmark show where a model's strengths and weaknesses lie, not just its overall standing.
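The per-capability and per-integration reporting described above can be illustrated with a minimal Python sketch. It assumes that a graded sample counts toward every capability its question requires and toward exactly one integration (its full capability set). The short capability tags, the sample records, and the scores are hypothetical values chosen for illustration, not data from the benchmark.

```python
from collections import defaultdict

# Hypothetical graded samples: "caps" is the set of capabilities a question
# requires, "score" is a soft grade in [0, 1]. All values are made up.
samples = [
    {"caps": {"rec", "know", "gen"}, "score": 0.8},   # e.g. explaining a joke
    {"caps": {"ocr", "spat", "math"}, "score": 0.0},  # e.g. floor-plan maths
    {"caps": {"rec", "ocr", "spat"}, "score": 1.0},   # e.g. exam diagram
    {"caps": {"rec"}, "score": 0.5},
]


def mean_pct(scores):
    """Average a list of soft scores and rescale to the 0-100 range."""
    return 100.0 * sum(scores) / len(scores)


def breakdown(samples):
    """Return (per-capability, per-integration) score dictionaries."""
    per_cap = defaultdict(list)
    per_integration = defaultdict(list)
    for sample in samples:
        # Assumption: a sample contributes to each capability it involves.
        for cap in sample["caps"]:
            per_cap[cap].append(sample["score"])
        # The integration key is the sorted capability set joined with "+".
        key = "+".join(sorted(sample["caps"]))
        per_integration[key].append(sample["score"])
    return (
        {cap: mean_pct(vals) for cap, vals in per_cap.items()},
        {key: mean_pct(vals) for key, vals in per_integration.items()},
    )


cap_scores, integration_scores = breakdown(samples)
```

Because one question feeds several capability averages, the per-capability scores overlap and do not sum to the overall score; they are best read as a profile rather than a partition.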

The version 1 dataset is compact by design. It contains 200 images paired with 218 questions, where 155 of the questions were carefully written and annotated by the authors and the remainder were drawn from online sources ^[1]. Each question has a ground-truth answer used by the grader.

LLM-based evaluation

The defining feature of MM-Vet is its use of an LLM-as-a-judge grader to score open-ended responses. Because answers can vary widely in length, phrasing, and degree of completeness, exact string matching is inadequate. Instead, MM-Vet prompts GPT-4 with the question, the ground-truth answer, and the model's prediction, and asks it to return a correctness score ^[1].
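As a rough sketch of this setup, the Python below assembles a grading request from the three inputs and parses a numeric score out of the judge's reply. The instruction wording, the function names, and the reply format are assumptions made for illustration and do not reproduce the benchmark's actual prompt; the call to the judge model itself and any in-context examples are left out.

```python
import re

# Hypothetical instruction text, not the wording used by the benchmark.
INSTRUCTION = (
    "Compare the ground truth and the prediction, then give a correctness "
    "score between 0.0 (totally wrong) and 1.0 (totally right)."
)


def build_grader_prompt(question, ground_truth, prediction):
    """Combine the question, reference answer, and model output into one prompt."""
    return (
        f"{INSTRUCTION}\n\n"
        f"Question: {question}\n"
        f"Ground truth: {ground_truth}\n"
        f"Prediction: {prediction}\n"
        "Correctness score:"
    )


def parse_score(reply):
    """Return the first number in the reply clamped to [0, 1], or None."""
    match = re.search(r"\d*\.\d+|\d+", reply)
    if match is None:
        return None  # the judge did not return a usable number
    return min(1.0, max(0.0, float(match.group())))


prompt = build_grader_prompt("What is x in the equation?", "3", "x = 3")
score = parse_score("1.0")
```

Returning `None` for an unparseable reply, rather than defaulting to zero, leaves the caller free to retry the judge instead of silently penalising the model under test.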

The grader uses few-shot prompting: a fixed set of in-context examples (seven in the original release) spans short and long answers and partial as well as complete correctness, so that the judge assigns graded, fractional credit rather than a strict pass or fail ^[1]. Each sample receives a soft score between 0 and 1. The benchmark then averages these per-sample scores and reports the result on a 0 to 100 scale: S = (sum of per-sample scores / number of samples) × 100 ^[1]. This produces one unified number that is comparable across answer styles, question types, and models, while still allowing per-capability breakdowns. The approach is part of a broader shift in language and multimodal evaluation toward model-graded benchmarks, related in spirit to text-only judging setups such as MT-Bench.
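The averaging step can be written in a few lines of Python. This is a minimal sketch of the formula as stated in the text; the function name `mm_vet_score` and the example scores are hypothetical.

```python
def mm_vet_score(sample_scores):
    """Average per-sample soft scores in [0, 1] and rescale to 0-100."""
    if not sample_scores:
        raise ValueError("at least one graded sample is required")
    for s in sample_scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range: {s}")
    return 100.0 * sum(sample_scores) / len(sample_scores)


# Four hypothetical graded answers: two fully right, one half right, one wrong.
total = mm_vet_score([1.0, 0.5, 0.0, 1.0])
print(total)  # 62.5
```

A half-correct answer contributes 0.5 rather than 0, which is what separates this soft score from plain accuracy.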

In the original paper, GPT-4V (the vision-enabled version of GPT-4) achieved the top overall score of 67.7 percent, well ahead of other systems of the time ^[1]. The tool-using pipeline MM-ReAct-GPT-4 scored about 44.6 percent, and open-source LLaVA-13B variants scored roughly 32 to 33 percent ^[1]. The wide gap underscored how challenging integrated, open-ended multimodal reasoning was for the open models of 2023.

MM-Vet v2

In August 2024, the same core team (with additional authors Lingfeng Ren and Chung-Ching Lin) released "MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities" ^[2]. A key limitation of the original benchmark was that every question concerned a single image-text pair. MM-Vet v2 adds a seventh core capability, image-text sequence understanding, which evaluates a model's ability to process interleaved sequences of images and text, closer to multi-image and document-style inputs encountered in real use ^[2].

MM-Vet v2 also substantially expands the evaluation set while preserving sample quality. It contains 517 questions, of which 218 are carried over from version 1, covering scenarios that range from everyday life to expert and industry applications ^[2]. The LLM-graded, open-ended scoring methodology is retained.

Reported results on MM-Vet v2 reflect the stronger frontier models of 2024 ^[2]:

Model	MM-Vet v2 score
Claude 3.5 Sonnet	71.8
GPT-4o	71.0
InternVL2-Llama3-76B	68.4
Gemini 1.5 Pro	66.9
GPT-4V	66.3

In the paper's evaluation, Claude 3.5 Sonnet was the best model at 71.8, narrowly ahead of GPT-4o at 71.0, while InternVL2-Llama3-76B led the open-weight models at 68.4 ^[2]. Scores are not directly comparable across the two versions, but the GPT-4V entry falls from 67.7 on version 1 to 66.3 on v2, which is consistent with v2 being a deliberately harder and broader test set.

Significance

MM-Vet helped popularize two ideas that became common in multimodal evaluation: defining tasks by the combination of core capabilities they require, and grading open-ended generative answers with an LLM judge rather than forcing closed-form responses ^[1]. This made the benchmark a better proxy for how conversational multimodal assistants behave in practice, where users ask varied, unconstrained questions about images.

Because MM-Vet outputs a single, easily quoted number together with a per-capability profile, it was adopted as a routine yardstick in LMM development and frequently appears in model cards and research papers next to complementary benchmarks. It is often contrasted with MMMU, a much larger college-exam-style benchmark of about 11,500 questions, most of them multiple-choice, across many academic disciplines, and with MMBench, MME, and SEED-Bench, which use different formats and emphases ^[1]^[4]. Where MMMU stresses breadth of expert knowledge under a structured answer format, MM-Vet emphasizes the integration of everyday vision-language skills under open-ended scoring. The two are widely reported together precisely because they probe different aspects of multimodal competence. The benchmark's code, data, and an online leaderboard are publicly available, which contributed to its broad uptake ^[1]^[5].

References

[1] Yu, Weihao; Yang, Zhengyuan; Li, Linjie; Wang, Jianfeng; Lin, Kevin; Liu, Zicheng; Wang, Xinchao; Wang, Lijuan. "MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities." arXiv:2308.02490, August 2023. https://arxiv.org/abs/2308.02490
[2] Yu, Weihao; Yang, Zhengyuan; Ren, Lingfeng; Li, Linjie; Wang, Jianfeng; Lin, Kevin; Lin, Chung-Ching; Liu, Zicheng; Wang, Lijuan; Wang, Xinchao. "MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities." arXiv:2408.00765, August 2024. https://arxiv.org/abs/2408.00765
[3] Yu, Weihao; et al. "MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities." Proceedings of the 41st International Conference on Machine Learning (ICML 2024). https://proceedings.mlr.press/v235/yu24o.html
[4] Yue, Xiang; et al. "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI." CVPR 2024. https://mmmu-benchmark.github.io/
[5] MM-Vet code, data, and leaderboard. GitHub repository. https://github.com/yuweihao/MM-Vet
