MMStar

Updated Jul 23, 2026

MMStar (Multi-modal Star) is a vision-language model evaluation benchmark consisting of 1,500 multimodal samples that were filtered from six pre-existing benchmarks to ensure both visual dependency (questions that cannot be answered by a text-only large language model exploiting spurious cues) and minimal text leakage (questions unlikely to have appeared verbatim in common pretraining corpora).^[1] The benchmark was introduced in the paper "Are We on the Right Way for Evaluating Large Vision-Language Models?" by Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao, a collaboration between the University of Science and Technology of China (USTC), The Chinese University of Hong Kong, and Shanghai AI Laboratory.^[1]^[2] The work was first posted to arXiv on 29 March 2024 (revised 9 April 2024) and was accepted at the NeurIPS 2024 conference.^[1]^[3] MMStar evaluates a model along six core capability axes and 18 detailed sub-axes, and introduces two diagnostic metrics, multi-modal gain (MG) and multi-modal leakage (ML), to disentangle genuine visual reasoning from text-side memorization.^[1]^[2]

Infobox

Field	Value
Full name	MMStar (Multi-modal Star)
Type	Vision-language model evaluation benchmark
Sample count	1,500 multiple-choice samples
Core capabilities	6 (coarse perception, fine-grained perception, instance reasoning, logical reasoning, science & technology, mathematics)
Detailed axes	18
Source benchmarks	MMBench, SEED-Bench, MMMU, ScienceQA, AI2D, MathVista
New metrics introduced	Multi-modal Gain (MG), Multi-modal Leakage (ML)
Creators	Lin Chen et al. (USTC, CUHK, Shanghai AI Laboratory)
Paper	arXiv:2403.20330
First arXiv submission	29 March 2024
Accepted at	NeurIPS 2024
Project page	mmstar-benchmark.github.io
Code repository	github.com/MMStar-Benchmark/MMStar
Dataset hosting	Hugging Face (Lin-Chen/MMStar)
License	CC BY 4.0 (paper); dataset terms per Hugging Face

Background

By early 2024 a large number of multimodal evaluation suites had been published to probe the rapidly growing class of large vision-language models (LVLMs), including MMMU, MMBench, SEED-Bench, ScienceQA, AI2D, and MathVista.^[1] Most of these suites consisted of multiple-choice or short-answer items grounded in natural images, scientific diagrams, charts, or screenshots, and they served as the de facto comparison ground for models such as GPT-4V, Gemini Pro Vision, LLaVA variants, and InternVL-Chat.^[1] However, the MMStar authors observed two structural issues that called into question whether reported headline numbers actually measured multi-modal reasoning.

The first issue was vision independence: in many benchmarks the textual question and answer options encoded enough information that a language-only model could answer correctly without ever seeing the image.^[1] Two failure modes contributed to this. In one, the answer was contained in the question itself (for example "What is the shape of the round dirt circle?" trivially admits the answer "circle"); in another, the question relied on world knowledge already memorised by a sufficiently large text model (for example "What is the capital of Nebraska?" requires no visual input even when a map is shown). The MMStar paper showed that on six widely cited benchmarks a closed-source LLM such as Gemini Pro could exceed random baselines by more than 24 percentage points on average without any visual input, and reached 42.9% on MMMU under 0-shot inference and 42.7% under 2-shot.^[1] On ScienceQA and AI2D the team measured "abnormal hit rates" (samples solved by 80% or more of LLM inspectors with text only) of 57.2% and 46.2% respectively.^[1]

The second issue was unintentional data leakage, sometimes called test set contamination. Large language models and large vision-language models are pretrained on web-scale corpora that, in practice, can include or paraphrase benchmark questions. The MMStar authors found that when running LVLMs with the image input removed (a setting they label "LVLM-text"), some models still answered visual-necessary questions correctly at rates well above their underlying LLM backbones.^[1] In one striking case, Sphinx-X-MoE scored 43.6% on MMMU without accessing any image, which was 17.9 percentage points higher than its language-only backbone and exceeded the with-image accuracy of many competing LVLMs.^[1] Yi-VL-34B was 15 percentage points stronger than its LLM backbone on ScienceQA even without images.^[1] These two effects, combined, meant that an apparent ranking on a popular VLM leaderboard could reflect text-side memorisation as much as visual reasoning.

To address both problems jointly, the MMStar authors proposed: (1) a curated benchmark of samples that pass both filters, and (2) two metrics that quantify, per model, how much of a benchmark score is actually attributable to multi-modal training versus to text-side priors.^[1]

Timeline and release

Date	Event
29 March 2024	arXiv v1 of "Are We on the Right Way for Evaluating Large Vision-Language Models?" (arXiv:2403.20330)^[1]
1 April 2024	Paper announcement and project page launch^[2]
2 April 2024	Public release of code repository and Hugging Face dataset^[2]^[4]
9 April 2024	arXiv v2 (revised text, updated tables)^[1]
16 April 2024	Integration into VLMEvalKit and the OpenCompass multi-modal leaderboard^[2]^[5]
25-26 September 2024	NeurIPS 2024 acceptance and OpenReview publication^[3]

Data curation pipeline

MMStar is not a benchmark of newly authored questions. Following the authors' phrase "Stands on the shoulders of giants," it is a curated subset distilled from existing benchmarks.^[1]

Source benchmarks

The team began with six existing multi-modal benchmarks selected to span both natural images and scientific or technical knowledge.^[1]

Source benchmark	Focus	Citation in MMStar paper
MMBench	General multi-modal capability, 20 dimensions	Liu et al., 2023
SEED-Bench	12 evaluation dimensions, image and video	Li et al., 2023
MMMU	College-level multi-discipline, 30 subjects	Yue et al., 2023
ScienceQA	Science domain QA with diagrams	Lu et al., 2022
AI2D	Grade-school science diagrams	Kembhavi et al., 2016
MathVista (Mini)	Mathematical visual reasoning	Lu et al., 2023

The initial union of these sources contained 22,401 candidate samples.^[1]

Coarse automated filtering

The first stage used eight LLM "inspectors" to identify samples that could be solved without vision. The panel consisted of two closed-source models (GPT-4 Turbo and Gemini Pro) and six open-source models with at least 20 billion parameters (InternLM2-20B, Yi-34B, Mixtral-8x7B, DeepSeek-67B, LLaMA2-70B, and Qwen1.5-72B).^[1] Each inspector attempted to answer every candidate sample in a 2-shot in-context configuration without access to the image; the 2-shot setting was chosen because it reduced answer-format refusals compared with 0-shot evaluation.^[1] A sample was kept only if at most two of the eight inspectors answered correctly, meaning that at least six of the eight (75% or more) failed on it without the image. This filter reduced the candidate pool from 22,401 to 11,607 samples.^[1]
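The hit-count rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the helper names `hit_count` and `keep_sample`, the `MAX_HITS` constant, and the toy predictions are hypothetical, and each inspector's output is assumed to be already reduced to a single option letter.

```python
from typing import Mapping

MAX_HITS = 2  # threshold from the text: keep a sample if at most two inspectors are correct


def hit_count(predictions: Mapping[str, str], gold: str) -> int:
    """Count how many text-only inspectors picked the gold option."""
    return sum(1 for answer in predictions.values() if answer == gold)


def keep_sample(predictions: Mapping[str, str], gold: str, max_hits: int = MAX_HITS) -> bool:
    """Return True when the sample survives the coarse vision-dependency filter."""
    return hit_count(predictions, gold) <= max_hits


# Toy example: eight hypothetical inspectors answering one question without the image.
toy_predictions = {
    "inspector_1": "A", "inspector_2": "C", "inspector_3": "B", "inspector_4": "D",
    "inspector_5": "C", "inspector_6": "B", "inspector_7": "A", "inspector_8": "D",
}
print(hit_count(toy_predictions, gold="A"))    # 2
print(keep_sample(toy_predictions, gold="A"))  # True
```

With eight inspectors and four options, two hits is what uniform guessing would produce on average, so the threshold roughly separates chance-level agreement from items that text-only models solve systematically.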

Difficulty stratification with LVLMs

The same 11,607 candidates were then put through 16 leading vision-language models (with images this time) to estimate difficulty. The hit count per sample was recorded and used to split samples into four difficulty bins: easy (12-16 LVLMs correct), moderate (8-11), hard (4-7), and tough (0-3).^[1] This stratification was important because the authors wanted MMStar to span the full difficulty range rather than concentrating only on items that all strong models solve or none solve.
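The four bins can be written as a small lookup on the per-sample hit count. The sketch below assumes the panel size of 16 and the bin edges stated above; the function name `difficulty_bin` and the constant `N_LVLMS` are hypothetical.

```python
N_LVLMS = 16  # size of the LVLM panel described in the text


def difficulty_bin(lvlm_hits: int) -> str:
    """Map the number of LVLMs that answered correctly to a difficulty label."""
    if not 0 <= lvlm_hits <= N_LVLMS:
        raise ValueError(f"hit count must be between 0 and {N_LVLMS}, got {lvlm_hits}")
    if lvlm_hits >= 12:
        return "easy"      # 12-16 LVLMs correct
    if lvlm_hits >= 8:
        return "moderate"  # 8-11
    if lvlm_hits >= 4:
        return "hard"      # 4-7
    return "tough"         # 0-3


print([difficulty_bin(h) for h in (16, 11, 7, 3)])
# ['easy', 'moderate', 'hard', 'tough']
```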

Manual review

Three human experts then conducted a manual pass to enforce three criteria: each retained sample's answer must genuinely require visual content; the final set must cover a comprehensive range of capability axes; and most samples must require advanced multi-modal capabilities for resolution.^[1] After balancing across both capability dimensions and difficulty levels, the experts retained 1,500 high-quality samples, 250 per core capability.^[1]

Capability taxonomy

MMStar organises its 1,500 samples in a two-level taxonomy: six core capabilities, each subdivided into three detailed axes for a total of 18 axes.^[1]

Core capability (250 samples each)	Detailed axes
Coarse Perception (CP)	Image style & quality; image scene & topic; image emotion
Fine-grained Perception (FP)	Attribute & celebrity recognition; object location; object counting
Instance Reasoning (IR)	Single-instance attribute reasoning; cross-instance attribute comparison; cross-instance relation reasoning
Logical Reasoning (LR)	Diagram reasoning; code & sequence reasoning; common reasoning
Science & Technology (ST)	Natural science; engineering; geography & earth science
Mathematics (MA)	Numeric commonsense & calculation; geometry; statistical analysis

Each core capability contains exactly 250 samples, and the paper notes that distribution across the 18 detailed axes is "relatively even" rather than mathematically equal, with three experts balancing both axis coverage and difficulty stratification.^[1] All items are single-image, multiple-choice questions with four options, and the dataset distribution on Hugging Face records 1,500 rows in Parquet format, each containing index, question (with options A/B/C/D embedded), image bytes, gold answer, category, l2_category (the detailed axis), and meta_info pointing back to the source benchmark.^[4]
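A quick way to check the balance claim is to count rows by the two taxonomy fields. The sketch below works on plain dictionaries and uses only the `category` and `l2_category` field names mentioned above; the helper names and the label strings in the toy rows are hypothetical, and loading the actual Parquet file is left out.

```python
from collections import Counter
from typing import Iterable, Mapping, Tuple


def count_by_taxonomy(rows: Iterable[Mapping[str, str]]) -> Tuple[Counter, Counter]:
    """Count rows per core capability and per (core, detailed axis) pair."""
    core_counts: Counter = Counter()
    axis_counts: Counter = Counter()
    for row in rows:
        core_counts[row["category"]] += 1
        axis_counts[(row["category"], row["l2_category"])] += 1
    return core_counts, axis_counts


def is_balanced(core_counts: Counter, expected_per_core: int) -> bool:
    """True when every core capability has exactly the expected number of rows."""
    return all(n == expected_per_core for n in core_counts.values())


# Toy rows with hypothetical label strings; the real dataset has 1,500 rows.
toy_rows = [
    {"category": "coarse perception", "l2_category": "image emotion"},
    {"category": "coarse perception", "l2_category": "image scene and topic"},
    {"category": "mathematics", "l2_category": "geometry"},
    {"category": "mathematics", "l2_category": "geometry"},
]
core, axes = count_by_taxonomy(toy_rows)
print(core["mathematics"])                      # 2
print(axes[("mathematics", "geometry")])        # 2
print(is_balanced(core, expected_per_core=2))   # True
```

On the full dataset the same check would be `is_balanced(core, expected_per_core=250)`, while the per-axis counter would be expected to show only approximately even counts.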

Multi-modal Gain and Multi-modal Leakage metrics

A central methodological contribution of the paper is a pair of benchmark-specific metrics meant to be reported alongside raw accuracy.^[1]

For a given LVLM, the authors define three quantities:

S_v: the LVLM's accuracy on a benchmark with full visual input.
S_wv: the same LVLM's accuracy on the same benchmark with images removed, so that only the textual question and options are supplied (the "LVLM-text" setting).
S_t: the accuracy of the LVLM's underlying language model backbone, evaluated without any multi-modal training and without images.

The two metrics are then defined as:

Multi-modal Gain (MG) = S_v - S_wv. This measures how much the model gains from actually seeing the image, controlling for whatever it can already do from text alone after multi-modal training.^[1]
Multi-modal Leakage (ML) = max(0, S_wv - S_t). This measures how far the LVLM-text setting outperforms the original LLM backbone. The value is clamped at zero because multi-modal fine-tuning can also degrade text-only performance, and a negative difference is not evidence of leakage.^[1]

A model with high MG and low ML is one whose multi-modal training is contributing genuine visual reasoning rather than test-set memorisation. Because MG and ML are computed in a benchmark-specific manner, the paper recommends reporting them on MMStar itself and on each external benchmark a model claims a score on, in order to expose contamination patterns that headline accuracy hides.^[1] The official evaluation protocol requires submitters to upload three result files (LVLM with images, LVLM without images, base LLM without images) so that all three quantities can be computed.^[2]^[5]
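The two definitions translate directly into code. The following is a minimal sketch, assuming all three accuracies are expressed in percentage points; the function names and the example scores are hypothetical and are not taken from the paper.

```python
def multimodal_gain(s_v: float, s_wv: float) -> float:
    """MG = S_v - S_wv: what the model gains from actually seeing the image."""
    return s_v - s_wv


def multimodal_leakage(s_wv: float, s_t: float) -> float:
    """ML = max(0, S_wv - S_t): text-only advantage of the LVLM over its LLM backbone."""
    return max(0.0, s_wv - s_t)


# Hypothetical model A: the LVLM without images beats its backbone by 5 points.
print(multimodal_gain(s_v=50.0, s_wv=30.0))     # 20.0
print(multimodal_leakage(s_wv=30.0, s_t=25.0))  # 5.0

# Hypothetical model B: multi-modal training lowered text-only accuracy,
# so the difference is negative and ML is clamped to zero.
print(multimodal_leakage(s_wv=20.0, s_t=25.0))  # 0.0
```

Both quantities depend on the benchmark the three scores were measured on, so values computed on different benchmarks are not directly comparable.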

Evaluation protocol

MMStar uses a 0-shot multiple-choice protocol in which the model is given the question, four options labelled A through D, and the image, and is expected to output a single letter.^[1]^[2] The benchmark is offline (all 1,500 samples are released; there is no held-out private split), which makes reproducibility easy but also means an LVLM trainer who wishes to game the score can in principle include MMStar samples in training data; the MG/ML metrics partly defend against this by surfacing test leakage relative to the model's own text-only baseline.^[1]
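Scoring under this protocol comes down to reducing a free-form model response to one option letter and comparing it with the gold answer. The sketch below shows one simple way to do that; the regular expression and the function names are assumptions made for illustration, and the answer matching used by the reference implementation or by VLMEvalKit may differ.

```python
import re
from typing import Optional


def extract_letter(response: str) -> Optional[str]:
    """Return the first standalone option letter (A-D) in the response, if any."""
    match = re.search(r"\b([ABCD])\b", response)
    return match.group(1) if match else None


def is_correct(response: str, gold: str) -> bool:
    """A response counts as correct only if its extracted letter equals the gold letter."""
    return extract_letter(response) == gold


print(extract_letter("The answer is B."))      # B
print(extract_letter("(C) a red car"))         # C
print(extract_letter("I cannot tell"))         # None
print(is_correct("The answer is B.", "B"))     # True
```

A heuristic this simple misreads responses that begin with the English article "A", which is one reason evaluation toolkits typically apply stricter prompting or more careful matching.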

The reference implementation in the MMStar GitHub repository computes accuracy per detailed axis and per core capability, then aggregates to an overall accuracy.^[2] Integration into the OpenCompass-maintained VLMEvalKit means that MMStar can be evaluated as one of more than 80 benchmarks in a unified one-command pipeline, alongside MMBench, SEED-Bench, MMMU, AI2D, MathVista, OCRBench, HallusionBench, and many others.^[5]
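The aggregation described above can be sketched as follows. This is not the repository's code: the function name `accuracy_report`, the record layout (core capability, detailed axis, correctness flag), and the toy labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, bool]  # (core capability, detailed axis, answered correctly)


def _percent(flags: List[bool]) -> float:
    """Accuracy in percent for a non-empty list of correctness flags."""
    return 100.0 * sum(flags) / len(flags)


def accuracy_report(records: Iterable[Record]) -> Dict[str, object]:
    """Compute accuracy per detailed axis, per core capability, and overall."""
    by_core: Dict[str, List[bool]] = defaultdict(list)
    by_axis: Dict[Tuple[str, str], List[bool]] = defaultdict(list)
    everything: List[bool] = []
    for core, axis, correct in records:
        by_core[core].append(correct)
        by_axis[(core, axis)].append(correct)
        everything.append(correct)
    return {
        "overall": _percent(everything),
        "core": {name: _percent(flags) for name, flags in by_core.items()},
        "axis": {name: _percent(flags) for name, flags in by_axis.items()},
    }


toy_records = [
    ("coarse perception", "image emotion", True),
    ("coarse perception", "image emotion", False),
    ("mathematics", "geometry", True),
    ("mathematics", "geometry", True),
]
report = accuracy_report(toy_records)
print(report["overall"])                    # 75.0
print(report["core"]["coarse perception"])  # 50.0
print(report["core"]["mathematics"])        # 100.0
```

Because every core capability holds the same number of samples (250), the sample-level overall accuracy computed here equals the unweighted mean of the six core-capability accuracies on the full benchmark.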

Reported results from the original paper

The MMStar paper evaluated 16 leading LVLMs on the 1,500-sample benchmark and reported accuracy together with MG and ML.^[1] Selected results, as reported in the v2 paper and the project leaderboard, are summarised below.

Model	Avg. accuracy	MG	ML
GPT-4V (high resolution)	57.1%	43.6	1.3
InternLM-XComposer2	55.4%	28.1	7.5
LLaVA-Next-34B	52.1%	29.4	2.4
GPT-4V (low resolution)	46.1%	32.6	1.3
InternVL-Chat-v1.2 (40B)	43.7%	32.6	0.0
Gemini Pro Vision	42.6%	27.4	0.0
Sphinx-X-MoE (57B)	38.9%	14.8	1.0
Yi-VL-34B	36.1%	18.8	0.0
LLaVA-1.5-7B	30.3%	10.7	0.0
Random baseline	24.6%	n/a	n/a

Two patterns stand out. First, even the strongest model in the original release (GPT-4V at high resolution) scored 57.1% overall, below the 60% mark that the authors treat as a passing score; the paper notes that no evaluated model reached a "passing average score" on the four hardest capability dimensions (fine-grained perception, logical reasoning, science & technology, and mathematics).^[1] Second, MG and ML separate models that look similar on accuracy. GPT-4V combined the highest MG (43.6) with one of the smallest MLs (1.3), suggesting both effective multi-modal training and limited leakage. InternLM-XComposer2 achieved high accuracy (55.4%) but with the largest leakage among the listed models (ML = 7.5), indicating that a meaningful portion of its raw score came from non-visual cues. Several open models report ML = 0 because their LVLM-text setting did not outperform the underlying LLM backbone, which the paper interprets as evidence of clean evaluation rather than evidence of high capability.^[1]

The accompanying analysis on six external benchmarks (MMMU, MMBench-EN-Dev, ScienceQA-Test, AI2D-Test, SEED-Image, MathVista-Mini) showed that models such as Sphinx-X-MoE (ML up to 14% on average) and Monkey-Chat (also around 14%) had substantial multi-modal training leakage, while GPT-4V and Gemini Pro Vision had relatively low ML on those external benchmarks.^[1]

Adoption in modern VLM technical reports

Because MMStar is small (1,500 samples), inexpensive to run, and explicitly filtered for visual dependency, it has become a near-default entry in modern VLM evaluation tables. Reports from major model families published between 2024 and 2026 include MMStar scores alongside the older benchmarks the MMStar authors criticised:

The Qwen team reports MMStar scores for both Qwen2-VL and Qwen2.5-VL, with the Qwen2.5-VL technical report (released February 2025) citing MMStar as one of the headline general VLM benchmarks alongside MMMU, MMBench, and MathVista.^[6]
The InternVL family from Shanghai AI Laboratory and partners reports MMStar throughout the InternVL series. The InternVL3.5 technical report (2025) presents MMStar scores for variants up to InternVL3.5-241B-A28B, which it reports at 77.9 on MMStar alongside an overall composite of 84.2 across a multi-benchmark evaluation.^[7]
Public comparisons of GPT-4o, Claude 3.5 Sonnet, and InternVL2 in third-party VLM leaderboards and benchmark papers (for example MM-Vet v2) routinely include MMStar as a complementary metric.^[8]
The Open VLM Leaderboard hosted by OpenCompass on Hugging Face Spaces uses MMStar as one of its component benchmarks for ranking open and closed VLMs.^[5]

Because of these adoption patterns, MMStar functions less as a single ranking and more as an "honest" companion to older benchmarks: reports that omit MMStar are increasingly conspicuous, and reports that present a strong MMStar number with a modest ML are taken as evidence of genuine visual reasoning rather than text-side priors.

Significance

MMStar's significance lies less in being the largest or most expensive multi-modal evaluation and more in formalising, quantifying, and operationalising the critique that vision-language model benchmarks can be solved without vision.^[1] By tying the critique to two concrete diagnostic metrics, MG and ML, the paper changed how subsequent VLM technical reports present their numbers, encouraging the now-common practice of also reporting LVLM-text and LLM-only baselines.^[1]^[6] The downstream effect has been broader scrutiny of training data hygiene for VLMs, with several follow-up benchmarks (such as MMT-Bench and MM-Vet v2) explicitly designing for visual indispensability from the start rather than relying on a post hoc filter.^[9]

The construction methodology (an LLM inspector panel, an LVLM difficulty stratification, and a human expert review) has itself influenced other curated benchmarks built on top of existing data, particularly those targeted at exposing shortcuts in vision-language model evaluation.^[1]

Limitations and criticisms

Several limitations of MMStar are noted by the authors themselves or have been raised in subsequent literature.

Static and public dataset. All 1,500 samples are released publicly, which means future models can inadvertently or deliberately include MMStar in their training corpora. The MG/ML metrics mitigate but do not eliminate this risk, because a sufficiently strong LLM backbone could memorise leaked questions and inflate the S_t baseline, thereby driving ML to zero even when leakage exists.^[1]
Single image per sample. MMStar items are all single-image multiple choice, so the benchmark does not directly probe multi-image reasoning, video understanding, document-level OCR with long context, or interleaved image-text dialogue.^[4]
English-only. The dataset is released in English (Hugging Face language tag: en), and the source benchmarks it draws from are predominantly English.^[4]
Coverage of new modalities. Modern VLMs increasingly handle long videos, audio, and 3D content; MMStar inherits the modality coverage of its 2022-2023 source benchmarks and does not cover these newer modalities.^[1]
Capability bins remain coarse. The six core capabilities deliberately abstract over many specific sub-skills (for example, fine-grained perception lumps celebrity recognition, object localisation, and counting), so a single MMStar number can hide substantial intra-bin variation.^[1]
Filter dependence on the 2024 LLM panel. Vision-independence was determined by an inspector panel of LLMs available in early 2024 (GPT-4 Turbo, Gemini Pro, Mixtral-8x7B, DeepSeek-67B, LLaMA2-70B, Qwen1.5-72B, Yi-34B, InternLM2-20B). Stronger models released after the cut-off might solve some "vision-required" MMStar items from text alone, eroding the filter over time.^[1]

These caveats motivate continued reporting of MMStar alongside, not in place of, other targeted benchmarks.

Related work and comparison

MMStar sits in a cluster of multi-modal benchmarks released between 2023 and 2024 and is best understood as a "purified subset" sibling rather than a replacement for any one of them.

Benchmark	Year	Approx. size	Primary lens	Relation to MMStar
MMMU	2023	~11,500 college-level items	Multi-discipline expert knowledge	Source benchmark; MMMU samples appear in MMStar after filtering
MMBench	2023	~3,200 items in EN-Dev	20-dimension general capability	Source benchmark; natural-image subset feeds MMStar CP/FP
SEED-Bench	2023	~19,000 (incl. video)	12 evaluation dimensions	Source benchmark; image split feeds MMStar
ScienceQA	2022	~21,000 K-12 science items	Diagram + science QA	Source benchmark; high LLM hit rate motivated MMStar
AI2D	2016	~4,500 diagram items	Grade-school diagram QA	Source benchmark; high abnormal hit rate (46.2%)
MathVista	2023	~6,000 (Mini: 1,000)	Mathematical visual reasoning	Source benchmark; provides math axis content
MMStar	2024	1,500	Visual-indispensable, leakage-aware	Distilled meta-benchmark with MG/ML metrics

Among the source benchmarks, MMMU and MathVista remain widely cited as standalone tests, while MMStar functions as a leakage-aware overlay that asks whether an apparent strength on those tests is grounded in vision. The MMMU-Pro benchmark, released later, addressed some of the same concerns by reformatting MMMU into a more vision-dependent variant, complementing rather than superseding MMStar.

References

1. Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, Feng Zhao, "Are We on the Right Way for Evaluating Large Vision-Language Models?", arXiv, 2024-03-29 (v1) / 2024-04-09 (v2). https://arxiv.org/abs/2403.20330. Accessed 2026-05-21.
2. MMStar-Benchmark, "MMStar: [NeurIPS 2024] Evaluation code for the paper 'Are We on the Right Way for Evaluating Large Vision-Language Models?'", GitHub repository, 2024-04-02. https://github.com/MMStar-Benchmark/MMStar. Accessed 2026-05-21.
3. Lin Chen et al., "Are We on the Right Way for Evaluating Large Vision-Language Models?", OpenReview (NeurIPS 2024), 2024-09-25. https://openreview.net/forum?id=evP9mxNNxJ. Accessed 2026-05-21.
4. Lin Chen, "Lin-Chen/MMStar Dataset", Hugging Face Datasets, 2024-04-02. https://huggingface.co/datasets/Lin-Chen/MMStar. Accessed 2026-05-21.
5. OpenCompass, "VLMEvalKit: Open-source evaluation toolkit of large multi-modality models", GitHub repository (Open VLM Leaderboard), 2024-2026. https://github.com/open-compass/VLMEvalKit. Accessed 2026-05-21.
6. Qwen Team, Alibaba Group, "Qwen2.5-VL Technical Report", arXiv:2502.13923, 2025-03-05. https://arxiv.org/pdf/2502.13923. Accessed 2026-05-21.
7. InternVL Team, "InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency", arXiv:2508.18265, 2025-08-25. https://arxiv.org/html/2508.18265v1. Accessed 2026-05-21.
8. Weihao Yu et al., "MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities", arXiv:2408.00765, 2024-08-01. https://arxiv.org/pdf/2408.00765. Accessed 2026-05-21.
9. Kaining Ying et al., "MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI", arXiv:2404.16006, 2024-04-24. https://arxiv.org/pdf/2404.16006. Accessed 2026-05-21.
