MMStar
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,415 words
MMStar (Multi-modal Star) is a vision-language model evaluation benchmark consisting of 1,500 multimodal samples that were filtered from six pre-existing benchmarks to ensure both visual dependency (questions that cannot be answered by a text-only large language model exploiting spurious cues) and minimal text leakage (questions unlikely to have appeared verbatim in common pretraining corpora).[^1] The benchmark was introduced in the paper "Are We on the Right Way for Evaluating Large Vision-Language Models?" by Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao, a collaboration between the University of Science and Technology of China (USTC), The Chinese University of Hong Kong, and Shanghai AI Laboratory.[^1][^2] The work was first posted to arXiv on 29 March 2024 (revised 9 April 2024) and was accepted at the NeurIPS 2024 conference.[^1][^3] MMStar evaluates a model along six core capability axes and 18 detailed sub-axes, and introduces two diagnostic metrics, multi-modal gain (MG) and multi-modal leakage (ML), to disentangle genuine visual reasoning from text-side memorization.[^1][^2]
| Field | Value |
|---|---|
| Full name | MMStar (Multi-modal Star) |
| Type | Vision-language model evaluation benchmark |
| Sample count | 1,500 multiple-choice samples |
| Core capabilities | 6 (coarse perception, fine-grained perception, instance reasoning, logical reasoning, science & technology, mathematics) |
| Detailed axes | 18 |
| Source benchmarks | MMBench, SEED-Bench, MMMU, ScienceQA, AI2D, MathVista |
| New metrics introduced | Multi-modal Gain (MG), Multi-modal Leakage (ML) |
| Creators | Lin Chen et al. (USTC, CUHK, Shanghai AI Laboratory) |
| Paper | arXiv:2403.20330 |
| First arXiv submission | 29 March 2024 |
| Accepted at | NeurIPS 2024 |
| Project page | mmstar-benchmark.github.io |
| Code repository | github.com/MMStar-Benchmark/MMStar |
| Dataset hosting | Hugging Face (Lin-Chen/MMStar) |
| License | CC BY 4.0 (paper); dataset terms per Hugging Face |
By early 2024 a large number of multimodal evaluation suites had been published to probe the rapidly growing class of large vision-language models (LVLMs), including MMMU, MMBench, SEED-Bench, ScienceQA, AI2D, and MathVista.[^1] Most of these suites consisted of multiple-choice or short-answer items grounded in natural images, scientific diagrams, charts, or screenshots, and they served as the de facto comparison ground for models such as GPT-4V, Gemini Pro Vision, LLaVA variants, and InternVL-Chat.[^1] However, the MMStar authors observed two structural issues that called into question whether reported headline numbers actually measured multi-modal reasoning.
The first issue was vision independence: in many benchmarks the textual question and answer options encoded enough information that a language-only model could answer correctly without ever seeing the image.[^1] Two failure modes contributed to this. In one, the answer was contained in the question itself (for example "What is the shape of the round dirt circle?" trivially admits the answer "circle"); in another, the question relied on world knowledge already memorised by a sufficiently large text model (for example "What is the capital of Nebraska?" requires no visual input even when a map is shown). The MMStar paper showed that on six widely cited benchmarks a closed-source LLM such as Gemini Pro could exceed random baselines by more than 24 percentage points on average without any visual input, and reached 42.9% on MMMU under 0-shot inference and 42.7% under 2-shot.[^1] On ScienceQA and AI2D the team measured "abnormal hit rates" (samples solved by 80% or more of LLM inspectors with text only) of 57.2% and 46.2% respectively.[^1]
The second issue was unintentional data leakage, sometimes called test set contamination. Large language models and large vision-language models are pretrained on web-scale corpora that, in practice, can include or paraphrase benchmark questions. The MMStar authors found that when running LVLMs with the image input removed (a setting they label "LVLM-text"), some models still answered visual-necessary questions correctly at rates well above their underlying LLM backbones.[^1] In one striking case, Sphinx-X-MoE scored 43.6% on MMMU without accessing any image, which was 17.9 percentage points higher than its language-only backbone and exceeded the with-image accuracy of many competing LVLMs.[^1] Yi-VL-34B was 15 percentage points stronger than its LLM backbone on ScienceQA even without images.[^1] These two effects, combined, meant that an apparent ranking on a popular VLM leaderboard could reflect text-side memorisation as much as visual reasoning.
To address both problems jointly, the MMStar authors proposed: (1) a curated benchmark of samples that pass both filters, and (2) two metrics that quantify, per model, how much of a benchmark score is actually attributable to multi-modal training versus to text-side priors.[^1]
| Date | Event |
|---|---|
| 29 March 2024 | arXiv v1 of "Are We on the Right Way for Evaluating Large Vision-Language Models?" (arXiv:2403.20330)[^1] |
| 1 April 2024 | Paper announcement and project page launch[^2] |
| 2 April 2024 | Public release of code repository and Hugging Face dataset[^2][^4] |
| 9 April 2024 | arXiv v2 (revised text, updated tables)[^1] |
| 16 April 2024 | Integration into VLMEvalKit and the OpenCompass multi-modal leaderboard[^2][^5] |
| 25-26 September 2024 | NeurIPS 2024 acceptance and OpenReview publication[^3] |
MMStar is not a benchmark of newly authored questions. Following the authors' phrase "Stands on the shoulders of giants," it is a curated subset distilled from existing benchmarks.[^1]
The team began with six existing multi-modal benchmarks selected to span both natural images and scientific or technical knowledge.[^1]
| Source benchmark | Focus | Citation in MMStar paper |
|---|---|---|
| MMBench | General multi-modal capability, 20 dimensions | Liu et al., 2023 |
| SEED-Bench | 12 evaluation dimensions, image and video | Li et al., 2023 |
| MMMU | College-level multi-discipline, 30 subjects | Yue et al., 2023 |
| ScienceQA | Science domain QA with diagrams | Lu et al., 2022 |
| AI2D | Grade-school science diagrams | Kembhavi et al., 2016 |
| MathVista (Mini) | Mathematical visual reasoning | Lu et al., 2023 |
The initial union of these sources contained 22,401 candidate samples.[^1]
The first stage used eight LLM "inspectors" to identify samples that could be solved without vision. The panel consisted of two closed-source models (GPT-4 Turbo and Gemini Pro) and six open-source models with at least 20 billion parameters (InternLM2, Yi-34B, Mixtral-8x7B, DeepSeek-67B, LLaMA-70B, and Qwen1.5-72B).[^1] Each inspector attempted to answer every candidate sample in a 2-shot in-context configuration without access to the image; the 2-shot setting was chosen because it reduced answer-format refusals compared with 0-shot evaluation.[^1] A sample was kept only if at most two of the eight inspectors answered it correctly, meaning that at least six of the eight (75%) language-only inspectors failed on it. This filter reduced the candidate pool from 22,401 to 11,607 samples.[^1]
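The hit-count rule described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the field names (`answer`, `inspector_answers`) and the toy data are assumptions made for the example; only the threshold of two hits comes from the text.

```python
from typing import Dict, List

# Threshold stated in the text: keep a sample if at most two inspectors are correct.
MAX_TEXT_ONLY_HITS = 2


def text_only_hit_count(gold: str, inspector_answers: List[str]) -> int:
    """Count how many text-only inspectors chose the gold option."""
    return sum(1 for answer in inspector_answers if answer == gold)


def filter_vision_dependent(samples: List[Dict]) -> List[Dict]:
    """Keep samples that at most MAX_TEXT_ONLY_HITS inspectors solved without the image."""
    return [
        s for s in samples
        if text_only_hit_count(s["answer"], s["inspector_answers"]) <= MAX_TEXT_ONLY_HITS
    ]


# Toy data (hypothetical): eight inspector answers per candidate sample.
candidates = [
    {"id": "a", "answer": "B", "inspector_answers": ["B", "B", "B", "B", "B", "B", "A", "C"]},
    {"id": "b", "answer": "C", "inspector_answers": ["A", "C", "B", "D", "A", "C", "B", "D"]},
    {"id": "c", "answer": "A", "inspector_answers": ["B", "C", "D", "B", "C", "D", "B", "C"]},
]
kept_ids = [s["id"] for s in filter_vision_dependent(candidates)]
```

Sample `a` is dropped because six inspectors answer it without the image; `b` (two hits) and `c` (zero hits) survive.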
The same 11,607 candidates were then put through 16 leading vision-language models (with images this time) to estimate difficulty. The hit count per sample was recorded and used to split samples into four difficulty bins: easy (12-16 LVLMs correct), moderate (8-11), hard (4-7), and tough (0-3).[^1] This stratification was important because the authors wanted MMStar to span the full difficulty range rather than concentrating only on items that all strong models solve or none solve.
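The four bins can be expressed as a small lookup on the per-sample hit count. The sketch below uses the bin boundaries given in the text; the function name and the toy hit counts are hypothetical.

```python
from collections import Counter

N_LVLMS = 16  # number of vision-language models used for difficulty estimation


def difficulty_bin(lvlm_hits: int) -> str:
    """Map the number of LVLMs that answered correctly (0-16) to a difficulty bin."""
    if not 0 <= lvlm_hits <= N_LVLMS:
        raise ValueError(f"hit count must be between 0 and {N_LVLMS}, got {lvlm_hits}")
    if lvlm_hits >= 12:
        return "easy"      # 12-16 models correct
    if lvlm_hits >= 8:
        return "moderate"  # 8-11
    if lvlm_hits >= 4:
        return "hard"      # 4-7
    return "tough"         # 0-3


# Toy hit counts (hypothetical) for six candidate samples.
toy_hits = [16, 12, 9, 7, 3, 0]
bin_counts = Counter(difficulty_bin(h) for h in toy_hits)
```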
Three human experts then conducted a manual pass to enforce three criteria: each retained sample's answer must genuinely require visual content; the final set must cover a comprehensive range of capability axes; and most samples must require advanced multi-modal capabilities for resolution.[^1] After balancing across both capability dimensions and difficulty levels, the experts retained 1,500 high-quality samples, 250 per core capability.[^1]
MMStar organises its 1,500 samples in a two-level taxonomy: six core capabilities, each subdivided into three detailed axes for a total of 18 axes.[^1]
| Core capability (250 samples each) | Detailed axes |
|---|---|
| Coarse Perception (CP) | Image style & quality; image scene & topic; image emotion |
| Fine-grained Perception (FP) | Attribute & celebrity recognition; object location; object counting |
| Instance Reasoning (IR) | Single-instance attribute reasoning; cross-instance attribute comparison; cross-instance relation reasoning |
| Logical Reasoning (LR) | Diagram reasoning; code & sequence reasoning; common reasoning |
| Science & Technology (ST) | Natural science; engineering; geography & earth science |
| Mathematics (MA) | Numeric commonsense & calculation; geometry; statistical analysis |
Each core capability contains exactly 250 samples, and the paper notes that distribution across the 18 detailed axes is "relatively even" rather than mathematically equal, with three experts balancing both axis coverage and difficulty stratification.[^1] All items are single-image, multiple-choice questions with four options, and the dataset distribution on Hugging Face records 1,500 rows in Parquet format, each containing index, question (with options A/B/C/D embedded), image bytes, gold answer, category, l2_category (the detailed axis), and meta_info pointing back to the source benchmark.[^4]
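To make the row layout concrete, the following sketch models one row with the fields listed above and checks the 250-per-capability balance. It runs on synthetic rows, not the real dataset: the category and axis names are placeholders, and the type of `meta_info` (a dictionary) is an assumption, since the text only says it points back to the source benchmark.

```python
from collections import Counter
from typing import Any, Dict, List, TypedDict


class MMStarRow(TypedDict):
    index: int
    question: str              # question text with options A/B/C/D embedded
    image: bytes               # raw image bytes
    answer: str                # gold option letter
    category: str              # core capability
    l2_category: str           # detailed axis
    meta_info: Dict[str, Any]  # assumed to be a mapping; points back to the source benchmark


def check_balance(rows: List[MMStarRow], expected_per_core: int = 250) -> Dict[str, Any]:
    """Count rows per core capability and per detailed axis, and flag uneven cores."""
    core_counts = Counter(r["category"] for r in rows)
    axis_counts = Counter((r["category"], r["l2_category"]) for r in rows)
    unbalanced = {c: n for c, n in core_counts.items() if n != expected_per_core}
    return {"core_counts": core_counts, "axis_counts": axis_counts, "unbalanced": unbalanced}


# Synthetic stand-in for the 1,500 rows: 6 placeholder capabilities x 250 rows,
# spread over 3 placeholder axes each (84/83/83, i.e. "relatively even").
toy_rows: List[MMStarRow] = [
    {
        "index": core * 250 + i,
        "question": "placeholder question",
        "image": b"",
        "answer": "A",
        "category": f"core_{core}",
        "l2_category": f"axis_{i % 3}",
        "meta_info": {},
    }
    for core in range(6)
    for i in range(250)
]
report = check_balance(toy_rows)
```

An empty `report["unbalanced"]` indicates that every core capability holds exactly the expected number of rows.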
A central methodological contribution of the paper is a pair of benchmark-specific metrics meant to be reported alongside raw accuracy.[^1]
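For a given LVLM, the authors define three quantities:[^1]
- $S_v$: the LVLM's score with visual inputs;
- $S_{wv}$: the same LVLM's score without visual inputs (the "LVLM-text" setting);
- $S_t$: the score of the LVLM's underlying LLM backbone, evaluated without visual inputs.

The two metrics are then defined as:[^1]

$$MG = S_v - S_{wv}, \qquad ML = \max(0,\ S_{wv} - S_t)$$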
A model with high MG and low ML is one whose multi-modal training is contributing genuine visual reasoning rather than test-set memorisation. Because MG and ML are computed in a benchmark-specific manner, the paper recommends reporting them on MMStar itself and on each external benchmark a model claims a score on, in order to expose contamination patterns that headline accuracy hides.[^1] The official evaluation protocol requires submitters to upload three result files (LVLM with images, LVLM without images, base LLM without images) so that all three quantities can be computed.[^2][^5]
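A minimal sketch of the two metrics, computed from the three scores named in the protocol (LVLM with images, LVLM without images, base LLM without images). It follows the paper's definitions as commonly stated, MG as the with-image score minus the no-image score and ML as the no-image score minus the base-LLM score, floored at zero; the function names and the example scores are hypothetical.

```python
def multimodal_gain(s_v: float, s_wv: float) -> float:
    """MG: LVLM score with images minus the same LVLM's score without images."""
    return s_v - s_wv


def multimodal_leakage(s_wv: float, s_t: float) -> float:
    """ML: how far the image-free LVLM exceeds its base LLM, floored at zero."""
    return max(0.0, s_wv - s_t)


# Hypothetical scores in percentage points, not taken from the paper.
mg = multimodal_gain(s_v=50.0, s_wv=20.0)          # 30.0
ml = multimodal_leakage(s_wv=20.0, s_t=15.0)       # 5.0
ml_clean = multimodal_leakage(s_wv=20.0, s_t=25.0)  # 0.0: no advantage over the backbone
```

The floor at zero is what lets a model report ML = 0 when its image-free score does not exceed that of its LLM backbone.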
MMStar uses a 0-shot multiple-choice protocol in which the model is given the question, four options labelled A through D, and the image, and is expected to output a single letter.[^1][^2] The benchmark is offline (all 1,500 samples are released; there is no held-out private split), which makes reproducibility easy but also means an LVLM trainer who wishes to game the score can in principle include MMStar samples in training data; the MG/ML metrics partly defend against this by surfacing test leakage relative to the model's own text-only baseline.[^1]
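The protocol can be illustrated with a prompt builder and a simple answer parser. Both are hypothetical sketches: the instruction sentence, the separate `options` argument (the released rows embed options in the question text), and the regex-based letter extraction are assumptions for illustration, not the reference implementation.

```python
import re
from typing import Optional, Sequence

LETTERS = "ABCD"

# Matches a standalone capital A, B, C, or D (e.g. "B", "(C)", "The answer is D.").
_LETTER_RE = re.compile(r"\b([ABCD])\b")


def build_prompt(question: str, options: Sequence[str]) -> str:
    """Format a 0-shot multiple-choice prompt; the image is passed to the model separately."""
    if len(options) != len(LETTERS):
        raise ValueError("expected exactly four options")
    lines = [question]
    lines += [f"{letter}. {option}" for letter, option in zip(LETTERS, options)]
    # Hypothetical instruction wording.
    lines.append("Answer with the option's letter from the given choices directly.")
    return "\n".join(lines)


def extract_choice(response: str) -> Optional[str]:
    """Return the first standalone option letter in the response, or None if absent."""
    match = _LETTER_RE.search(response.strip())
    return match.group(1) if match else None


prompt = build_prompt(
    "Which shape is shown in the image?",
    ["Circle", "Square", "Triangle", "Hexagon"],
)
```

A parser this simple will misread responses that begin with the English article "A", which is one reason evaluation toolkits typically use stricter matching or a fallback judge.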
The reference implementation in the MMStar GitHub repository computes accuracy per detailed axis and per core capability, then aggregates to an overall accuracy.[^2] Integration into the OpenCompass-maintained VLMEvalKit means that MMStar can be evaluated as one of more than 80 benchmarks in a unified one-command pipeline, alongside MMBench, SEED-Bench, MMMU, AI2D, MathVista, OCRBench, HallusionBench, and many others.[^5]
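The aggregation described above (per detailed axis, per core capability, then overall) might look like the sketch below. It is not the repository's code: the record fields `prediction`, `answer`, `category`, and `l2_category` are assumed names, and the overall figure is computed as a plain per-sample average, which equals the mean of the six core accuracies only because each core holds the same number of samples.

```python
from collections import defaultdict
from typing import Dict, List


def _percent(correct: int, total: int) -> float:
    return 100.0 * correct / total


def accuracy_report(records: List[Dict[str, str]]) -> Dict[str, object]:
    """Compute accuracy (in percent) per detailed axis, per core capability, and overall."""
    per_axis = defaultdict(lambda: [0, 0])  # (category, l2_category) -> [correct, total]
    per_core = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    overall = [0, 0]
    for r in records:
        hit = int(r["prediction"] == r["answer"])
        buckets = (per_axis[(r["category"], r["l2_category"])], per_core[r["category"]], overall)
        for bucket in buckets:
            bucket[0] += hit
            bucket[1] += 1
    return {
        "per_axis": {k: _percent(*v) for k, v in per_axis.items()},
        "per_core": {k: _percent(*v) for k, v in per_core.items()},
        "overall": _percent(*overall) if overall[1] else 0.0,
    }


# Toy predictions (hypothetical).
toy_records = [
    {"category": "coarse perception", "l2_category": "image emotion", "prediction": "A", "answer": "A"},
    {"category": "coarse perception", "l2_category": "image emotion", "prediction": "B", "answer": "A"},
    {"category": "mathematics", "l2_category": "geometry", "prediction": "C", "answer": "C"},
    {"category": "mathematics", "l2_category": "geometry", "prediction": "D", "answer": "D"},
]
scores = accuracy_report(toy_records)
```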
The MMStar paper evaluated 16 leading LVLMs on the 1,500-sample benchmark and reported accuracy together with MG and ML.[^1] Selected results, as reported in the v2 paper and the project leaderboard, are summarised below.
| Model | Avg. accuracy | MG | ML |
|---|---|---|---|
| GPT-4V (high resolution) | 57.1% | 43.6 | 1.3 |
| InternLM-XComposer2 | 55.4% | 28.1 | 7.5 |
| LLaVA-Next-34B | 52.1% | 29.4 | 2.4 |
| GPT-4V (low resolution) | 46.1% | 32.6 | 1.3 |
| InternVL-Chat-v1.2 (40B) | 43.7% | 32.6 | 0.0 |
| Gemini Pro Vision | 42.6% | 27.4 | 0.0 |
| Sphinx-X-MoE (57B) | 38.9% | 14.8 | 1.0 |
| Yi-VL-34B | 36.1% | 18.8 | 0.0 |
| LLaVA-1.5-7B | 30.3% | 10.7 | 0.0 |
| Random baseline | 24.6% | n/a | n/a |
Two patterns stand out. First, even the strongest model in the original release (GPT-4V at high resolution) scored 57.1%, well below the 60% mark that the authors describe as a "passing average score" for the four hardest capability dimensions (fine-grained perception, logical reasoning, science & technology, and mathematics).[^1] Second, MG and ML separate models that look similar on accuracy. GPT-4V combined the highest MG (43.6) with one of the smallest MLs (1.3), suggesting both effective multi-modal training and limited leakage. InternLM-XComposer2 achieved high accuracy (55.4%) but with the largest leakage among the listed models (ML = 7.5), indicating that a meaningful portion of its raw score came from non-visual cues. Several open models report ML = 0 because their LVLM-text setting did not outperform the underlying LLM backbone, which the paper interprets as evidence of clean evaluation rather than evidence of high capability.[^1]
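As a worked check on how the columns relate, the table can be inverted to recover the image-free scores it implies. The sketch assumes MG is the with-image accuracy minus the image-free accuracy and that ML, when positive, is the image-free accuracy minus the base-LLM accuracy. The derived values are arithmetic consequences of the table, not figures quoted from the paper.

```python
from typing import Optional, Tuple

# (accuracy, MG, ML) copied from the table above.
table_rows = {
    "GPT-4V (high resolution)": (57.1, 43.6, 1.3),
    "GPT-4V (low resolution)": (46.1, 32.6, 1.3),
    "InternLM-XComposer2": (55.4, 28.1, 7.5),
}


def implied_scores(acc: float, mg: float, ml: float) -> Tuple[float, Optional[float]]:
    """Back out the image-free LVLM score and, when ML > 0, the base-LLM score."""
    s_wv = round(acc - mg, 1)
    # With ML = 0 the base-LLM score is only bounded below by s_wv, so it is left unknown.
    s_t = round(s_wv - ml, 1) if ml > 0 else None
    return s_wv, s_t


implied = {name: implied_scores(*row) for name, row in table_rows.items()}
```

Both GPT-4V rows imply the same image-free score (13.5), as expected given that image resolution cannot matter when no image is supplied. InternLM-XComposer2's implied image-free score (27.3) is roughly double that, consistent with its larger ML.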
The accompanying analysis on six external benchmarks (MMMU, MMBench-EN-Dev, ScienceQA-Test, AI2D-Test, SEED-Image, MathVista-Mini) showed that models such as Sphinx-X-MoE (ML up to 14% on average) and Monkey-Chat (also around 14%) had substantial multi-modal training leakage, while GPT-4V and Gemini Pro Vision had relatively low ML on those external benchmarks.[^1]
Because MMStar is small (1,500 samples), inexpensive to run, and explicitly filtered for visual dependency, it has become a near-default entry in modern VLM evaluation tables. Reports from major model families published between 2024 and 2026 include MMStar scores alongside the older benchmarks the MMStar authors criticised.
Because of these adoption patterns, MMStar functions less as a single ranking and more as an "honest" companion to older benchmarks: reports that omit MMStar are increasingly conspicuous, and reports that present a strong MMStar number with a modest ML are taken as evidence of genuine visual reasoning rather than text-side priors.
MMStar's significance lies less in being the largest or most expensive multi-modal evaluation and more in formalising, quantifying, and operationalising the critique that vision-language model benchmarks can be solved without vision.[^1] By tying the critique to two concrete diagnostic metrics, MG and ML, the paper changed how subsequent VLM technical reports present their numbers, encouraging the now-common practice of also reporting LVLM-text and LLM-only baselines.[^1][^6] The downstream effect has been broader scrutiny of training data hygiene for VLMs, with several follow-up benchmarks (such as MMT-Bench and MM-Vet v2) explicitly designing for visual indispensability from the start rather than relying on a post hoc filter.[^9]
The construction methodology (an LLM inspector panel, an LVLM difficulty stratification, and a human expert review) has itself influenced other curated benchmarks built on top of existing data, particularly those targeted at exposing shortcuts in vision-language model evaluation.[^1]
Several limitations of MMStar are noted by the authors themselves or have been raised in subsequent literature.
These caveats motivate continued reporting of MMStar alongside, not in place of, other targeted benchmarks.
MMStar sits in a cluster of multi-modal benchmarks released between 2023 and 2024 and is best understood as a "purified subset" sibling rather than a replacement for any one of them.
| Benchmark | Year | Approx. size | Primary lens | Relation to MMStar |
|---|---|---|---|---|
| MMMU | 2023 | ~11,500 college-level items | Multi-discipline expert knowledge | Source benchmark; MMMU samples appear in MMStar after filtering |
| MMBench | 2023 | ~3,200 items in EN-Dev | 20-dimension general capability | Source benchmark; natural-image subset feeds MMStar CP/FP |
| SEED-Bench | 2023 | ~19,000 (incl. video) | 12 evaluation dimensions | Source benchmark; image split feeds MMStar |
| ScienceQA | 2022 | ~21,000 K-12 science items | Diagram + science QA | Source benchmark; high LLM hit rate motivated MMStar |
| AI2D | 2016 | ~4,500 diagram items | Grade-school diagram QA | Source benchmark; high abnormal hit rate (46.2%) |
| MathVista | 2023 | ~6,000 (Mini: 1,000) | Mathematical visual reasoning | Source benchmark; provides math axis content |
| MMStar | 2024 | 1,500 | Visual-indispensable, leakage-aware | Distilled meta-benchmark with MG/ML metrics |
Among the source benchmarks, MMMU and MathVista remain widely cited as standalone tests, while MMStar functions as a leakage-aware overlay that asks whether an apparent strength on those tests is grounded in vision. The MMMU-Pro benchmark, released later, addressed some of the same concerns by reformatting MMMU into a more vision-dependent variant, complementing rather than superseding MMStar.