MMMU-Pro

AI Benchmarks Computer Vision Large Language Models Multimodal AI

18 min read

Updated Jun 24, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 24, 2026

Fact-checked

In review queue

Sources

8 citations

Revision

v3 · 3,589 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

MMMU-Pro is a rigorous benchmark for evaluating multimodal AI systems on college-level, expert questions that genuinely require seeing an image, built as a harder and more robust version of the original MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning) benchmark. Introduced in September 2024 by researchers including Xiang Yue, Tianyu Zheng, Yuansheng Ni, and Graham Neubig, MMMU-Pro strengthens MMMU through a three-step process: filtering out questions that text-only models can answer, expanding the number of answer choices from four to ten, and adding a vision-only input setting in which the question is embedded directly inside an image.^[1] When first released, models scored between 16.8% and 26.9% lower on MMMU-Pro than on the original MMMU, revealing that the earlier benchmark significantly overestimated true multimodal reasoning ability.^[1] The dataset contains 3,460 questions (1,730 standard and 1,730 vision-only) spanning six disciplines, 30 subjects, and 183 subfields, and the paper was published at ACL 2025 in Vienna, Austria.^[1]^[6]

According to the paper, the goal is a benchmark that "can more accurately and rigorously assess a model's true multimodal understanding and reasoning capabilities," testing the "fundamental human cognitive skill of seamlessly integrating visual and textual information."^[1]

Background and Motivation

The original MMMU benchmark, presented at CVPR 2024, consists of about 11,500 college-level multimodal questions drawn from exams, quizzes, and textbooks across six academic disciplines, 30 subjects, and 183 subfields, with 32 highly heterogeneous image types.^[2] It was created to evaluate whether multimodal models could handle expert-level reasoning that requires both visual perception and domain-specific knowledge. However, as frontier models such as GPT-4o and Claude began reaching scores above 60% on the MMMU validation set, the benchmark team identified several structural issues that allowed models to achieve inflated scores.^[1]

The most significant problem was that many questions in MMMU could be answered correctly without ever looking at the accompanying image. Text-only large language models were able to exploit patterns in the question text and answer choices, relying on pre-existing world knowledge or statistical shortcuts rather than genuine visual reasoning. A second issue was the limited number of answer options. With only four choices per question, random guessing alone yields a 25% baseline, and models could further narrow down answers through process-of-elimination heuristics. Together, these issues meant that high MMMU scores did not necessarily reflect true multimodal understanding.^[1]

MMMU-Pro was conceived to close these gaps and produce a benchmark where strong performance genuinely requires integrating visual and textual information.

How is MMMU-Pro constructed?

MMMU-Pro is built from the original MMMU question pool through a three-step pipeline: text-only filtering, answer option augmentation, and vision-only input creation.^[1] Each step systematically increases the difficulty and robustness of the benchmark.

Step 1: Filtering Text-Only Answerable Questions

The first step identifies and removes questions that can be solved without visual input. The researchers selected four strong open-source LLMs for this filtering process:^[1]

Llama 3-70B-Instruct
Qwen 2-72B-Instruct
Yi-1.5-34B-Chat
Mixtral-8x22B-Instruct

Each model was presented with every MMMU question but without access to the accompanying images. The models were required to provide an answer even when they indicated that visual input was necessary. Each question was tested ten times per model to account for stochastic variation. A question was classified as "answerable" by a given model if it was answered correctly more than five times out of ten attempts. A question was excluded from MMMU-Pro if at least three of the four models met this answerability threshold.^[1]

This aggressive filtering targeted two specific failure modes of the original MMMU:

Failure Mode	Description
Text-Only Dependency	Some questions were essentially independent of their images, allowing correct answers based purely on the question text
Shortcut Exploitation	Even when questions required images for humans to answer, models could exploit correlations among the answer choices or leverage pre-existing knowledge to guess correctly

After filtering, the researchers randomly sampled 1,800 questions from the remaining pool, evenly distributing them across all 30 subjects (60 questions per subject).^[1]

Step 2: Augmenting Candidate Options

The second step increases the number of answer choices from four to ten. This change dramatically reduces the effectiveness of guessing. With four options, random selection yields a 25% baseline accuracy; with ten options, that baseline drops to 10%.^[1]

The augmentation process used a multi-stage approach combining AI generation with human review:^[1]

Automated generation: GPT-4o generated additional plausible answer options for each question.
Automated screening: Claude 3.5 Sonnet filtered the generated options, removing those that were contextually irrelevant or logically inconsistent.
First human review: Individual expert reviewers assessed the expanded option sets for each question, checking that options were diverse, logically distinct, and free from ambiguity. Reviewers corrected issues or created new options as needed.
Second human review: Two additional human experts cross-validated every question and its options, eliminating any residual inconsistencies or errors.

This step also served as a quality check on the questions themselves. During the review process, experts identified and removed 70 questions that lacked a clear connection to their images or contained coherence issues, bringing the total from 1,800 to 1,730 questions.^[1]

Step 3: Vision-Only Input Setting

The third step introduces a fundamentally different way of presenting questions to models. Instead of providing the question text and image as separate inputs, the entire question (including text, answer options, and any associated images) is embedded within a single screenshot or photograph. As the paper describes it, "the model is presented with a question embedded within a screenshot or photo, without any text explicitly fed into the model."^[1] This forces models to "see" and "read" simultaneously, mirroring how users often capture screenshots of questions in real-world scenarios rather than typing out the text separately.

Human annotators manually captured these images over a simulated display environment. The process involved varying backgrounds, font styles, and font sizes to replicate the diversity of real-world conditions. By using different combinations of these visual elements, the researchers created a broad range of visual contexts that challenge models not only with the content of the questions but also with variability in presentation.^[1]

This step produced a parallel set of 1,730 vision-only questions, bringing the total MMMU-Pro dataset to 3,460 items: 1,730 in standard format and 1,730 in vision-only format.^[1]

Dataset Overview

MMMU-Pro retains the same disciplinary structure as the original MMMU, covering six core academic disciplines, 30 subjects, and 183 subfields.^[1]

Property	Value
Total questions	3,460 (1,730 standard + 1,730 vision-only)
Subjects	30
Disciplines	6
Subfields	183
Questions per subject	Approximately 57-58 (60 pre-filtering, 70 removed in Step 2)
Answer options	10 (increased from 4 in original MMMU)
Image types	30+ heterogeneous types (charts, diagrams, maps, tables, chemical structures, music sheets, etc.)

Academic Disciplines

The six core disciplines covered by MMMU-Pro are:^[1]

Discipline	Example Subjects
Art and Design	Art history, design theory, visual arts
Business	Accounting, economics, finance, marketing
Science	Biology, chemistry, physics, mathematics
Health and Medicine	Clinical medicine, anatomy, pharmacy
Humanities and Social Science	History, psychology, sociology, philosophy
Tech and Engineering	Computer science, electrical engineering, mechanical engineering

Baseline Performance

To establish reference points, the researchers measured performance under naive guessing strategies:^[1]

Strategy	4 Options	10 Options
Random choice	24.9%	12.8%
Frequent choice	27.8%	12.1%

The substantial drop in guessing baselines from the 4-option to 10-option setting confirms that the option augmentation step effectively reduces the influence of chance on benchmark scores.

Model Performance (Original Paper)

The initial evaluation in the MMMU-Pro paper tested a range of proprietary and open-source models. Performance was measured across three settings: standard with 4 options (matching the original MMMU format), standard with 10 options (the full MMMU-Pro format), and vision-only with 10 options.^[1]

Proprietary Models

Model	Standard (4 Options)	Standard (10 Options)	Vision-Only	MMMU Val	Drop from MMMU
GPT-4o (0513)	64.7%	54.0%	49.7%	69.1%	15.1% to 19.4%
Claude 3.5 Sonnet	63.7%	55.0%	48.0%	68.3%	13.3% to 20.3%
Gemini 1.5 Pro (0801)	60.6%	49.4%	44.4%	65.8%	16.4% to 21.4%
Gemini 1.5 Pro (0523)	57.6%	46.5%	40.5%	62.2%	15.7% to 21.7%
GPT-4o mini	55.3%	39.9%	35.2%	59.4%	19.5% to 24.2%

Open-Source Models

Model	Standard (4 Options)	Standard (10 Options)	Vision-Only	MMMU Val
Qwen 2-VL-72B	59.3%	49.2%	43.3%	64.5%
InternVL2-Llama 3-76B	55.0%	41.9%	38.0%	58.3%
LLaVA-OneVision-72B	52.3%	38.0%	24.0%	56.8%
InternVL2-40B	47.4%	36.3%	32.1%	55.2%
LLaVA-NEXT-72B	N/A	31.0%	19.2%	N/A
VILA-1.5-40B	N/A	35.9%	14.1%	N/A
Phi-3.5-Vision	N/A	26.3%	13.1%	N/A
LLaVA-NEXT-7B	N/A	19.4%	14.6%	N/A

Several patterns emerge from these results. First, every model experienced a significant performance drop when moving from MMMU to MMMU-Pro, confirming that the original benchmark overestimated multimodal capabilities. Second, the vision-only setting consistently proved harder than the standard 10-option setting, indicating that models struggle when they cannot receive question text as a separate input. Third, the gap between proprietary and open-source models widened on MMMU-Pro, suggesting that the harder benchmark better differentiates model capabilities.^[1]

Key Findings and Analysis

Impact of Option Augmentation

Increasing the number of answer choices from four to ten had a pronounced effect on accuracy. For example, GPT-4o experienced a drop of 10.7 percentage points (from 64.7% to 54.0%) when moving from 4 to 10 options in the standard setting. This confirms that a meaningful portion of performance on the original MMMU came from effective elimination of implausible options rather than genuine understanding.^[1]

Does OCR ability predict multimodal reasoning?

The researchers tested whether providing explicit OCR (optical character recognition) prompts would help models perform better on the vision-only setting. The results showed that OCR prompts had minimal impact on performance:^[1]

Model	OCR Accuracy	Vision with OCR	Vision without OCR
GPT-4o	92.3%	49.7%	49.4%
Gemini 1.5 Pro (0801)	89.7%	44.4%	43.6%
GPT-4o mini	89.6%	35.2%	35.6%
InternVL2-Llama 3-76B	88.1%	38.0%	37.9%
LLaVA-OneVision-72B	87.8%	24.0%	23.8%

A critical insight from this analysis is that high OCR accuracy does not translate into strong multimodal reasoning. LLaVA-OneVision-72B matched InternVL2-Llama3-76B and GPT-4o mini in OCR accuracy but lagged substantially behind both in MMMU-Pro Vision performance. According to the paper, "text recognition and OCR do not prove to be the primary bottleneck," and the main challenges lie "in the integration and interpretation of visual and textual information."^[1] This indicates that the bottleneck for multimodal understanding lies in reasoning and integration of information, not in the ability to extract text from images.

Chain-of-Thought Prompting Effects

Chain-of-thought (CoT) prompting, which encourages models to show their reasoning step by step, generally improved performance on MMMU-Pro. However, the benefit varied dramatically across disciplines and models.^[1]

For GPT-4o, CoT prompting showed clear benefits in reasoning-heavy domains:

Discipline	CoT Impact (GPT-4o)	CoT Impact (LLaVA-OneVision-72B)
Art and Design	+1.58%	-17.12%
Business	+14.66%	+4.76%
Science	+8.22%	+1.28%
Health and Medicine	+5.34%	-1.56%
Humanities and Social Science	+2.21%	-4.46%
Tech and Engineering	+14.49%	+2.33%

GPT-4o benefited substantially from CoT across nearly all disciplines, with the largest gains in Business (+14.66%) and Tech and Engineering (+14.49%). In contrast, LLaVA-OneVision-72B experienced negative effects from CoT in Art and Design (-17.12%), Health and Medicine (-1.56%), and Humanities and Social Science (-4.46%). This suggests that weaker models may generate incorrect reasoning chains that lead them further from the correct answer, while stronger models can leverage structured reasoning to improve accuracy.^[1]

Error Analysis

The researchers analyzed 60 error cases from GPT-4o on MMMU-Pro and compared the error distribution to the original MMMU. Reasoning errors accounted for 46% of failures on MMMU-Pro, a significant increase from 26% on the original MMMU.^[1] This shift indicates that the benchmark successfully targets higher-order cognitive abilities. With text-only shortcuts removed and more answer options to evaluate, models must engage in genuine multimodal reasoning, and their failures more frequently stem from flawed reasoning rather than simple perception errors.

Response Length Patterns

An interesting behavioral pattern emerged in the vision-only setting. GPT-4o generated significantly shorter responses when processing vision-only inputs, and a larger proportion of its output tokens were spent on description rather than analysis. The researchers hypothesize that the increased cognitive load of processing visual inputs forces the model to allocate more capacity to visual processing, which detracts from generating extensive reasoning chains.^[1]

How well do humans perform on MMMU-Pro?

The paper also reports approximate human performance levels on MMMU-Pro, broken down by expertise level:^[1]

Expertise Level	Overall	Art and Design	Business	Science	Health and Medicine	Humanities	Tech and Engineering
Low	73.0%	77.4%	77.9%	78.5%	65.2%	63.6%	73.5%
Medium	80.8%	83.3%	88.4%	84.9%	72.8%	75.8%	78.2%
High	85.4%	85.7%	89.5%	86.0%	84.8%	81.8%	84.4%

Even high-expertise humans achieved only 85.4% overall accuracy, confirming that MMMU-Pro contains genuinely challenging questions. Health and Medicine proved the most difficult discipline for lower-expertise humans (65.2%), while Business was the easiest across all expertise levels. The gap between the best-performing AI model at the time of publication (Claude 3.5 Sonnet at 55.0% in standard 10-option mode) and high-expertise humans (85.4%) was over 30 percentage points, indicating substantial room for improvement.^[1]

Vision Encoder Analysis

The paper includes an analysis of how different vision transformer encoders affect MMMU-Pro performance:^[1]

Vision Encoder	MMMU Val	MMMU-Pro Vision
DINOv2 ViT-G-14	37.1%	17.4%
SigLIP ViT-SO400M-14	37.9%	16.7%

Both encoders performed similarly on the original MMMU validation set but diverged slightly on MMMU-Pro Vision, with DINOv2 showing a marginal advantage. The overall low scores in both cases underscore that vision encoders alone, without strong language reasoning capabilities, are insufficient for the complex tasks in MMMU-Pro.

Updated Leaderboard (2025-2026)

Since the original paper's publication, newer and more capable models have been evaluated on MMMU-Pro. As of late May 2026, the leaderboard reflects dramatic progress, with frontier models now approaching and in some cases exceeding high-expertise human performance (85.4%):^[4]^[5]^[7]

Rank	Model	Developer	MMMU-Pro Score
1	GPT-5.4 Pro	OpenAI	94.0%
2	Claude Mythos Preview	Anthropic	92.7%
3	Gemini 3.1 Pro	Google	83.9%
4	Gemini 3.5 Flash	Google	83.6%
5	GPT-5.5	OpenAI	81.2%
6	GPT-5.4	OpenAI	81.2%
7	Gemini 3 Pro	Google	81.0%
8	Muse Spark	Meta	80.4%
9	GPT-5.2	OpenAI	79.5%
10	Kimi K2.6	Moonshot AI	79.4%
11	Qwen 3.5-397B	Alibaba	79.0%
12	Qwen 3.6 Plus	Alibaba	78.8%
13	Kimi K2.5	Moonshot AI	78.5%
14	Grok 4.3	xAI	78.1%
15	Claude Opus 4.6	Anthropic	77.3%

These results show dramatic improvements over the original paper's findings. The best frontier models now exceed human-expert-level performance (85.4%), with GPT-5.4 Pro at 94% and Claude Mythos Preview at 92.7% both surpassing that threshold. The benchmark therefore remains useful as a discriminator among frontier models, but the human performance ceiling has been crossed for the top tier.

Additional Notable Scores

Model	Developer	MMMU-Pro Score
Qwen 3 VL 235B A22B (Thinking)	Alibaba	69.3%
Qwen 3 VL 32B (Thinking)	Alibaba	68.1%
GPT-5.4 nano	OpenAI	66.1%
GPT-4o	OpenAI	59.9%
Llama 4 Maverick	Meta	59.6%
Mistral Small 4	Mistral AI	60.0%
Qwen 2.5 VL 72B Instruct	Alibaba	51.1%
Phi-4-multimodal-instruct	Microsoft	38.5%
Llama 3.2 90B Instruct	Meta	45.2%
Llama 3.2 11B Instruct	Meta	33.0%

How does MMMU-Pro differ from MMMU?

The following table summarizes the key differences between the original MMMU and MMMU-Pro:^[1]^[2]

Feature	MMMU	MMMU-Pro
Total questions	11,500	3,460
Answer options	4	10
Random guess baseline	~25%	~10%
Text-only filtering	No	Yes (questions answerable without images removed)
Vision-only setting	No	Yes (questions embedded in screenshots/photos)
Input format	Text + image (separate)	Standard (text + image) and vision-only
Publication	CVPR 2024	ACL 2025
Human quality control	Original annotation	Additional two rounds of expert review
Purpose	Broad multimodal evaluation	Robust multimodal reasoning assessment

The most important conceptual shift is that MMMU-Pro specifically targets genuine multimodal reasoning. While the original MMMU tested a mixture of visual and non-visual knowledge, MMMU-Pro ensures that every question fundamentally requires visual understanding. This makes it a more reliable indicator of how well a model can truly integrate information across modalities.

Impact and Significance

MMMU-Pro has had a significant influence on how the AI research community evaluates multimodal models. Several key contributions stand out.

First, the benchmark exposed a widespread overestimation of model capabilities. The 16.8% to 26.9% performance drops demonstrated that many models were relying on textual shortcuts rather than visual understanding.^[1] This finding prompted researchers and developers to reconsider how they measure progress in multimodal AI.

Second, the vision-only input setting introduced a more realistic evaluation paradigm. In practice, users frequently share screenshots or photographs of content rather than carefully separating text and images. By testing models on this type of input, MMMU-Pro measures performance in scenarios that better reflect actual usage patterns.

Third, the finding that OCR accuracy does not predict multimodal reasoning performance highlighted an important research direction. Simply improving a model's ability to extract text from images is insufficient; the real challenge lies in understanding how textual and visual information relate to each other and reasoning about their combined meaning.^[1]

Fourth, the nuanced analysis of chain-of-thought prompting across disciplines provided practical guidance for deploying multimodal models. The finding that CoT can sometimes hurt performance in certain domains (particularly for weaker models) has implications for how prompting strategies should be tuned based on model capability and task domain.^[1]

Is MMMU-Pro publicly available?

Yes. MMMU-Pro is publicly available on Hugging Face at the MMMU/MMMU_Pro repository.^[3] The dataset includes both the standard and vision-only question formats. Evaluation code is maintained on GitHub in the MMMU-Benchmark/MMMU repository.

Authors and Affiliations

The MMMU-Pro paper was authored by Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig.^[1] The work was primarily conducted while Xiang Yue was a postdoctoral researcher at Carnegie Mellon University, working with Graham Neubig. Other contributors were affiliated with institutions including the University of Waterloo, The Ohio State University, and others. The paper was published in the Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15134 to 15186, in Vienna, Austria, in July 2025.^[6]

References

Yue, X., Zheng, T., Ni, Y., Wang, Y., Zhang, K., Tong, S., Sun, Y., Yu, B., Zhang, G., Sun, H., Su, Y., Chen, W., & Neubig, G. (2025). MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark. *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, 15134-15186. Vienna, Austria. arXiv:2409.02813. https://arxiv.org/abs/2409.02813 ↩
Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., Wei, C., Yu, B., Yuan, R., Sun, R., Yin, M., Zheng, B., Yang, Z., Liu, Y., Huang, W., Sun, H., Su, Y., & Chen, W. (2024). MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 9556-9567. ↩
MMMU-Pro Dataset. Hugging Face. https://huggingface.co/datasets/MMMU/MMMU_Pro ↩
MMMU-Pro Benchmark Leaderboard. Artificial Analysis. https://artificialanalysis.ai/evaluations/mmmu-pro ↩
MMMU-Pro Leaderboard. LLM Stats. https://llm-stats.com/benchmarks/mmmu-pro ↩
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark. ACL Anthology. https://aclanthology.org/2025.acl-long.736/ ↩
MMMU-Pro Benchmark 2026: 27 LLM scores. BenchLM.ai. https://benchlm.ai/benchmarks/mmmuPro (accessed May 2026). ↩
MMMU Benchmark. https://mmmu-benchmark.github.io/

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

2 revisions by 1 contributors · full history

Suggest edit

What links here

MMMU MMStar Muse Spark Qwen2.5-VL Skywork-R1V Video-MMMU Vision language model

Background and Motivation

How is MMMU-Pro constructed?

Step 1: Filtering Text-Only Answerable Questions

Step 2: Augmenting Candidate Options

Step 3: Vision-Only Input Setting

Dataset Overview

Academic Disciplines

Baseline Performance

Model Performance (Original Paper)

Proprietary Models

Open-Source Models

Key Findings and Analysis

Impact of Option Augmentation

Does OCR ability predict multimodal reasoning?

Chain-of-Thought Prompting Effects

Error Analysis

Response Length Patterns

How well do humans perform on MMMU-Pro?

Vision Encoder Analysis

Updated Leaderboard (2025-2026)

Additional Notable Scores

How does MMMU-Pro differ from MMMU?

Impact and Significance

Is MMMU-Pro publicly available?

Authors and Affiliations

See Also

References

Improve this article

Related Articles

Fox (benchmark)

Visual Question Answering Models

CLIP Score

EgoSchema

ZeroBench

Video-MME

What links here

Related Articles

Fox (benchmark)

Visual Question Answering Models

CLIP Score

EgoSchema

ZeroBench

Video-MME

What links here