MMMU-Pro is a rigorous benchmark for evaluating multimodal AI systems, designed as a harder and more robust version of the original MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning) benchmark. Introduced in September 2024 by researchers including Xiang Yue, Tianyu Zheng, Yuansheng Ni, and Graham Neubig, MMMU-Pro addresses key weaknesses in the original MMMU by filtering out questions that text-only models can answer, expanding the number of answer choices from four to ten, and introducing a vision-only input setting where questions are embedded directly into images. The benchmark was published at ACL 2025 (the 63rd Annual Meeting of the Association for Computational Linguistics) in Vienna, Austria. When first released, models scored between 16.8% and 26.9% lower on MMMU-Pro compared to the original MMMU, demonstrating that the original benchmark significantly overestimated multimodal reasoning ability.
The original MMMU benchmark, presented at CVPR 2024, consists of 11,500 college-level multimodal questions drawn from exams, quizzes, and textbooks across six academic disciplines and 30 subjects. It was created to evaluate whether multimodal models could handle expert-level reasoning that requires both visual perception and domain-specific knowledge. However, as frontier models such as GPT-4o and Claude began reaching scores above 60% on the MMMU validation set, the benchmark team identified several structural issues that allowed models to achieve inflated scores.
The most significant problem was that many questions in MMMU could be answered correctly without ever looking at the accompanying image. Text-only large language models were able to exploit patterns in the question text and answer choices, relying on pre-existing world knowledge or statistical shortcuts rather than genuine visual reasoning. A second issue was the limited number of answer options. With only four choices per question, random guessing alone yields a 25% baseline, and models could further narrow down answers through process-of-elimination heuristics. Together, these issues meant that high MMMU scores did not necessarily reflect true multimodal understanding.
MMMU-Pro was conceived to close these gaps and produce a benchmark where strong performance genuinely requires integrating visual and textual information.
MMMU-Pro is built from the original MMMU question pool through a three-step pipeline: text-only filtering, answer option augmentation, and vision-only input creation. Each step systematically increases the difficulty and robustness of the benchmark.
The first step identifies and removes questions that can be solved without visual input. The researchers selected four strong open-source LLMs to perform this filtering.
Each model was presented with every MMMU question but without access to the accompanying images. The models were required to provide an answer even when they indicated that visual input was necessary. Each question was tested ten times per model to account for stochastic variation. A question was classified as "answerable" by a given model if it was answered correctly more than five times out of ten attempts. A question was excluded from MMMU-Pro if at least three of the four models met this answerability threshold.
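The filtering rule described above can be sketched in a few lines of Python. The function names and structure are illustrative, not the paper's actual harness; only the thresholds (more than 5 of 10 attempts correct, at least 3 of 4 models) come from the description above.

```python
def is_answerable(correct_count, attempts=10):
    """A model 'answers' a question if it is correct more than half the attempts."""
    return correct_count > attempts // 2

def should_exclude(per_model_correct, attempts=10, model_threshold=3):
    """Exclude a question from MMMU-Pro if at least `model_threshold`
    of the text-only models meet the answerability bar."""
    answerable = sum(1 for c in per_model_correct if is_answerable(c, attempts))
    return answerable >= model_threshold

# Three of four models were correct 7, 6, and 8 times out of 10:
should_exclude([7, 6, 8, 2])  # True: question is excluded
# Only one model cleared the bar (5/10 is not more than 5):
should_exclude([7, 5, 4, 2])  # False: question is kept
```

Note that exactly 5 correct answers out of 10 does not count as answerable, since the criterion is strictly more than five.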
This aggressive filtering targeted two specific failure modes of the original MMMU:
| Failure Mode | Description |
|---|---|
| Text-Only Dependency | Some questions were essentially independent of their images, allowing correct answers based purely on the question text |
| Shortcut Exploitation | Even when questions required images for humans to answer, models could exploit correlations among the answer choices or leverage pre-existing knowledge to guess correctly |
After filtering, the researchers randomly sampled 1,800 questions from the remaining pool, evenly distributing them across all 30 subjects (60 questions per subject).
The second step increases the number of answer choices from four to ten. This change dramatically reduces the effectiveness of guessing. With four options, random selection yields a 25% baseline accuracy; with ten options, that baseline drops to 10%.
The augmentation process combined AI-generated candidate distractors with multiple rounds of human expert review.
This step also served as a quality check on the questions themselves. During the review process, experts identified and removed 70 questions that lacked a clear connection to their images or contained coherence issues, bringing the total from 1,800 to 1,730 questions.
The third step introduces a fundamentally different way of presenting questions to models. Instead of providing the question text and image as separate inputs, the entire question (including text, answer options, and any associated images) is embedded within a single screenshot or photograph. This forces models to "see" and "read" simultaneously, mirroring how users often capture screenshots of questions in real-world scenarios rather than typing out the text separately.
Human annotators manually captured these images from questions displayed in a simulated environment. The process varied backgrounds, font styles, and font sizes to replicate the diversity of real-world conditions. By combining these visual elements in different ways, the researchers created a broad range of visual contexts that challenge models not only with the content of the questions but also with variability in presentation.
This step produced a parallel set of 1,730 vision-only questions, bringing the total MMMU-Pro dataset to 3,460 items: 1,730 in standard format and 1,730 in vision-only format.
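The combinatorial variation of presentation attributes can be illustrated with a small sketch. The specific backgrounds, fonts, and sizes below are placeholders, not the paper's actual settings; the point is that even a few values per attribute multiply into many distinct visual contexts for the same question.

```python
import itertools

# Illustrative presentation attributes (placeholder values, not from the paper)
BACKGROUNDS = ["plain white", "notebook paper", "desk photo"]
FONT_STYLES = ["serif", "sans-serif", "monospace"]
FONT_SIZES = [12, 16, 20]

# Every (background, font, size) combination yields a distinct visual
# context in which the same question can be rendered.
variants = list(itertools.product(BACKGROUNDS, FONT_STYLES, FONT_SIZES))
len(variants)  # 27 distinct presentation combinations from 3 values per attribute
```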
MMMU-Pro retains the same disciplinary structure as the original MMMU, covering six core academic disciplines, 30 subjects, and 183 subfields.
| Property | Value |
|---|---|
| Total questions | 3,460 (1,730 standard + 1,730 vision-only) |
| Subjects | 30 |
| Disciplines | 6 |
| Subfields | 183 |
| Questions per subject | ~57-58 on average (60 per subject before 70 questions were removed in Step 2) |
| Answer options | 10 (increased from 4 in original MMMU) |
| Image types | 30+ heterogeneous types (charts, diagrams, maps, tables, chemical structures, music sheets, etc.) |
The six core disciplines covered by MMMU-Pro are:
| Discipline | Example Subjects |
|---|---|
| Art and Design | Art history, design theory, visual arts |
| Business | Accounting, economics, finance, marketing |
| Science | Biology, chemistry, physics, mathematics |
| Health and Medicine | Clinical medicine, anatomy, pharmacy |
| Humanities and Social Science | History, psychology, sociology, philosophy |
| Tech and Engineering | Computer science, electrical engineering, mechanical engineering |
To establish reference points, the researchers measured performance under naive guessing strategies:
| Strategy | 4 Options | 10 Options |
|---|---|---|
| Random choice | 24.9% | 12.8% |
| Frequent choice | 27.8% | 12.1% |
The substantial drop in guessing baselines from the 4-option to 10-option setting confirms that the option augmentation step effectively reduces the influence of chance on benchmark scores.
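Both baselines are straightforward to compute from a set of answer keys. The sketch below (function names are illustrative) shows why the frequent-choice baseline can exceed the random one: real answer-key distributions are rarely perfectly uniform, so always guessing the most common key pays off slightly.

```python
from collections import Counter

def guessing_baselines(answer_keys, n_options):
    """Expected accuracy of two naive strategies over a question set.

    Random choice: 1/n_options, regardless of the key distribution.
    Frequent choice: always guess the single most common answer key.
    """
    random_acc = 1 / n_options
    most_common_count = Counter(answer_keys).most_common(1)[0][1]
    frequent_acc = most_common_count / len(answer_keys)
    return random_acc, frequent_acc

# Toy key distribution, mildly skewed toward option 0:
keys = [0] * 30 + [1] * 25 + [2] * 25 + [3] * 20
guessing_baselines(keys, 4)  # (0.25, 0.3)
```

With ten options the random baseline drops to 0.1, matching the roughly 12-13% figures in the table once the non-uniform key distribution is accounted for.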
The initial evaluation in the MMMU-Pro paper tested a range of proprietary and open-source models. Performance was measured across three settings: standard with 4 options (matching the original MMMU format), standard with 10 options (the full MMMU-Pro format), and vision-only with 10 options.
| Model | Standard (4 Options) | Standard (10 Options) | Vision-Only | MMMU Val | Drop from MMMU |
|---|---|---|---|---|---|
| GPT-4o (0513) | 64.7% | 54.0% | 49.7% | 69.1% | 15.1% to 19.4% |
| Claude 3.5 Sonnet | 63.7% | 55.0% | 48.0% | 68.3% | 13.3% to 20.3% |
| Gemini 1.5 Pro (0801) | 60.6% | 49.4% | 44.4% | 65.8% | 16.4% to 21.4% |
| Gemini 1.5 Pro (0523) | 57.6% | 46.5% | 40.5% | 62.2% | 15.7% to 21.7% |
| GPT-4o mini | 55.3% | 39.9% | 35.2% | 59.4% | 19.5% to 24.2% |
| Model | Standard (4 Options) | Standard (10 Options) | Vision-Only | MMMU Val |
|---|---|---|---|---|
| Qwen 2-VL-72B | 59.3% | 49.2% | 43.3% | 64.5% |
| InternVL2-Llama 3-76B | 55.0% | 41.9% | 38.0% | 58.3% |
| LLaVA-OneVision-72B | 52.3% | 38.0% | 24.0% | 56.8% |
| InternVL2-40B | 47.4% | 36.3% | 32.1% | 55.2% |
| LLaVA-NEXT-72B | N/A | 31.0% | 19.2% | N/A |
| VILA-1.5-40B | N/A | 35.9% | 14.1% | N/A |
| Phi-3.5-Vision | N/A | 26.3% | 13.1% | N/A |
| LLaVA-NEXT-7B | N/A | 19.4% | 14.6% | N/A |
Several patterns emerge from these results. First, every model experienced a significant performance drop when moving from MMMU to MMMU-Pro, confirming that the original benchmark overestimated multimodal capabilities. Second, the vision-only setting consistently proved harder than the standard 10-option setting, indicating that models struggle when they cannot receive question text as a separate input. Third, the gap between proprietary and open-source models widened on MMMU-Pro, suggesting that the harder benchmark better differentiates model capabilities.
Increasing the number of answer choices from four to ten had a pronounced effect on accuracy. For example, GPT-4o experienced a drop of 10.7 percentage points (from 64.7% to 54.0%) when moving from 4 to 10 options in the standard setting. This confirms that a meaningful portion of performance on the original MMMU came from effective elimination of implausible options rather than genuine understanding.
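The arithmetic behind this effect is simple: a model that can eliminate a fixed number of implausible options gains far less from that skill when more distractors remain. A minimal sketch:

```python
def guess_accuracy(n_options, n_eliminated):
    """Chance of a correct guess after eliminating some implausible options,
    assuming the correct answer is never among those eliminated."""
    remaining = n_options - n_eliminated
    return 1 / remaining

guess_accuracy(4, 2)   # 0.5   -- eliminating two of four options
guess_accuracy(10, 2)  # 0.125 -- the same elimination skill helps far less
```

Eliminating two options lifts a pure guesser from 25% to 50% in the 4-option setting, but only from 10% to 12.5% with ten options, which is why option augmentation blunts elimination heuristics.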
The researchers tested whether providing explicit OCR (optical character recognition) prompts would help models perform better on the vision-only setting. The results showed that OCR prompts had minimal impact on performance:
| Model | OCR Accuracy | Vision with OCR | Vision without OCR |
|---|---|---|---|
| GPT-4o | 92.3% | 49.7% | 49.4% |
| Gemini 1.5 Pro (0801) | 89.7% | 44.4% | 43.6% |
| GPT-4o mini | 89.6% | 35.2% | 35.6% |
| InternVL2-Llama 3-76B | 88.1% | 38.0% | 37.9% |
| LLaVA-OneVision-72B | 87.8% | 24.0% | 23.8% |
A critical insight from this analysis is that high OCR accuracy does not translate into strong multimodal reasoning. LLaVA-OneVision-72B matched InternVL2-Llama 3-76B and GPT-4o mini in OCR accuracy but lagged substantially behind both in MMMU-Pro Vision performance. This indicates that the bottleneck for multimodal understanding lies in reasoning and integration of information, not in the ability to extract text from images.
Chain-of-thought (CoT) prompting, which encourages models to show their reasoning step by step, generally improved performance on MMMU-Pro. However, the benefit varied dramatically across disciplines and models.
For GPT-4o, CoT prompting showed clear benefits in reasoning-heavy domains:
| Discipline | CoT Impact (GPT-4o) | CoT Impact (LLaVA-OneVision-72B) |
|---|---|---|
| Art and Design | +1.58% | -17.12% |
| Business | +14.66% | +4.76% |
| Science | +8.22% | +1.28% |
| Health and Medicine | +5.34% | -1.56% |
| Humanities and Social Science | +2.21% | -4.46% |
| Tech and Engineering | +14.49% | +2.33% |
GPT-4o benefited substantially from CoT across nearly all disciplines, with the largest gains in Business (+14.66%) and Tech and Engineering (+14.49%). In contrast, LLaVA-OneVision-72B experienced negative effects from CoT in Art and Design (-17.12%), Health and Medicine (-1.56%), and Humanities and Social Science (-4.46%). This suggests that weaker models may generate incorrect reasoning chains that lead them further from the correct answer, while stronger models can leverage structured reasoning to improve accuracy.
The researchers analyzed 60 error cases from GPT-4o on MMMU-Pro and compared the error distribution to the original MMMU. Reasoning errors accounted for 46% of failures on MMMU-Pro, a significant increase from 26% on the original MMMU. This shift indicates that the benchmark successfully targets higher-order cognitive abilities. With text-only shortcuts removed and more answer options to evaluate, models must engage in genuine multimodal reasoning, and their failures more frequently stem from flawed reasoning rather than simple perception errors.
An interesting behavioral pattern emerged in the vision-only setting. GPT-4o generated significantly shorter responses when processing vision-only inputs, and a larger proportion of its output tokens were spent on description rather than analysis. The researchers hypothesize that the increased cognitive load of processing visual inputs forces the model to allocate more capacity to visual processing, which detracts from generating extensive reasoning chains.
The paper also reports approximate human performance levels on MMMU-Pro, broken down by expertise level:
| Expertise Level | Overall | Art and Design | Business | Science | Health and Medicine | Humanities | Tech and Engineering |
|---|---|---|---|---|---|---|---|
| Low | 73.0% | 77.4% | 77.9% | 78.5% | 65.2% | 63.6% | 73.5% |
| Medium | 80.8% | 83.3% | 88.4% | 84.9% | 72.8% | 75.8% | 78.2% |
| High | 85.4% | 85.7% | 89.5% | 86.0% | 84.8% | 81.8% | 84.4% |
Even high-expertise humans achieved only 85.4% overall accuracy, confirming that MMMU-Pro contains genuinely challenging questions. Health and Medicine proved the most difficult discipline for lower-expertise humans (65.2%), while Business was the easiest across all expertise levels. The gap between the best-performing AI model at the time of publication (Claude 3.5 Sonnet at 55.0% in standard 10-option mode) and high-expertise humans (85.4%) was over 30 percentage points, indicating substantial room for improvement.
The paper includes an analysis of how different vision transformer encoders affect MMMU-Pro performance:
| Vision Encoder | MMMU Val | MMMU-Pro Vision |
|---|---|---|
| DINOv2 ViT-G-14 | 37.1% | 17.4% |
| SigLIP ViT-SO400M-14 | 37.9% | 16.7% |
Both encoders performed similarly on the original MMMU validation set but diverged slightly on MMMU-Pro Vision, with DINOv2 showing a marginal advantage. The overall low scores in both cases underscore that vision encoders alone, without strong language reasoning capabilities, are insufficient for the complex tasks in MMMU-Pro.
Since the original paper's publication, newer and more capable models have been evaluated on MMMU-Pro. As of early 2026, the leaderboard reflects significant progress:
| Rank | Model | Developer | MMMU-Pro Score |
|---|---|---|---|
| 1 | GPT-5.4 | OpenAI | 81.2% |
| 2 | Gemini 3 Flash | Google | 81.2% |
| 3 | Gemini 3 Pro | Google | 81.0% |
| 4 | Gemini 3.1 Pro | Google | 80.5% |
| 5 | GPT-5.2 | OpenAI | 79.5% |
| 6 | Kimi K2.5 | Moonshot AI | 78.5% |
| 7 | GPT-5 | OpenAI | 78.4% |
| 8 | Claude Opus 4.6 | Anthropic | 77.3% |
| 9 | Qwen 3.5-122B-A10B | Alibaba | 76.9% |
| 10 | Gemini 3.1 Flash-Lite | Google | 76.8% |
| 11 | GPT-5.4 mini | OpenAI | 76.6% |
| 12 | o3 | OpenAI | 76.4% |
| 13 | Claude Sonnet 4.6 | Anthropic | 75.6% |
| 14 | Qwen 3.5-35B-A3B | Alibaba | 75.1% |
| 15 | Qwen 3.5-27B | Alibaba | 75.0% |
These results show dramatic improvements over the original paper's findings. The top models now approach human-expert-level performance (85.4%), with the best systems scoring above 81%. The gap between proprietary and open-source models has also narrowed, with Qwen 3.5 variants achieving competitive scores in the 75-77% range.
Additional models with lower scores have also been evaluated:

| Model | Developer | MMMU-Pro Score |
|---|---|---|
| Qwen 3 VL 235B A22B (Thinking) | Alibaba | 69.3% |
| Qwen 3 VL 32B (Thinking) | Alibaba | 68.1% |
| GPT-5.4 nano | OpenAI | 66.1% |
| Mistral Small 4 | Mistral AI | 60.0% |
| GPT-4o | OpenAI | 59.9% |
| Llama 4 Maverick | Meta | 59.6% |
| Qwen 2.5 VL 72B Instruct | Alibaba | 51.1% |
| Llama 3.2 90B Instruct | Meta | 45.2% |
| Phi-4-multimodal-instruct | Microsoft | 38.5% |
| Llama 3.2 11B Instruct | Meta | 33.0% |
The following table summarizes the key differences between the original MMMU and MMMU-Pro:
| Feature | MMMU | MMMU-Pro |
|---|---|---|
| Total questions | 11,500 | 3,460 |
| Answer options | 4 | 10 |
| Random guess baseline | ~25% | ~10% |
| Text-only filtering | No | Yes (questions answerable without images removed) |
| Vision-only setting | No | Yes (questions embedded in screenshots/photos) |
| Input format | Text + image (separate) | Standard (text + image) and vision-only |
| Publication | CVPR 2024 | ACL 2025 |
| Human quality control | Original annotation | Additional two rounds of expert review |
| Purpose | Broad multimodal evaluation | Robust multimodal reasoning assessment |
The most important conceptual shift is that MMMU-Pro specifically targets genuine multimodal reasoning. While the original MMMU tested a mixture of visual and non-visual knowledge, MMMU-Pro ensures that every question fundamentally requires visual understanding. This makes it a more reliable indicator of how well a model can truly integrate information across modalities.
MMMU-Pro has had a significant influence on how the AI research community evaluates multimodal models. Several key contributions stand out.
First, the benchmark exposed a widespread overestimation of model capabilities. The 16.8% to 26.9% performance drops demonstrated that many models were relying on textual shortcuts rather than visual understanding. This finding prompted researchers and developers to reconsider how they measure progress in multimodal AI.
Second, the vision-only input setting introduced a more realistic evaluation paradigm. In practice, users frequently share screenshots or photographs of content rather than carefully separating text and images. By testing models on this type of input, MMMU-Pro measures performance in scenarios that better reflect actual usage patterns.
Third, the finding that OCR accuracy does not predict multimodal reasoning performance highlighted an important research direction. Simply improving a model's ability to extract text from images is insufficient; the real challenge lies in understanding how textual and visual information relate to each other and reasoning about their combined meaning.
Fourth, the nuanced analysis of chain-of-thought prompting across disciplines provided practical guidance for deploying multimodal models. The finding that CoT can sometimes hurt performance in certain domains (particularly for weaker models) has implications for how prompting strategies should be tuned based on model capability and task domain.
MMMU-Pro is publicly available on Hugging Face at the MMMU/MMMU_Pro repository. The dataset includes both the standard and vision-only question formats. Evaluation code is maintained on GitHub in the MMMU-Benchmark/MMMU repository.
The MMMU-Pro paper was authored by Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. The work was primarily conducted while Xiang Yue was a postdoctoral researcher at Carnegie Mellon University, working with Graham Neubig. Other contributors were affiliated with institutions including the University of Waterloo, The Ohio State University, and others. The paper was published in the Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15134 to 15186, in Vienna, Austria, in July 2025.