# MMMU-Pro

> Source: https://aiwiki.ai/wiki/mmmu-pro
> Updated: 2026-06-24
> Categories: AI Benchmarks, Computer Vision, Large Language Models, Multimodal AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**MMMU-Pro** is a rigorous [benchmark](/wiki/benchmark) for evaluating [multimodal AI](/wiki/multimodal_ai) systems on college-level, expert questions that genuinely require seeing an image, built as a harder and more robust version of the original [MMMU](/wiki/mmmu) (Massive Multi-discipline Multimodal Understanding and Reasoning) benchmark. Introduced in September 2024 by researchers including Xiang Yue, Tianyu Zheng, Yuansheng Ni, and Graham Neubig, MMMU-Pro strengthens MMMU through a three-step process: filtering out questions that text-only models can answer, expanding the number of answer choices from four to ten, and adding a vision-only input setting in which the question is embedded directly inside an image.[1] When first released, models scored between 16.8% and 26.9% lower on MMMU-Pro than on the original MMMU, revealing that the earlier benchmark significantly overestimated true multimodal reasoning ability.[1] The dataset contains 3,460 questions (1,730 standard and 1,730 vision-only) spanning six disciplines, 30 subjects, and 183 subfields, and the paper was published at ACL 2025 in Vienna, Austria.[1][6]

According to the paper, the goal is a benchmark that "can more accurately and rigorously assess a model's true multimodal understanding and reasoning capabilities," testing the "fundamental human cognitive skill of seamlessly integrating visual and textual information."[1]

## Background and Motivation

The original [MMMU](/wiki/mmmu) benchmark, presented at CVPR 2024, consists of about 11,500 college-level multimodal questions drawn from exams, quizzes, and textbooks across six academic disciplines, 30 subjects, and 183 subfields, with 32 highly heterogeneous image types.[2] It was created to evaluate whether multimodal models could handle expert-level reasoning that requires both visual perception and domain-specific knowledge. However, as frontier models such as [GPT-4o](/wiki/gpt_4o) and [Claude](/wiki/claude) began reaching scores above 60% on the MMMU validation set, the benchmark team identified several structural issues that allowed models to achieve inflated scores.[1]

The most significant problem was that many questions in MMMU could be answered correctly without ever looking at the accompanying image. Text-only [large language models](/wiki/large_language_model) were able to exploit patterns in the question text and answer choices, relying on pre-existing world knowledge or statistical shortcuts rather than genuine visual reasoning. A second issue was the limited number of answer options. With only four choices per question, random guessing alone yields a 25% baseline, and models could further narrow down answers through process-of-elimination heuristics. Together, these issues meant that high MMMU scores did not necessarily reflect true multimodal understanding.[1]

MMMU-Pro was conceived to close these gaps and produce a benchmark where strong performance genuinely requires integrating visual and textual information.

## How is MMMU-Pro constructed?

MMMU-Pro is built from the original MMMU question pool through a three-step pipeline: text-only filtering, answer option augmentation, and vision-only input creation.[1] Each step systematically increases the difficulty and robustness of the benchmark.

### Step 1: Filtering Text-Only Answerable Questions

The first step identifies and removes questions that can be solved without visual input. The researchers selected four strong open-source [LLMs](/wiki/large_language_model) for this filtering process:[1]

- [Llama](/wiki/llama) 3-70B-Instruct
- [Qwen](/wiki/qwen) 2-72B-Instruct
- Yi-1.5-34B-Chat
- [Mixtral](/wiki/mixtral)-8x22B-Instruct

Each model was presented with every MMMU question but without access to the accompanying images. The models were required to provide an answer even when they indicated that visual input was necessary. Each question was tested ten times per model to account for stochastic variation. A question was classified as "answerable" by a given model if it was answered correctly more than five times out of ten attempts. A question was excluded from MMMU-Pro if at least three of the four models met this answerability threshold.[1]
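
The exclusion rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the model keys, and the per-attempt outcomes are all hypothetical, and only the thresholds (ten attempts, more than five correct, at least three of four models) come from the description above.

```python
from typing import Dict, List

CORRECT_THRESHOLD = 5  # "answerable" means strictly more than five correct out of ten
MODEL_THRESHOLD = 3    # exclude when at least three of the four models qualify


def is_answerable(correct_flags: List[bool]) -> bool:
    """True if one text-only model answered correctly more than five times."""
    return sum(correct_flags) > CORRECT_THRESHOLD


def should_exclude(results_by_model: Dict[str, List[bool]]) -> bool:
    """True if enough models can answer the question without the image."""
    answerable = sum(is_answerable(flags) for flags in results_by_model.values())
    return answerable >= MODEL_THRESHOLD


# Hypothetical outcomes for one question (True = correct attempt).
example = {
    "llama3-70b": [True] * 7 + [False] * 3,
    "qwen2-72b": [True] * 6 + [False] * 4,
    "yi-1.5-34b": [True] * 5 + [False] * 5,  # exactly five correct: not answerable
    "mixtral-8x22b": [True] * 9 + [False] * 1,
}
print(should_exclude(example))  # True: three of four models pass the threshold
```

Note that the threshold is strict: a model with exactly five correct attempts does not count toward exclusion.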

This aggressive filtering targeted two specific failure modes of the original MMMU:

| Failure Mode | Description |
|---|---|
| Text-Only Dependency | Some questions were essentially independent of their images, allowing correct answers based purely on the question text |
| Shortcut Exploitation | Even when questions required images for humans to answer, models could exploit correlations among the answer choices or leverage pre-existing knowledge to guess correctly |

After filtering, the researchers randomly sampled 1,800 questions from the remaining pool, evenly distributing them across all 30 subjects (60 questions per subject).[1]
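
A per-subject random draw of this kind might look like the sketch below. The record layout (`id`, `subject`), the seed, and the synthetic pool are assumptions made for illustration; the source specifies only the target of 60 questions for each of the 30 subjects.

```python
import random
from collections import defaultdict
from typing import Dict, List

PER_SUBJECT = 60  # 30 subjects x 60 questions = 1,800


def sample_per_subject(questions: List[Dict], per_subject: int = PER_SUBJECT,
                       seed: int = 0) -> List[Dict]:
    """Draw the same number of questions at random from every subject."""
    by_subject = defaultdict(list)
    for question in questions:
        by_subject[question["subject"]].append(question)

    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    sampled = []
    for subject in sorted(by_subject):
        pool = by_subject[subject]
        if len(pool) < per_subject:
            raise ValueError(f"{subject} has only {len(pool)} questions")
        sampled.extend(rng.sample(pool, per_subject))
    return sampled


# Synthetic pool: 30 made-up subjects with 100 surviving questions each.
pool = [{"id": f"s{s}-q{i}", "subject": f"subject_{s:02d}"}
        for s in range(30) for i in range(100)]
subset = sample_per_subject(pool)
print(len(subset))  # 1800
```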

### Step 2: Augmenting Candidate Options

The second step increases the number of answer choices from four to ten. This change dramatically reduces the effectiveness of guessing. With four options, random selection yields a 25% baseline accuracy; with ten options, that baseline drops to 10%.[1]

The augmentation process used a multi-stage approach combining AI generation with human review:[1]

1. **Automated generation**: [GPT-4o](/wiki/gpt_4o) generated additional plausible answer options for each question.
2. **Automated screening**: [Claude](/wiki/claude) 3.5 Sonnet filtered the generated options, removing those that were contextually irrelevant or logically inconsistent.
3. **First human review**: Individual expert reviewers assessed the expanded option sets for each question, checking that options were diverse, logically distinct, and free from ambiguity. Reviewers corrected issues or created new options as needed.
4. **Second human review**: Two additional human experts cross-validated every question and its options, eliminating any residual inconsistencies or errors.

This step also served as a quality check on the questions themselves. During the review process, experts identified and removed 70 questions that lacked a clear connection to their images or contained coherence issues, bringing the total from 1,800 to 1,730 questions.[1]

### Step 3: Vision-Only Input Setting

The third step introduces a fundamentally different way of presenting questions to models. Instead of providing the question text and image as separate inputs, the entire question (including text, answer options, and any associated images) is embedded within a single screenshot or photograph. As the paper describes it, "the model is presented with a question embedded within a screenshot or photo, without any text explicitly fed into the model."[1] This forces models to "see" and "read" simultaneously, mirroring how users often capture screenshots of questions in real-world scenarios rather than typing out the text separately.

Human annotators manually captured these images over a simulated display environment. The process involved varying backgrounds, font styles, and font sizes to replicate the diversity of real-world conditions. By using different combinations of these visual elements, the researchers created a broad range of visual contexts that challenge models not only with the content of the questions but also with variability in presentation.[1]

This step produced a parallel set of 1,730 vision-only questions, bringing the total MMMU-Pro dataset to 3,460 items: 1,730 in standard format and 1,730 in vision-only format.[1]
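
The question counts across the three steps follow from simple arithmetic, reproduced here as a quick check. The variable names are illustrative; the numbers are the ones stated above.

```python
SUBJECTS = 30
SAMPLED_PER_SUBJECT = 60
REMOVED_IN_REVIEW = 70

sampled = SUBJECTS * SAMPLED_PER_SUBJECT   # questions sampled after Step 1
standard = sampled - REMOVED_IN_REVIEW     # questions left after Step 2 review
total = 2 * standard                       # standard set plus vision-only set
avg_per_subject = standard / SUBJECTS      # average only; removals may be uneven

print(sampled, standard, total, round(avg_per_subject, 2))  # 1800 1730 3460 57.67
```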

## Dataset Overview

MMMU-Pro retains the same disciplinary structure as the original MMMU, covering six core academic disciplines, 30 subjects, and 183 subfields.[1]

| Property | Value |
|---|---|
| Total questions | 3,460 (1,730 standard + 1,730 vision-only) |
| Subjects | 30 |
| Disciplines | 6 |
| Subfields | 183 |
| Questions per subject | Approximately 57-58 on average (60 sampled per subject after filtering; 70 removed during Step 2 review) |
| Answer options | 10 (increased from 4 in original MMMU) |
| Image types | 30+ heterogeneous types (charts, diagrams, maps, tables, chemical structures, music sheets, etc.) |

### Academic Disciplines

The six core disciplines covered by MMMU-Pro are:[1]

| Discipline | Example Subjects |
|---|---|
| Art and Design | Art history, design theory, visual arts |
| Business | Accounting, economics, finance, marketing |
| Science | Biology, chemistry, physics, mathematics |
| Health and Medicine | Clinical medicine, anatomy, pharmacy |
| Humanities and Social Science | History, psychology, sociology, philosophy |
| Tech and Engineering | Computer science, electrical engineering, mechanical engineering |

## Baseline Performance

To establish reference points, the researchers measured performance under naive guessing strategies:[1]

| Strategy | 4 Options | 10 Options |
|---|---|---|
| Random choice | 24.9% | 12.8% |
| Frequent choice | 27.8% | 12.1% |

The substantial drop in guessing baselines from the 4-option to 10-option setting confirms that the option augmentation step effectively reduces the influence of chance on benchmark scores.
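
For intuition, the two naive strategies can be expressed analytically: uniform guessing succeeds with probability 1/k on a question with k options, and the frequent-choice strategy scores the share of the most common gold label. The sketch below uses made-up inputs and hypothetical function names; it does not attempt to reproduce the reported figures, which come from the paper's own runs.

```python
from collections import Counter
from fractions import Fraction
from typing import List, Tuple


def random_choice_accuracy(option_counts: List[int]) -> float:
    """Expected accuracy of uniform guessing: the mean of 1/k over questions."""
    return float(sum(Fraction(1, k) for k in option_counts) / len(option_counts))


def frequent_choice_accuracy(gold_labels: List[str]) -> Tuple[str, float]:
    """Accuracy of always answering with the most common gold label."""
    label, count = Counter(gold_labels).most_common(1)[0]
    return label, count / len(gold_labels)


print(random_choice_accuracy([4] * 100))   # 0.25
print(random_choice_accuracy([10] * 100))  # 0.1

# Made-up answer key, for illustration only.
key = list("ABCDBBADCB")
print(frequent_choice_accuracy(key))       # ('B', 0.4)
```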

## Model Performance (Original Paper)

The initial evaluation in the MMMU-Pro paper tested a range of proprietary and open-source models. Performance was measured across three settings: standard with 4 options (matching the original MMMU format), standard with 10 options (the full MMMU-Pro format), and vision-only with 10 options.[1]

### Proprietary Models

| Model | Standard (4 Options) | Standard (10 Options) | Vision-Only | MMMU Val | Drop from MMMU Val (points; standard 10 options to vision-only) |
|---|---|---|---|---|---|
| [GPT-4o](/wiki/gpt_4o) (0513) | 64.7% | 54.0% | 49.7% | 69.1% | 15.1 to 19.4 |
| [Claude](/wiki/claude) 3.5 Sonnet | 63.7% | 55.0% | 48.0% | 68.3% | 13.3 to 20.3 |
| [Gemini](/wiki/gemini) 1.5 Pro (0801) | 60.6% | 49.4% | 44.4% | 65.8% | 16.4 to 21.4 |
| [Gemini](/wiki/gemini) 1.5 Pro (0523) | 57.6% | 46.5% | 40.5% | 62.2% | 15.7 to 21.7 |
| GPT-4o mini | 55.3% | 39.9% | 35.2% | 59.4% | 19.5 to 24.2 |
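
The drop column can be recomputed from the other columns: the lower bound is MMMU Val minus the standard 10-option score, and the upper bound is MMMU Val minus the vision-only score. A short sketch, with values copied from the table above and an illustrative helper name:

```python
# (MMMU Val, standard 10 options, vision-only), copied from the table above.
scores = {
    "GPT-4o (0513)": (69.1, 54.0, 49.7),
    "Claude 3.5 Sonnet": (68.3, 55.0, 48.0),
    "Gemini 1.5 Pro (0801)": (65.8, 49.4, 44.4),
    "Gemini 1.5 Pro (0523)": (62.2, 46.5, 40.5),
    "GPT-4o mini": (59.4, 39.9, 35.2),
}


def drops(mmmu_val: float, standard10: float, vision: float):
    """Return the drop from MMMU Val to each MMMU-Pro setting, in points."""
    return round(mmmu_val - standard10, 1), round(mmmu_val - vision, 1)


for model, row in scores.items():
    std_drop, vis_drop = drops(*row)
    print(f"{model}: {std_drop} to {vis_drop} points")
```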

### Open-Source Models

| Model | Standard (4 Options) | Standard (10 Options) | Vision-Only | MMMU Val |
|---|---|---|---|---|
| [Qwen](/wiki/qwen) 2-VL-72B | 59.3% | 49.2% | 43.3% | 64.5% |
| InternVL2-[Llama](/wiki/llama) 3-76B | 55.0% | 41.9% | 38.0% | 58.3% |
| LLaVA-OneVision-72B | 52.3% | 38.0% | 24.0% | 56.8% |
| InternVL2-40B | 47.4% | 36.3% | 32.1% | 55.2% |
| LLaVA-NEXT-72B | N/A | 31.0% | 19.2% | N/A |
| VILA-1.5-40B | N/A | 35.9% | 14.1% | N/A |
| [Phi](/wiki/phi)-3.5-Vision | N/A | 26.3% | 13.1% | N/A |
| LLaVA-NEXT-7B | N/A | 19.4% | 14.6% | N/A |

Several patterns emerge from these results. First, every model experienced a significant performance drop when moving from MMMU to MMMU-Pro, confirming that the original benchmark overestimated multimodal capabilities. Second, the vision-only setting consistently proved harder than the standard 10-option setting, indicating that models struggle when they cannot receive question text as a separate input. Third, the gap between proprietary and open-source models widened on MMMU-Pro, suggesting that the harder benchmark better differentiates model capabilities.[1]

## Key Findings and Analysis

### Impact of Option Augmentation

Increasing the number of answer choices from four to ten had a pronounced effect on accuracy. For example, GPT-4o experienced a drop of 10.7 percentage points (from 64.7% to 54.0%) when moving from 4 to 10 options in the standard setting. This confirms that a meaningful portion of performance on the original MMMU came from effective elimination of implausible options rather than genuine understanding.[1]

### Does OCR ability predict multimodal reasoning?

The researchers tested whether providing explicit OCR (optical character recognition) prompts would help models perform better on the vision-only setting. The results showed that OCR prompts had minimal impact on performance:[1]

| Model | OCR Accuracy | Vision with OCR | Vision without OCR |
|---|---|---|---|
| [GPT-4o](/wiki/gpt_4o) | 92.3% | 49.7% | 49.4% |
| [Gemini](/wiki/gemini) 1.5 Pro (0801) | 89.7% | 44.4% | 43.6% |
| GPT-4o mini | 89.6% | 35.2% | 35.6% |
| InternVL2-[Llama](/wiki/llama) 3-76B | 88.1% | 38.0% | 37.9% |
| LLaVA-OneVision-72B | 87.8% | 24.0% | 23.8% |
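
To make "minimal impact" concrete, the per-model change attributable to the OCR prompt can be computed directly from the last two columns. The sketch below only restates the table's numbers; the variable names are illustrative.

```python
# (vision-only accuracy with OCR prompt, without OCR prompt), from the table above.
vision_scores = {
    "GPT-4o": (49.7, 49.4),
    "Gemini 1.5 Pro (0801)": (44.4, 43.6),
    "GPT-4o mini": (35.2, 35.6),
    "InternVL2-Llama3-76B": (38.0, 37.9),
    "LLaVA-OneVision-72B": (24.0, 23.8),
}

# Positive delta means the OCR prompt helped; negative means it hurt.
deltas = {model: round(with_ocr - without_ocr, 1)
          for model, (with_ocr, without_ocr) in vision_scores.items()}
largest = max(abs(delta) for delta in deltas.values())

for model, delta in deltas.items():
    print(f"{model}: {delta:+.1f} points")
print(f"largest absolute change: {largest} points")  # 0.8
```

No model moves by more than 0.8 points in either direction, and GPT-4o mini scores slightly lower with the OCR prompt.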

A critical insight from this analysis is that high OCR accuracy does not translate into strong multimodal reasoning. LLaVA-OneVision-72B achieved OCR accuracy (87.8%) close to that of InternVL2-Llama3-76B (88.1%) and GPT-4o mini (89.6%) but lagged substantially behind both on MMMU-Pro Vision (24.0% versus 38.0% and 35.2%). According to the paper, "text recognition and OCR do not prove to be the primary bottleneck," and the main challenges lie "in the integration and interpretation of visual and textual information."[1]

### Chain-of-Thought Prompting Effects

[Chain-of-thought](/wiki/chain_of_thought) (CoT) prompting, which encourages models to show their reasoning step by step, generally improved performance on MMMU-Pro. However, the benefit varied dramatically across disciplines and models.[1]

The per-discipline effect on GPT-4o and LLaVA-OneVision-72B illustrates the contrast:

| Discipline | CoT Impact (GPT-4o) | CoT Impact (LLaVA-OneVision-72B) |
|---|---|---|
| Art and Design | +1.58% | -17.12% |
| Business | +14.66% | +4.76% |
| Science | +8.22% | +1.28% |
| Health and Medicine | +5.34% | -1.56% |
| Humanities and Social Science | +2.21% | -4.46% |
| Tech and Engineering | +14.49% | +2.33% |

GPT-4o benefited from CoT in all six disciplines, with the largest gains in Business (+14.66%) and Tech and Engineering (+14.49%) and the smallest in Art and Design (+1.58%) and Humanities and Social Science (+2.21%). In contrast, LLaVA-OneVision-72B experienced negative effects from CoT in Art and Design (-17.12%), Health and Medicine (-1.56%), and Humanities and Social Science (-4.46%). This suggests that weaker models may generate incorrect reasoning chains that lead them further from the correct answer, while stronger models can leverage structured reasoning to improve accuracy.[1]
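
One way to act on these numbers is to enable CoT only where it was measured to help. The policy below is a hypothetical sketch, not something the paper proposes: `use_cot` and the lookup tables are illustrative names, and the impact values are copied from the table above.

```python
# CoT impact per discipline as (GPT-4o, LLaVA-OneVision-72B), from the table above.
COT_IMPACT = {
    "Art and Design": (1.58, -17.12),
    "Business": (14.66, 4.76),
    "Science": (8.22, 1.28),
    "Health and Medicine": (5.34, -1.56),
    "Humanities and Social Science": (2.21, -4.46),
    "Tech and Engineering": (14.49, 2.33),
}
MODEL_INDEX = {"GPT-4o": 0, "LLaVA-OneVision-72B": 1}


def use_cot(model: str, discipline: str) -> bool:
    """Hypothetical policy: enable CoT only where the measured impact is positive."""
    return COT_IMPACT[discipline][MODEL_INDEX[model]] > 0


for model in MODEL_INDEX:
    direct = [d for d in COT_IMPACT if not use_cot(model, d)]
    print(f"{model}: answer directly in {direct or 'no disciplines'}")
```

Under this policy GPT-4o would always use CoT, while LLaVA-OneVision-72B would answer directly in the three disciplines where CoT lowered its score.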

### Error Analysis

The researchers analyzed 60 error cases from GPT-4o on MMMU-Pro and compared the error distribution to the original MMMU. Reasoning errors accounted for 46% of failures on MMMU-Pro, a significant increase from 26% on the original MMMU.[1] This shift indicates that the benchmark successfully targets higher-order cognitive abilities. With text-only shortcuts removed and more answer options to evaluate, models must engage in genuine multimodal reasoning, and their failures more frequently stem from flawed reasoning rather than simple perception errors.

### Response Length Patterns

An interesting behavioral pattern emerged in the vision-only setting. GPT-4o generated significantly shorter responses when processing vision-only inputs, and a larger proportion of its output tokens were spent on description rather than analysis. The researchers hypothesize that the increased cognitive load of processing visual inputs forces the model to allocate more capacity to visual processing, which detracts from generating extensive reasoning chains.[1]

## How well do humans perform on MMMU-Pro?

The paper also reports approximate human performance levels on MMMU-Pro, broken down by expertise level:[1]

| Expertise Level | Overall | Art and Design | Business | Science | Health and Medicine | Humanities | Tech and Engineering |
|---|---|---|---|---|---|---|---|
| Low | 73.0% | 77.4% | 77.9% | 78.5% | 65.2% | 63.6% | 73.5% |
| Medium | 80.8% | 83.3% | 88.4% | 84.9% | 72.8% | 75.8% | 78.2% |
| High | 85.4% | 85.7% | 89.5% | 86.0% | 84.8% | 81.8% | 84.4% |

Even high-expertise humans achieved only 85.4% overall accuracy, confirming that MMMU-Pro contains genuinely challenging questions. For low-expertise humans, Humanities and Social Science (63.6%) and Health and Medicine (65.2%) were the two hardest disciplines. Business was the easiest discipline at the medium and high expertise levels (88.4% and 89.5%), while Science narrowly led at the low level (78.5% versus 77.9% for Business). The gap between the best-performing AI model at the time of publication (Claude 3.5 Sonnet at 55.0% in standard 10-option mode) and high-expertise humans (85.4%) was 30.4 percentage points, indicating substantial room for improvement.[1]
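
The hardest and easiest discipline at each expertise level, and the human-to-model gap, can be read off the table programmatically. The sketch below copies the table values; the variable names are illustrative.

```python
# Human accuracy by expertise level and discipline, copied from the table above.
HUMAN = {
    "Low": {"Art and Design": 77.4, "Business": 77.9, "Science": 78.5,
            "Health and Medicine": 65.2, "Humanities": 63.6,
            "Tech and Engineering": 73.5},
    "Medium": {"Art and Design": 83.3, "Business": 88.4, "Science": 84.9,
               "Health and Medicine": 72.8, "Humanities": 75.8,
               "Tech and Engineering": 78.2},
    "High": {"Art and Design": 85.7, "Business": 89.5, "Science": 86.0,
             "Health and Medicine": 84.8, "Humanities": 81.8,
             "Tech and Engineering": 84.4},
}
HIGH_OVERALL = 85.4
BEST_MODEL_STANDARD_10 = 55.0  # Claude 3.5 Sonnet in the original paper

# For each level: (hardest discipline, easiest discipline).
extremes = {level: (min(row, key=row.get), max(row, key=row.get))
            for level, row in HUMAN.items()}
gap = round(HIGH_OVERALL - BEST_MODEL_STANDARD_10, 1)

for level, (hardest, easiest) in extremes.items():
    print(f"{level}: hardest={hardest}, easiest={easiest}")
print(f"human-model gap: {gap} points")  # 30.4
```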

## Vision Encoder Analysis

The paper includes an analysis of how different [vision transformer](/wiki/vision_transformer) encoders affect MMMU-Pro performance:[1]

| Vision Encoder | MMMU Val | MMMU-Pro Vision |
|---|---|---|
| DINOv2 ViT-G-14 | 37.1% | 17.4% |
| SigLIP ViT-SO400M-14 | 37.9% | 16.7% |

The two encoders scored within one percentage point of each other on both benchmarks, though the ordering flipped: SigLIP led slightly on MMMU Val (37.9% versus 37.1%), while DINOv2 led slightly on MMMU-Pro Vision (17.4% versus 16.7%). The overall low scores in both cases underscore that vision encoders alone, without strong language reasoning capabilities, are insufficient for the complex tasks in MMMU-Pro.

## Updated Leaderboard (2025-2026)

Since the original paper's publication, newer and more capable models have been evaluated on MMMU-Pro. As of late May 2026, the leaderboard reflects dramatic progress, with frontier models now approaching and in some cases exceeding high-expertise human performance (85.4%):[4][5][7]

| Rank | Model | Developer | MMMU-Pro Score |
|---|---|---|---|
| 1 | GPT-5.4 Pro | [OpenAI](/wiki/openai) | 94.0% |
| 2 | Claude Mythos Preview | [Anthropic](/wiki/anthropic) | 92.7% |
| 3 | [Gemini](/wiki/gemini) 3.1 Pro | Google | 83.9% |
| 4 | [Gemini](/wiki/gemini) 3.5 Flash | Google | 83.6% |
| 5 | GPT-5.5 | [OpenAI](/wiki/openai) | 81.2% |
| 6 | GPT-5.4 | [OpenAI](/wiki/openai) | 81.2% |
| 7 | [Gemini](/wiki/gemini) 3 Pro | Google | 81.0% |
| 8 | Muse Spark | [Meta](/wiki/meta) | 80.4% |
| 9 | GPT-5.2 | [OpenAI](/wiki/openai) | 79.5% |
| 10 | Kimi K2.6 | [Moonshot AI](/wiki/moonshot_ai) | 79.4% |
| 11 | [Qwen](/wiki/qwen) 3.5-397B | Alibaba | 79.0% |
| 12 | [Qwen](/wiki/qwen) 3.6 Plus | Alibaba | 78.8% |
| 13 | Kimi K2.5 | [Moonshot AI](/wiki/moonshot_ai) | 78.5% |
| 14 | Grok 4.3 | xAI | 78.1% |
| 15 | [Claude](/wiki/claude) Opus 4.6 | [Anthropic](/wiki/anthropic) | 77.3% |

These results show dramatic improvements over the original paper's findings. GPT-5.4 Pro (94.0%) and Claude Mythos Preview (92.7%) both exceed the 85.4% recorded for high-expertise humans, while every other listed model remains below it. The benchmark still discriminates among frontier models, but the top tier has crossed the human-expert reference point.

### Additional Notable Scores

| Model | Developer | MMMU-Pro Score |
|---|---|---|
| [Qwen](/wiki/qwen) 3 VL 235B A22B (Thinking) | Alibaba | 69.3% |
| [Qwen](/wiki/qwen) 3 VL 32B (Thinking) | Alibaba | 68.1% |
| GPT-5.4 nano | [OpenAI](/wiki/openai) | 66.1% |
| Mistral Small 4 | Mistral AI | 60.0% |
| [GPT-4o](/wiki/gpt_4o) | [OpenAI](/wiki/openai) | 59.9% |
| [Llama](/wiki/llama) 4 Maverick | [Meta](/wiki/meta) | 59.6% |
| [Qwen](/wiki/qwen) 2.5 VL 72B Instruct | Alibaba | 51.1% |
| [Llama](/wiki/llama) 3.2 90B Instruct | [Meta](/wiki/meta) | 45.2% |
| [Phi](/wiki/phi)-4-multimodal-instruct | Microsoft | 38.5% |
| [Llama](/wiki/llama) 3.2 11B Instruct | [Meta](/wiki/meta) | 33.0% |

## How does MMMU-Pro differ from MMMU?

The following table summarizes the key differences between the original MMMU and MMMU-Pro:[1][2]

| Feature | MMMU | MMMU-Pro |
|---|---|---|
| Total questions | ~11,500 | 3,460 (1,730 standard + 1,730 vision-only) |
| Answer options | 4 | 10 |
| Random guess baseline | ~25% | ~10% |
| Text-only filtering | No | Yes (questions answerable without images removed) |
| Vision-only setting | No | Yes (questions embedded in screenshots/photos) |
| Input format | Text + image (separate) | Standard (text + image) and vision-only |
| Publication | CVPR 2024 | ACL 2025 |
| Human quality control | Original annotation | Additional two rounds of expert review |
| Purpose | Broad multimodal evaluation | Robust multimodal reasoning assessment |

The most important conceptual shift is that MMMU-Pro specifically targets genuine multimodal reasoning. While the original MMMU tested a mixture of visual and non-visual knowledge, MMMU-Pro removes questions that text-only models can reliably answer, so the remaining questions are far more likely to require visual understanding. This makes it a more reliable indicator of how well a model can truly integrate information across modalities.

## Impact and Significance

MMMU-Pro has had a significant influence on how the AI research community evaluates multimodal models. Several key contributions stand out.

First, the benchmark exposed a widespread overestimation of model capabilities. The 16.8% to 26.9% performance drops demonstrated that many models were relying on textual shortcuts rather than visual understanding.[1] This finding prompted researchers and developers to reconsider how they measure progress in multimodal AI.

Second, the vision-only input setting introduced a more realistic evaluation paradigm. In practice, users frequently share screenshots or photographs of content rather than carefully separating text and images. By testing models on this type of input, MMMU-Pro measures performance in scenarios that better reflect actual usage patterns.

Third, the finding that OCR accuracy does not predict multimodal reasoning performance highlighted an important research direction. Simply improving a model's ability to extract text from images is insufficient; the real challenge lies in understanding how textual and visual information relate to each other and reasoning about their combined meaning.[1]

Fourth, the nuanced analysis of [chain-of-thought](/wiki/chain_of_thought) prompting across disciplines provided practical guidance for deploying multimodal models. The finding that CoT can sometimes hurt performance in certain domains (particularly for weaker models) has implications for how prompting strategies should be tuned based on model capability and task domain.[1]

## Is MMMU-Pro publicly available?

Yes. MMMU-Pro is publicly available on Hugging Face at the MMMU/MMMU_Pro repository.[3] The dataset includes both the standard and vision-only question formats. Evaluation code is maintained on GitHub in the MMMU-Benchmark/MMMU repository.
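
A minimal loading sketch using the Hugging Face `datasets` library follows. The repository ID comes from the reference above; the configuration names (`"standard (10 options)"`, `"vision"`) and the `test` split are assumptions that should be checked against the dataset card, and `load_mmmu_pro` is a hypothetical helper name.

```python
# Assumed configuration names; verify them on the dataset card before use.
STANDARD_CONFIG = "standard (10 options)"
VISION_CONFIG = "vision"


def load_mmmu_pro(config: str = STANDARD_CONFIG, split: str = "test"):
    """Load one MMMU-Pro configuration from the Hugging Face Hub."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("MMMU/MMMU_Pro", config, split=split)


if __name__ == "__main__":
    # Requires network access and `pip install datasets`.
    standard = load_mmmu_pro()
    print(len(standard), standard.column_names)
```

Printing `column_names` is a quick way to confirm the actual schema rather than relying on assumed field names.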

## Authors and Affiliations

The MMMU-Pro paper was authored by Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig.[1] The work was primarily conducted while Xiang Yue was a postdoctoral researcher at Carnegie Mellon University, working with Graham Neubig. Other contributors were affiliated with institutions including the University of Waterloo and The Ohio State University. The paper was published in the Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15134 to 15186, in Vienna, Austria, in July 2025.[6]

## See Also

- [MMMU](/wiki/mmmu)
- [Multimodal AI](/wiki/multimodal_ai)
- [Vision Language Model](/wiki/vision_language_model)
- [Chain-of-thought prompting](/wiki/chain_of_thought)
- [Computer Vision](/wiki/computer_vision)
- [Deep Learning](/wiki/deep_learning)

## References

1. Yue, X., Zheng, T., Ni, Y., Wang, Y., Zhang, K., Tong, S., Sun, Y., Yu, B., Zhang, G., Sun, H., Su, Y., Chen, W., & Neubig, G. (2025). MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark. *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, 15134-15186. Vienna, Austria. arXiv:2409.02813. https://arxiv.org/abs/2409.02813
2. Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., Wei, C., Yu, B., Yuan, R., Sun, R., Yin, M., Zheng, B., Yang, Z., Liu, Y., Huang, W., Sun, H., Su, Y., & Chen, W. (2024). MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 9556-9567.
3. MMMU-Pro Dataset. Hugging Face. https://huggingface.co/datasets/MMMU/MMMU_Pro
4. MMMU-Pro Benchmark Leaderboard. Artificial Analysis. https://artificialanalysis.ai/evaluations/mmmu-pro
5. MMMU-Pro Leaderboard. LLM Stats. https://llm-stats.com/benchmarks/mmmu-pro
6. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark. ACL Anthology. https://aclanthology.org/2025.acl-long.736/
7. MMMU-Pro Benchmark 2026: 27 LLM scores. BenchLM.ai. https://benchlm.ai/benchmarks/mmmuPro (accessed May 2026).
8. MMMU Benchmark. https://mmmu-benchmark.github.io/