# MMMU

> Source: https://aiwiki.ai/wiki/mmmu
> Updated: 2026-06-21
> Categories: AI Benchmarks, Machine Learning, Multimodal AI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**MMMU** (Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark) is a [multimodal AI](/wiki/multimodal_ai) benchmark of 11,550 college-level questions that pairs text with images to test expert knowledge and reasoning across 30 academic subjects and 6 disciplines. It was released in November 2023 by a team of 22 researchers led by Xiang Yue (Ohio State University) and Wenhu Chen (University of Waterloo), and the paper states that MMMU is "designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning."[1] As of June 21, 2026, the highest reported MMMU score is 86.0% (Qwen3.6 Plus), just inside the human expert range of 76.2% to 88.6%, so the benchmark is not yet saturated.[1][8]

| MMMU | |
| --- | --- |
| **Overview** | |
| Full name | Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark |
| Abbreviation | MMMU |
| Description | A massive multi-discipline multimodal benchmark evaluating expert-level understanding and reasoning across college-level subjects |
| Release date | 2023-11 |
| Latest version | 1.0 |
| Benchmark updated | 2023-12-04 |
| Authors | Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, et al. (22 authors) |
| Organizations | Ohio State University, University of Waterloo, IN.AI Research, Carnegie Mellon University |
| Venue | CVPR 2024 (Oral) |
| **Technical Details** | |
| Type | Multimodal Understanding, Expert Knowledge |
| Modality | Text, Vision (Images) |
| Task format | Multiple choice (94%), Open-ended (6%) |
| Number of questions | 11,550 |
| Data splits | Dev: 150, Validation: 900, Test: 10,500 |
| Subjects | 30 subjects across 183 subfields |
| Image types | 30+ heterogeneous types |
| Evaluation metric | [Accuracy](/wiki/accuracy), Zero-shot performance |
| Domains | Art & Design, [Business](/wiki/business), [Science](/wiki/science), Health & Medicine, Humanities & Social Science, Tech & Engineering |
| Languages | English |
| **Performance** | |
| Human expert range | 76.2% (worst) to 82.6% (medium) to 88.6% (best) |
| Random guess baseline | 22.1% (validation) |
| SOTA score | 86.0% |
| SOTA model | Qwen3.6 Plus |
| SOTA date | 2026-06 |
| Saturated | No |
| **Resources** | |
| Website | [Official website](https://mmmu-benchmark.github.io/) |
| Paper | [arXiv:2311.16502](https://arxiv.org/abs/2311.16502) |
| GitHub | [Repository](https://github.com/MMMU-Benchmark/MMMU) |
| Dataset | [Hugging Face](https://huggingface.co/datasets/MMMU/MMMU) |
| Evaluation server | [EvalAI](https://eval.ai/web/challenges/challenge-page/2179/leaderboard) |
| Successors | [MMMU-Pro](/wiki/mmmu-pro), Video-MMMU, CMMMU |

MMMU contains 11,550 meticulously collected questions sourced from college exams, quizzes, and textbooks.[1] It spans 30 subjects, 183 subfields, and six core academic disciplines, and features over 30 different image types ranging from standard photographs to specialized notations like chemical structures and music sheets.[1] The benchmark was presented as an oral paper at CVPR 2024 and has since become one of the most widely used evaluations for assessing [artificial general intelligence](/wiki/artificial_general_intelligence) capabilities in [multimodal](/wiki/multimodal_ai) contexts.[1]

## What is MMMU used for?

MMMU addresses a critical gap in AI evaluation by testing models on tasks that require both advanced visual perception and domain-specific knowledge reasoning. Unlike earlier multimodal benchmarks that focused on elementary visual understanding (identifying objects in photos, reading text from signs, or answering simple questions about natural images), MMMU demands college-level subject expertise combined with sophisticated reasoning about diverse visual content.[1] The questions mirror the kind of problems students encounter in university courses across the sciences, humanities, engineering, and professional fields.

The benchmark is named for its defining characteristics: "Massive" refers to the scale of 11,550 questions; "Multi-discipline" indicates coverage across six broad academic disciplines; "Multimodal" highlights that questions combine text and images; and "Understanding and Reasoning" reflects the higher-order cognitive skills required. The full title is the Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI, signaling the authors' goal of measuring progress toward human-expert-level [artificial general intelligence](/wiki/artificial_general_intelligence).[1] As the paper puts it, the authors hope "MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence."[1]

### Why was MMMU created?

The development of MMMU was motivated by several observations about the state of multimodal AI evaluation in 2023:

- Existing multimodal benchmarks such as VQA, OK-VQA, and TextVQA primarily tested basic visual recognition rather than genuine expert knowledge. Models could often score well on them without deep understanding of the visual content.
- Real-world professional tasks in medicine, engineering, finance, and other fields require interpreting complex visual information with domain-specific expertise. No existing benchmark captured this intersection well.
- Progress toward [artificial general intelligence](/wiki/artificial_general_intelligence) requires mastery of diverse academic disciplines, not just language fluency or basic pattern recognition. A benchmark was needed to track whether multimodal systems could approach the performance of educated human experts.
- Models at the time generalized particularly poorly to less common visual formats like circuit diagrams, chemical structures, and musical notation, but no standardized evaluation existed to quantify this weakness.

The authors explicitly framed MMMU as a tool for the research community to measure progress toward "expert AGI," arguing that the ability to reason across multiple academic disciplines using varied visual inputs represents a meaningful milestone on the path to general intelligence.[1]

## How was the MMMU dataset built?

### Collection Methodology

MMMU was built through a large-scale, human-driven collection effort. Over 50 college students from diverse academic backgrounds participated in gathering questions from textbooks, online educational resources, college exams, and lecture materials.[1] The collection process followed strict guidelines to ensure quality, diversity, and difficulty.

Each question in the dataset was required to include at least one image that is essential to answering the question correctly. This design choice ensures that models cannot simply rely on text-based reasoning to bypass the visual component.[1] The team collected questions from a wide range of sources, including university-level textbooks published by major academic publishers, past examination papers from accredited institutions, online educational platforms, and lecture slides from college courses.

### Quality Control Pipeline

The dataset underwent a multi-stage review process:

1. **Duplicate detection**: Automated tools and manual review were used to identify and remove duplicate or near-duplicate questions across the entire collection.
2. **Format standardization**: All questions were converted to a consistent format with clear question text, answer options (for multiple-choice), and associated images.
3. **Difficulty assessment**: Questions were categorized into difficulty levels. Approximately 10% of initially collected questions were excluded for being too easy, as the benchmark targets expert-level reasoning.
4. **Copyright compliance**: The team ensured that all included content respected intellectual property guidelines.
5. **Contamination prevention**: Questions with widely available or easily searchable answers were flagged and reviewed to reduce the risk of models having memorized answers during pre-training.

### Data Splits

The 11,550 questions are divided into three splits:

| Split | Size | Purpose |
| --- | --- | --- |
| Development (dev) | 150 | Few-shot and in-context learning experiments |
| Validation | 900 | Debugging models, selecting hyperparameters, and quick evaluations |
| Test | 10,500 | Official evaluation (answers withheld until February 2026; submissions via the EvalAI server) |

The development set contains 5 questions per subject (150 total across 30 subjects). The validation set contains 30 questions per subject (900 total). The test set holds the remaining 10,500 questions, and its answer labels were kept private until February 2026, when the test set answers were publicly released.[1]
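
The split sizes follow directly from the per-subject counts. The short check below recomputes them from the figures quoted above; the constant names are illustrative and not part of any MMMU tooling.

```python
# Figures taken from the text above; the names are illustrative only.
NUM_SUBJECTS = 30
TOTAL_QUESTIONS = 11_550
DEV_PER_SUBJECT = 5
VAL_PER_SUBJECT = 30

dev_size = NUM_SUBJECTS * DEV_PER_SUBJECT          # development split
val_size = NUM_SUBJECTS * VAL_PER_SUBJECT          # validation split
test_size = TOTAL_QUESTIONS - dev_size - val_size  # everything else is test

print(dev_size, val_size, test_size)  # 150 900 10500
```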

### Human Expert Baseline

To establish a meaningful human performance baseline, the MMMU team recruited 90 college senior students, with 3 experts assigned to each of the 30 subjects. Each expert completed 30 questions from the validation set within their discipline. Experts were allowed to consult their textbooks but were prohibited from searching the Internet for answers. Human expert accuracy on the validation set was reported at three levels: 76.2% (worst expert), 82.6% (medium expert), and 88.6% (best expert), against a random-guess baseline of 22.1%. This range provides a target for AI models to match or exceed.[1]

## Technical Specifications

### Discipline and Subject Coverage

MMMU covers six core academic disciplines, each containing multiple subjects:[1]

| Discipline | Subjects | Subject Count |
| --- | --- | --- |
| Art & Design | Art, Art Theory, Design, Music | 4 |
| [Business](/wiki/business) | Accounting, Economics, Finance, Management, Marketing | 5 |
| [Science](/wiki/science) | Biology, Chemistry, Geography, Math, Physics | 5 |
| Health & Medicine | Basic Medical Science, Clinical Medicine, Diagnostics & Laboratory Medicine, Pharmacy, Public Health | 5 |
| Humanities & Social Science | History, Literature, Psychology, Sociology | 4 |
| Tech & Engineering | Agriculture, Architecture & Engineering, [Computer Science](/wiki/computer_science), Electronics, Energy & Power, Materials, Mechanical Engineering | 7 |

The 30 subjects are further broken down into 183 subfields. For example, within Physics, subfields include classical mechanics, thermodynamics, electromagnetism, optics, and quantum physics. This granularity ensures that the benchmark captures a wide spectrum of college-level knowledge.
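
For per-discipline reporting it can help to keep the table above as a lookup structure. The sketch below is one way to do that, using the display names from the table; the configuration names in the Hugging Face release may be spelled differently (for example with underscores), so treat the keys as assumptions.

```python
# Discipline -> subjects, copied from the table above (display names, not
# necessarily the identifiers used by the dataset release).
DISCIPLINES = {
    "Art & Design": ["Art", "Art Theory", "Design", "Music"],
    "Business": ["Accounting", "Economics", "Finance", "Management", "Marketing"],
    "Science": ["Biology", "Chemistry", "Geography", "Math", "Physics"],
    "Health & Medicine": [
        "Basic Medical Science", "Clinical Medicine",
        "Diagnostics & Laboratory Medicine", "Pharmacy", "Public Health",
    ],
    "Humanities & Social Science": ["History", "Literature", "Psychology", "Sociology"],
    "Tech & Engineering": [
        "Agriculture", "Architecture & Engineering", "Computer Science",
        "Electronics", "Energy & Power", "Materials", "Mechanical Engineering",
    ],
}

# Reverse lookup: subject -> discipline.
SUBJECT_TO_DISCIPLINE = {
    subject: discipline
    for discipline, subjects in DISCIPLINES.items()
    for subject in subjects
}

print(len(DISCIPLINES), len(SUBJECT_TO_DISCIPLINE))  # 6 30
```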

### Image Type Diversity

One of MMMU's distinguishing features is its inclusion of over 30 heterogeneous image types. Most prior benchmarks focused on natural photographs, but MMMU deliberately includes many specialized visual formats that professionals and students encounter in their fields:[1]

| Category | Image Types | Typical Disciplines |
| --- | --- | --- |
| Photographs & artwork | Natural photos, paintings, sculptures, sketches | Art & Design, Humanities |
| Scientific diagrams | Biological diagrams, chemical structures, physics diagrams, molecular models | Science, Health & Medicine |
| Data visualizations | Bar charts, line graphs, pie charts, heatmaps, scatter plots, tables | Business, Science, Engineering |
| Technical drawings | Circuit diagrams, architectural blueprints, flowcharts, engineering schematics | Tech & Engineering |
| Maps & geography | Topographic maps, political maps, climate maps, geological cross-sections | Science, Humanities |
| Specialized notation | Music sheets, mathematical proofs, code snippets | Art & Design, Science, Engineering |
| Medical imagery | X-rays, MRI scans, CT scans, histopathology slides, microscopy images | Health & Medicine |
| 3D representations | 3D models, CAD renderings, crystal structures | Engineering, Science |

Each question can include up to seven images, allowing the benchmark to test reasoning about complex multi-image scenarios such as comparing two X-rays or analyzing a series of related diagrams.[1]

### Question Types and Formats

| Question Type | Approximate Percentage | Description |
| --- | --- | --- |
| Multiple choice | ~94% | Select the correct answer from 4 or 5 options |
| Open-ended | ~6% | Provide a short numerical or textual answer |

The heavy emphasis on multiple-choice questions allows for automated and unambiguous evaluation. Open-ended questions are included to test whether models can generate correct answers without the benefit of answer choices.

### Evaluation Methodology

#### Zero-shot Protocol

MMMU employs strict zero-shot evaluation:

- No fine-tuning on MMMU data is allowed.
- No few-shot examples are provided during testing.
- Models must rely entirely on capabilities acquired during pre-training and general instruction tuning.
- This protocol ensures fair comparison across different model families and training approaches.
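
A zero-shot multiple-choice evaluation loop can be sketched as follows. This is a minimal illustration, not the official MMMU evaluation code: the prompt wording, the helper names (`build_prompt`, `extract_choice`, `accuracy`), and the answer-extraction rule are all assumptions, and image inputs are left out for brevity.

```python
import re
import string
from typing import List, Optional


def build_prompt(question: str, options: List[str]) -> str:
    """Format one question with lettered options and no in-context examples."""
    lines = [question]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"({letter}) {option}")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)


def extract_choice(response: str, num_options: int) -> Optional[str]:
    """Return the first standalone option letter in the response, if any.

    Deliberately simple: a response that starts with the article "A" would be
    misread, so a real harness needs a more careful parser.
    """
    valid = string.ascii_uppercase[:num_options]
    match = re.search(rf"\b([{valid}])\b", response)
    return match.group(1) if match else None


def accuracy(predictions: List[Optional[str]], answers: List[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


prompt = build_prompt("Which component is shown in <image 1>?", ["Resistor", "Capacitor"])
choice = extract_choice("The answer is (B).", num_options=2)
print(prompt)
print(choice)
```

Because no exemplars are included in the prompt, the model sees only the question, its options, and (in a full harness) the associated images.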

#### Skill Dimensions

The benchmark is designed to evaluate three core skill dimensions:

| Skill | Description | What It Tests |
| --- | --- | --- |
| Perception | Accurately interpreting visual information from diverse image types | Can the model correctly read a chart, identify a chemical structure, or parse a circuit diagram? |
| Knowledge | Domain-specific factual understanding at the college level | Does the model know the relevant facts, formulas, definitions, or historical context? |
| Reasoning | Logical inference, problem-solving, and multi-step deduction | Can the model combine visual evidence with domain knowledge to derive the correct answer? |

Many MMMU questions require all three skills simultaneously. For instance, a question about organic chemistry might require recognizing a molecular structure (perception), knowing reaction mechanisms (knowledge), and predicting the product of a specific reaction (reasoning).

#### Difficulty Levels

Questions in MMMU are categorized by difficulty:

| Difficulty | GPT-4V Accuracy (original paper) | Description |
| --- | --- | --- |
| Easy | 76.1% | Straightforward questions requiring basic recognition and recall |
| Medium | 55.6% | Questions needing moderate domain knowledge and multi-step reasoning |
| Hard | Near random performance | Complex questions requiring deep expertise and sophisticated reasoning |

The sharp drop-off from Easy to Hard questions illustrates that even advanced models struggle significantly once genuine expert-level reasoning is required.
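
A per-difficulty breakdown like the one above can be produced by grouping scored records on the `topic_difficulty` field. The sketch below uses made-up records purely for illustration; the `prediction` key is a hypothetical name for a model's extracted answer.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Compute accuracy separately for each distinct value of `key`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        hits[group] += int(record["prediction"] == record["answer"])
    return {group: hits[group] / totals[group] for group in totals}


# Toy records, not real MMMU data.
records = [
    {"topic_difficulty": "Easy", "answer": "A", "prediction": "A"},
    {"topic_difficulty": "Easy", "answer": "B", "prediction": "B"},
    {"topic_difficulty": "Medium", "answer": "C", "prediction": "C"},
    {"topic_difficulty": "Medium", "answer": "D", "prediction": "A"},
    {"topic_difficulty": "Hard", "answer": "B", "prediction": "C"},
]

by_difficulty = accuracy_by_group(records, "topic_difficulty")
print(by_difficulty)  # {'Easy': 1.0, 'Medium': 0.5, 'Hard': 0.0}
```

The same helper works for any other categorical field, such as `subfield` or `question_type`.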

## How well do AI models score on MMMU?

### Original Paper Results (November 2023)

When MMMU was first released, the authors evaluated a range of proprietary and open-source [large multimodal models](/wiki/multimodal_ai). The results revealed a substantial gap between the best models and human experts. The paper noted that "even the advanced GPT-4V only achieves a 56% accuracy, indicating significant room for improvement."[1]

| Model | Overall Accuracy | Notes |
| --- | --- | --- |
| Human experts | 76.2% to 88.6% | 90 college seniors across 30 subjects |
| Gemini Ultra | 59.4% | Google's top multimodal model at the time |
| [GPT-4V](/wiki/gpt-4) | 56.8% | [OpenAI](/wiki/openai)'s multimodal model |
| BLIP2-FLAN-T5-XXL | ~34% | Leading open-source model at the time |
| LLaVA-1.5 | ~34% | Open-source multimodal model |
| Random guess | 22.1% | Random-choice baseline on the validation set |

Key findings from the original evaluation:

1. Even the strongest proprietary models fell well short of average human expert performance.
2. Open-source models (ranging from 13B to 34B parameters) scored roughly 20 percentage points below GPT-4V, averaging around 34% accuracy.
3. Models performed best in Humanities and Art & Design, where visual complexity tends to be lower and language-based reasoning carries more weight.
4. All models struggled significantly with uncommon image types such as geometric shapes, music sheets, and chemical structures, sometimes performing near random chance.

### What is the current MMMU state of the art?

Since its release, MMMU has been widely adopted as a standard evaluation benchmark. Top model scores have improved substantially, with the best systems now exceeding 85% accuracy. As of June 21, 2026, the highest reported MMMU score on the LLM-Stats leaderboard is 86.0%, held by Alibaba's Qwen3.6 Plus, which narrowly edges out the GPT-5.1 family at 85.4%. The following table shows a selection of notable scores from the current leaderboard:[8]

| Rank | Model | Organization | Score |
| --- | --- | --- | --- |
| 1 | Qwen3.6 Plus | [Alibaba](/wiki/alibaba_cloud) | 86.0% |
| 2 | GPT-5.1 | [OpenAI](/wiki/openai) | 85.4% |
| 5 | [GPT-5](/wiki/gpt-5) | [OpenAI](/wiki/openai) | 84.2% |
| 6 | Qwen3.5-122B-A10B | [Alibaba](/wiki/alibaba_cloud) | 83.9% |
| 7 | [o3](/wiki/openai_o-series) | [OpenAI](/wiki/openai) | 82.9% |
| 10 | [Gemini 2.5 Pro](/wiki/gemini) | [Google](/wiki/google) | 82.0% |
| 11 | [Gemini 2.5 Flash](/wiki/gemini) | [Google](/wiki/google) | 79.7% |
| 14 | [Grok-3](/wiki/grok) | [xAI](/wiki/xai) | 78.0% |
| 15 | [o1](/wiki/openai_o-series) | [OpenAI](/wiki/openai) | 77.6% |
| 18 | [Claude 3.7 Sonnet](/wiki/claude) | [Anthropic](/wiki/anthropic) | 75.0% |
| 20 | [Claude Sonnet 4](/wiki/claude) | [Anthropic](/wiki/anthropic) | 74.4% |
| 24 | [GPT-4o](/wiki/gpt-4) | [OpenAI](/wiki/openai) | 72.2% |
| 27 | Qwen2.5 VL 72B | [Alibaba](/wiki/alibaba_cloud) | 70.2% |
| 31 | [Claude 3.5 Sonnet](/wiki/claude) | [Anthropic](/wiki/anthropic) | 68.3% |
| 34 | [Gemini 1.5 Pro](/wiki/gemini) | [Google](/wiki/google) | 65.9% |
| 40 | [Llama 3.2 90B](/wiki/llama) | [Meta](/wiki/meta_ai) | 60.3% |

Several models now surpass the lower end of human expert performance (76.2%), and the leaders also exceed the medium expert score (82.6%). The top-performing model, Qwen3.6 Plus at 86.0%, still sits 2.6 percentage points below the best human experts (88.6%), which is why MMMU is still listed as unsaturated.[8]

### Performance by Discipline

Models consistently show uneven performance across the six disciplines. Humanities and Social Science questions tend to yield the highest scores, while Tech and Engineering questions remain the most challenging:

| Discipline | Typical Top-Model Range | Key Challenge |
| --- | --- | --- |
| Humanities & Social Science | 75% to 85% | Requires cultural and historical knowledge but visual complexity is lower |
| Art & Design | 70% to 80% | Demands aesthetic judgment and art history knowledge |
| [Business](/wiki/business) | 68% to 78% | Financial charts and accounting problems |
| Health & Medicine | 65% to 78% | Complex medical imagery and clinical reasoning |
| [Science](/wiki/science) | 60% to 72% | Diverse scientific diagrams and mathematical reasoning |
| Tech & Engineering | 50% to 65% | Circuit diagrams, engineering schematics, and code |

### Performance by Image Type

The type of visual content in a question has a major impact on model accuracy. Models trained primarily on natural images and web content tend to struggle with specialized visual formats:

| Image Type | Best Model Performance | Worst Model Performance | Key Insight |
| --- | --- | --- | --- |
| Photos and paintings | 75% to 85% | 40% to 50% | Most familiar image type during training |
| Charts and graphs | 65% to 80% | 35% to 45% | Requires precise numerical reading |
| Chemical structures | 40% to 55% | 15% to 25% | Specialized domain notation |
| Circuit diagrams | 35% to 50% | Near random | Very limited training exposure |
| Music sheets | 25% to 40% | Near random | Extremely rare in training data |
| Geometric shapes | 30% to 45% | Near random | Requires spatial reasoning |

### Error Analysis

The MMMU authors and subsequent researchers have identified four primary categories of model errors:[1]

| Error Type | Frequency | Description |
| --- | --- | --- |
| Perception errors | ~30% | Misinterpreting visual elements (misreading a chart value, confusing parts of a diagram) |
| Knowledge gaps | ~35% | Lacking the domain-specific information needed to answer correctly |
| Reasoning failures | ~25% | Applying incorrect logical inference or making computational mistakes |
| Integration errors | ~10% | Failing to properly combine visual and textual information |

These error categories are not mutually exclusive. A single incorrect answer may involve both a perception error (misreading part of an image) and a reasoning failure (drawing an incorrect conclusion from the misread data).

## MMMU-Pro

### Overview

MMMU-Pro is a more challenging successor benchmark introduced in September 2024 by a largely overlapping team of researchers, including Xiang Yue, Tianyu Zheng, Yuansheng Ni, and others. The paper was accepted at ACL 2025.[2] MMMU-Pro was designed to address limitations in the original MMMU by filtering out questions that could be solved through shortcuts and introducing harder evaluation conditions.[2] The authors report that accuracy on MMMU-Pro was 16.8 to 26.9 percentage points lower than on MMMU, depending on the model.[2]

### Three-Step Construction Process

MMMU-Pro was constructed from the original MMMU dataset through a rigorous three-step process:[2]

**Step 1: Text-only filtering.** The team used four strong open-source LLMs (Llama3-70B-Instruct, Qwen2-72B-Instruct, Yi-1.5-34B-Chat, and Mixtral-8x22B-Instruct) to identify questions that could be answered correctly without seeing the image. Each model attempted each question text-only across ten trials. Questions that were answered correctly by at least three of the four models more than five times were excluded. This process ensured that every remaining question genuinely requires visual understanding.[2]
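
The exclusion rule in Step 1 can be written as a small predicate. This is a sketch of the rule as described in the paragraph above, not the authors' code; the function name and the per-model counts in the example are hypothetical.

```python
TRIALS = 10             # text-only attempts per model and question
MIN_CORRECT_TRIALS = 6  # "more than five times" out of ten
MIN_MODELS = 3          # "at least three of the four models"


def is_text_only_solvable(correct_counts: dict) -> bool:
    """Return True if a question should be excluded.

    `correct_counts` maps a model name to how many of its text-only
    trials answered the question correctly.
    """
    models_passing = sum(
        count >= MIN_CORRECT_TRIALS for count in correct_counts.values()
    )
    return models_passing >= MIN_MODELS


# Hypothetical counts for two questions.
leaky = {
    "Llama3-70B-Instruct": 8,
    "Qwen2-72B-Instruct": 7,
    "Yi-1.5-34B-Chat": 6,
    "Mixtral-8x22B-Instruct": 2,
}
visual = {
    "Llama3-70B-Instruct": 9,
    "Qwen2-72B-Instruct": 10,
    "Yi-1.5-34B-Chat": 5,
    "Mixtral-8x22B-Instruct": 4,
}

print(is_text_only_solvable(leaky))   # True: three models exceed five correct trials
print(is_text_only_solvable(visual))  # False: only two models do
```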

**Step 2: Augmenting candidate options.** For the remaining questions, human experts working alongside [GPT-4o](/wiki/gpt-4) expanded the multiple-choice options from 4 to 10. This makes random guessing far less effective (10% chance versus 25%) and forces models to discriminate among more plausible distractors. During this phase, 70 additional questions were removed because the image-question relevance was insufficient, leaving 1,730 standard-format questions.[2]

**Step 3: Vision-only input setting.** Human annotators manually captured screenshots and photographs of the questions displayed on screens, with varying backgrounds, font styles, and font sizes. This created a parallel set of 1,730 "vision-only" questions where the model must extract the question text from the image itself, testing integrated visual and textual processing without separate text input.[2]

The final MMMU-Pro dataset contains 3,460 questions (1,730 standard + 1,730 vision-only), spread across the same 30 subjects as the original MMMU (roughly 58 standard-format questions per subject on average).[2]

### MMMU-Pro Results

Performance on MMMU-Pro is dramatically lower than on the original MMMU. The following table shows results from the original MMMU-Pro paper:[2]

| Model | MMMU (Val) | MMMU-Pro (Overall) | Performance Drop |
| --- | --- | --- | --- |
| [GPT-4o](/wiki/gpt-4) | 69.1% | 51.9% | -17.2 pp |
| [Claude 3.5 Sonnet](/wiki/claude) | 68.3% | 51.5% | -16.8 pp |
| [Gemini 1.5 Pro](/wiki/gemini) | 65.8% | 46.9% | -18.9 pp |
| Qwen2-VL-72B | 64.5% | 46.2% | -18.3 pp |
| VILA-1.5-40B | 51.9% | 25.0% | -26.9 pp |

The sharp performance drops demonstrate that a significant portion of MMMU accuracy came from shortcuts and guessing strategies rather than genuine multimodal understanding.
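
The "Performance Drop" column can be recomputed from the two accuracy columns of the table above. The snippet below does that and reports the smallest and largest drops; the variable names are illustrative.

```python
# (MMMU validation accuracy, MMMU-Pro overall accuracy), copied from the table.
RESULTS = {
    "GPT-4o": (69.1, 51.9),
    "Claude 3.5 Sonnet": (68.3, 51.5),
    "Gemini 1.5 Pro": (65.8, 46.9),
    "Qwen2-VL-72B": (64.5, 46.2),
    "VILA-1.5-40B": (51.9, 25.0),
}

# Drop in percentage points, rounded to one decimal to absorb float noise.
drops = {model: round(pro - val, 1) for model, (val, pro) in RESULTS.items()}

smallest_drop = max(drops.values())  # closest to zero
largest_drop = min(drops.values())   # most negative

print(drops)
print(smallest_drop, largest_drop)  # -16.8 -26.9
```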

### MMMU-Pro Key Findings

- **Option augmentation is highly effective.** Expanding from 4 to 10 choices reduced GPT-4o's accuracy by 10.7 percentage points even before applying the vision-only setting, showing that models relied partly on process-of-elimination strategies.[2]
- **Vision-only input is challenging.** Embedding questions within images caused additional performance drops, particularly for models like LLaVA-OneVision-72B (14.0 percentage point decrease), indicating that many models struggle with integrated text-image processing.[2]
- **OCR prompts have minimal effect.** Explicitly prompting models to perform OCR on the vision-only inputs did not significantly improve performance, suggesting that capable models already extract text effectively but struggle with the deeper reasoning challenges.[2]
- **Chain of Thought helps selectively.** [Chain of Thought](/wiki/chain_of_thought) prompting improved some models substantially (Claude 3.5 Sonnet rose from 42.7% to 55.0% in the standard setting) but hurt others, particularly smaller models with weaker instruction-following abilities.[2]

### Current MMMU-Pro Leaderboard (June 2026)

As models have improved, MMMU-Pro scores have risen considerably from the original paper's results. As of June 21, 2026, the LLM-Stats MMMU-Pro leaderboard is led by Google's Gemini 3.5 Flash at 83.6%, ahead of OpenAI's GPT-5.5 at 83.2%:[9]

| Rank | Model | Organization | Score |
| --- | --- | --- | --- |
| 1 | [Gemini 3.5 Flash](/wiki/gemini) | [Google](/wiki/google) | 83.6% |
| 2 | GPT-5.5 | [OpenAI](/wiki/openai) | 83.2% |
| 3 | Gemini 3 Flash | [Google](/wiki/google) | 81.2% |
| 3 | GPT-5.4 | [OpenAI](/wiki/openai) | 81.2% |
| 5 | Gemini 3 Pro | [Google](/wiki/google) | 81.0% |
| 7 | [GPT-5](/wiki/gpt-5) | [OpenAI](/wiki/openai) | 78.4% |
| 8 | Claude Opus 4.6 | [Anthropic](/wiki/anthropic) | 77.3% |
| 12 | [o3](/wiki/openai_o-series) | [OpenAI](/wiki/openai) | 76.4% |
| 25 | [GPT-4o](/wiki/gpt-4) | [OpenAI](/wiki/openai) | 59.9% |

## MMMU Family of Benchmarks

The success of MMMU has led to the development of several related benchmarks, forming a broader "MMMU family" that evaluates different aspects of multimodal understanding:

### CMMMU (Chinese MMMU)

CMMMU (Chinese Massive Multi-discipline Multimodal Understanding) was released in early 2024 as a Chinese-language counterpart to MMMU. It contains approximately 12,000 manually collected multimodal questions covering the same six disciplines and 30 subjects as MMMU, but sourced from Chinese educational curricula. CMMMU includes 39 heterogeneous image types and tests models on Chinese-specific academic content. Even GPT-4V only achieved approximately 42% accuracy on CMMMU, highlighting the additional challenge of non-English academic evaluation.[4]

### Video-MMMU

Video-MMMU extends the MMMU paradigm to video understanding. Developed by researchers at Nanyang Technological University and Carnegie Mellon University, Video-MMMU contains 300 college-level lecture videos and 900 human-annotated questions across the same six disciplines and 30 subjects.[3] The benchmark evaluates knowledge acquisition through three cognitive stages: Perception (identifying key information), Comprehension (understanding underlying concepts), and Adaptation (applying knowledge to novel scenarios). A novel metric called delta-knowledge measures how much a model's performance improves after watching an educational video. Human learners achieved a 33.1% knowledge gain, while GPT-4o achieved only 15.6% and Claude 3.5 Sonnet achieved 11.4%, revealing a significant gap in video-based learning capabilities.[3]
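
The text above does not spell out how delta-knowledge is computed. The sketch below assumes a normalized-gain form, in which the improvement is divided by the headroom that was available before watching the video; this is an assumption to check against the Video-MMMU paper, and the example numbers are hypothetical.

```python
def knowledge_gain(acc_before: float, acc_after: float) -> float:
    """Normalized gain in percent (assumed definition, see lead-in).

    Both accuracies are percentages in [0, 100]. The gain is the share of
    the remaining headroom (100 - acc_before) that was recovered.
    """
    if not 0.0 <= acc_before < 100.0:
        raise ValueError("acc_before must be in [0, 100)")
    return (acc_after - acc_before) / (100.0 - acc_before) * 100.0


# Hypothetical example: 40% before the video, 60% after.
gain = knowledge_gain(40.0, 60.0)
print(round(gain, 1))  # 33.3
```

Under this definition a negative value means accuracy fell after watching the video.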

### Uni-MMMU

Uni-MMMU, published in late 2025, is a unified benchmark that tests bidirectional synergy between generation and understanding across eight reasoning-centric domains including science, coding, mathematics, and puzzles. Each task is bidirectionally coupled, requiring models either to leverage conceptual understanding to guide precise visual synthesis or to use generation as a cognitive scaffold for analytical reasoning.

## Applications and Impact

### Significance in AI Research

MMMU has become one of the most widely cited and used benchmarks for multimodal AI evaluation since its release. Its significance stems from several factors:

- It was the first large-scale benchmark to systematically test multimodal models on college-level expert knowledge across multiple disciplines.
- Major AI companies including [OpenAI](/wiki/openai), [Google](/wiki/google), [Anthropic](/wiki/anthropic), and [Meta](/wiki/meta_ai) routinely report MMMU scores when releasing new multimodal models.
- The benchmark has helped identify specific weaknesses in vision-language models, such as poor handling of specialized visual formats and weak cross-modal reasoning.
- MMMU's design philosophy, prioritizing expert-level reasoning over basic visual understanding, has influenced the development of other challenging benchmarks.

### Research Applications

MMMU enables several lines of research:

- **Multimodal architecture design**: Identifying which model architectures best handle diverse visual inputs combined with knowledge-intensive reasoning.
- **Knowledge integration**: Studying how to combine visual perception with domain expertise during both pre-training and fine-tuning.
- **Visual generalization**: Understanding why models transfer poorly to uncommon image formats and developing training strategies to address this.
- **Curriculum learning**: Using MMMU's difficulty levels to design progressive training regimes.
- **Error analysis**: Diagnosing whether model failures stem from perception, knowledge, or reasoning limitations.

### Professional and Educational Applications

| Field | Application | MMMU Relevance |
| --- | --- | --- |
| Medicine | Diagnostic assistance and medical education | Medical image interpretation (X-rays, histopathology) |
| Engineering | Design validation and review | Technical drawing and schematic comprehension |
| Finance | Automated report analysis | Chart and data visualization understanding |
| Education | AI tutoring systems and automated assessment | Multi-discipline knowledge evaluation |
| Research | Scientific literature review | Scientific diagram and figure interpretation |

## Dataset Access and Usage

### Access Methods

MMMU is publicly available through Hugging Face Datasets:[6]

```python
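from datasets import concatenate_datasets, get_dataset_config_names, load_dataset

# MMMU is organized as one configuration per subject, so pass a subject name.
accounting = load_dataset("MMMU/MMMU", "Accounting")
physics = load_dataset("MMMU/MMMU", "Physics")

# Build the full validation set by looping over every subject configuration.
subjects = get_dataset_config_names("MMMU/MMMU")
full_validation = concatenate_datasets(
    [load_dataset("MMMU/MMMU", name, split="validation") for name in subjects]
)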
```

The dataset is also available for direct download from the [Hugging Face repository](https://huggingface.co/datasets/MMMU/MMMU).[6]

### Evaluation Server

- **Launch date**: December 4, 2023
- **Platform**: EvalAI
- **URL**: [https://eval.ai/web/challenges/challenge-page/2179/leaderboard](https://eval.ai/web/challenges/challenge-page/2179/leaderboard)
- **Submission format**: JSON file with model predictions
- **Test set answers released**: February 12, 2026 (local evaluation now possible)
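
A predictions file can be assembled with the standard `json` module. The exact schema the server expects is not given here, so the layout below (a flat mapping from question id to predicted answer) and the example ids are assumptions for illustration only.

```python
import json

# Hypothetical predictions: question id -> predicted answer.
# Multiple-choice answers are option letters; open-ended answers are short strings.
predictions = {
    "test_Accounting_1": "B",
    "test_Physics_42": "0.25",
}

# Serialize deterministically so repeated runs produce identical files.
payload = json.dumps(predictions, indent=2, sort_keys=True)

# Round-trip check before writing the payload to disk or uploading it.
restored = json.loads(payload)
print(payload)
```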

### Each Data Sample

Every question in MMMU includes the following fields:

| Field | Description |
| --- | --- |
| id | Unique identifier |
| question | Question text |
| options | Multiple-choice answer options (if applicable) |
| answer | Correct answer |
| explanation | Detailed explanation of the correct answer |
| image_1 to image_7 | Up to 7 associated images |
| img_type | Type classification of the primary image |
| topic_difficulty | Difficulty level (Easy, Medium, Hard) |
| question_type | Multiple choice or open-ended |
| subfield | Specific subfield within the subject |
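
Working with a sample usually means gathering its populated image slots and turning the options into a list. The sketch below assumes that unused image fields are `None` and that `options` may arrive as a string-encoded Python list; both are assumptions about the release format, and the sample itself is made up.

```python
import ast

IMAGE_FIELDS = [f"image_{i}" for i in range(1, 8)]  # image_1 ... image_7


def collect_images(sample: dict) -> list:
    """Return the non-empty image slots of a sample, in order."""
    return [sample[name] for name in IMAGE_FIELDS if sample.get(name) is not None]


def parse_options(raw) -> list:
    """Accept either a real list or a string such as "['a', 'b']"."""
    if isinstance(raw, str):
        return list(ast.literal_eval(raw)) if raw.strip() else []
    return list(raw or [])


# Made-up sample; a string stands in for the actual image object.
sample = {
    "id": "validation_Physics_1",
    "question": "What current flows through the resistor in <image 1>?",
    "options": "['2 A', '4 A', '6 A', '8 A']",
    "answer": "B",
    "image_1": "<image-1 placeholder>",
    "image_2": None,
    "question_type": "multiple-choice",
}

images = collect_images(sample)
options = parse_options(sample["options"])
print(len(images), options)  # 1 ['2 A', '4 A', '6 A', '8 A']
```

`ast.literal_eval` is used instead of `eval` so that only Python literals are accepted when decoding the options string.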

## Limitations

| Limitation | Description | Impact |
| --- | --- | --- |
| English only | All questions are in English | Does not assess multilingual multimodal capabilities (though CMMMU addresses Chinese) |
| Static dataset | Fixed set of questions that does not change | Models could potentially overfit through repeated evaluation or data contamination |
| US-centric curriculum | Questions drawn primarily from US college materials | May not reflect educational standards in other countries |
| Limited interactivity | Single-turn question answering only | Does not test multi-turn dialogue or iterative problem solving |
| Mostly multiple choice | 94% of questions are multiple choice | May not fully capture depth of understanding; partial credit is not possible |
| No video or audio | Only static images and text | Does not test temporal reasoning or audio understanding (though Video-MMMU addresses this) |

## Future Directions

Several research directions build on the foundation laid by MMMU:

1. **Multilingual expansion**: Extending the benchmark to additional languages beyond English and Chinese. CMMMU covers Chinese, but benchmarks for other major languages are still needed.
2. **Dynamic evaluation**: Developing procedurally generated or frequently updated question sets to prevent contamination and memorization.
3. **Interactive evaluation**: Multi-turn reasoning tasks where models can ask clarifying questions or request additional information.
4. **Video and temporal reasoning**: Extending to video-based questions requiring understanding of temporal sequences and dynamic visual content.
5. **Fine-grained skill assessment**: More detailed breakdowns of perception, knowledge, and reasoning sub-skills to identify specific model weaknesses.
6. **Unified generation and understanding**: Benchmarks like Uni-MMMU that test whether models can both generate and understand visual content across disciplines.

## Related Benchmarks

- **[MMLU](/wiki/mmlu)**: Text-only multi-task language understanding benchmark covering 57 subjects
- **[MMLU-Pro](/wiki/mmlu-pro)**: Enhanced version of MMLU with harder questions and 10 answer options
- **MathVista**: Mathematical reasoning with visual inputs
- **ScienceQA**: Multimodal science questions from elementary to high school level
- **ChartQA**: Chart and data visualization understanding
- **AI2D**: Science diagram understanding
- **TextVQA**: Text reading in natural images
- **DocVQA**: Document understanding and visual question answering
- **[HumanEval](/wiki/humaneval)**: Code generation benchmark (text-only, for comparison)
- **VQA v2.0**: General visual question answering on natural images

## See Also

- [MM-BrowseComp](/wiki/mm_browsecomp)
- [MMStar](/wiki/mmstar)
- [Multimodal AI](/wiki/multimodal_ai)
- [Vision-Language Models](/wiki/vision_language_model)
- [Computer Vision](/wiki/computer_vision)
- [Artificial General Intelligence](/wiki/artificial_general_intelligence)
- [Large Language Models](/wiki/large_language_model)
- [MMLU](/wiki/mmlu)
- [GPT-4](/wiki/gpt-4)
- [Gemini](/wiki/gemini)

## References

1. Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., Wei, C., Yu, B., Yuan, R., Sun, R., Yin, M., Zheng, B., Yang, Z., Liu, Y., Huang, W., Sun, H., Su, Y., & Chen, W. (2024). "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024)*. [https://arxiv.org/abs/2311.16502](https://arxiv.org/abs/2311.16502)
2. Yue, X., Zheng, T., Ni, Y., Wang, Y., Zhang, K., Tong, S., Sun, Y., Yu, B., Zhang, G., Sun, H., Su, Y., Chen, W., & Neubig, G. (2025). "MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark." *Proceedings of ACL 2025*. [https://arxiv.org/abs/2409.02813](https://arxiv.org/abs/2409.02813)
3. Hu, K., Wu, P., Pu, F., Xiao, W., Yue, X., Zhang, Y., Li, B., & Liu, Z. (2025). "Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos." [https://videommmu.github.io/](https://videommmu.github.io/)
4. Zhang, G., Du, Y., et al. (2024). "CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark." [https://arxiv.org/abs/2401.11944](https://arxiv.org/abs/2401.11944)
5. MMMU Benchmark Official Website. [https://mmmu-benchmark.github.io/](https://mmmu-benchmark.github.io/)
6. MMMU Dataset on Hugging Face. [https://huggingface.co/datasets/MMMU/MMMU](https://huggingface.co/datasets/MMMU/MMMU)
7. MMMU GitHub Repository. [https://github.com/MMMU-Benchmark/MMMU](https://github.com/MMMU-Benchmark/MMMU)
8. MMMU Leaderboard, LLM-Stats (accessed June 21, 2026). [https://llm-stats.com/benchmarks/mmmu](https://llm-stats.com/benchmarks/mmmu)
9. MMMU-Pro Leaderboard, LLM-Stats (accessed June 21, 2026). [https://llm-stats.com/benchmarks/mmmu-pro](https://llm-stats.com/benchmarks/mmmu-pro)

