| MMMU | |
|---|---|
| Overview | |
| Full name | Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark |
| Abbreviation | MMMU |
| Description | A massive multi-discipline multimodal benchmark evaluating expert-level understanding and reasoning across college-level subjects |
| Release date | 2023-11 |
| Latest version | 1.0 |
| Benchmark updated | 2023-12-04 |
| Authors | Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, et al. (22 authors) |
| Organizations | Ohio State University, University of Waterloo, IN.AI Research, Carnegie Mellon University |
| Venue | CVPR 2024 (Oral) |
| Technical Details | |
| Type | Multimodal Understanding, Expert Knowledge |
| Modality | Text, Vision (Images) |
| Task format | Multiple choice (94%), Open-ended (6%) |
| Number of questions | 11,550 |
| Data splits | Dev: 150, Validation: 900, Test: 10,500 |
| Subjects | 30 subjects across 183 subfields |
| Image types | 30+ heterogeneous types |
| Evaluation metric | Accuracy (evaluated zero-shot) |
| Domains | Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, Tech & Engineering |
| Languages | English |
| Performance | |
| Human expert range | 76.2% to 88.6% |
| Random guess baseline | 22.3% |
| SOTA score | 85.4% |
| SOTA model | GPT-5.1 |
| SOTA date | 2025 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | arXiv:2311.16502 |
| GitHub | Repository |
| Dataset | Hugging Face |
| Evaluation server | EvalAI |
| Successors | MMMU-Pro, Video-MMMU, CMMMU |
**MMMU** (Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark) is a comprehensive multimodal AI benchmark designed to evaluate models on expert-level understanding and reasoning across college-level academic subjects. Released in November 2023 by a team of 22 researchers led by Xiang Yue at Ohio State University and Wenhu Chen at the University of Waterloo, MMMU contains 11,550 meticulously collected questions sourced from college exams, quizzes, and textbooks. It spans 30 subjects, 183 subfields, and six core academic disciplines, and features over 30 different image types ranging from standard photographs to specialized notations like chemical structures and music sheets. The benchmark was presented as an oral paper at CVPR 2024 and has since become one of the most widely used evaluations for assessing artificial general intelligence capabilities in multimodal contexts.
MMMU addresses a critical gap in AI evaluation by testing models on tasks that require both advanced visual perception and domain-specific knowledge reasoning. Unlike earlier multimodal benchmarks that focused on elementary visual understanding (identifying objects in photos, reading text from signs, or answering simple questions about natural images), MMMU demands college-level subject expertise combined with sophisticated reasoning about diverse visual content. The questions mirror the kind of problems students encounter in university courses across the sciences, humanities, engineering, and professional fields.
The benchmark is named for its defining characteristics: "Massive" refers to the scale of 11,550 questions; "Multi-discipline" indicates coverage across six broad academic disciplines; "Multimodal" highlights that questions combine text and images; and "Understanding and Reasoning" reflects the higher-order cognitive skills required. The full title is the Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI, signaling the authors' goal of measuring progress toward human-expert-level artificial general intelligence.
The development of MMMU was motivated by several observations about the state of multimodal AI evaluation in 2023: existing benchmarks concentrated on elementary visual understanding, rarely required college-level domain expertise, and drew their visual content almost exclusively from natural photographs rather than the specialized formats used in academic and professional work.
The authors explicitly framed MMMU as a tool for the research community to measure progress toward "expert AGI," arguing that the ability to reason across multiple academic disciplines using varied visual inputs represents a meaningful milestone on the path to general intelligence.
MMMU was built through a large-scale, human-driven collection effort. Over 50 college students from diverse academic backgrounds participated in gathering questions from textbooks, online educational resources, college exams, and lecture materials. The collection process followed strict guidelines to ensure quality, diversity, and difficulty.
Each question in the dataset was required to include at least one image that is essential to answering the question correctly. This design choice ensures that models cannot simply rely on text-based reasoning to bypass the visual component. The team collected questions from a wide range of sources, including university-level textbooks published by major academic publishers, past examination papers from accredited institutions, online educational platforms, and lecture slides from college courses.
The dataset then underwent a multi-stage review process in which the core team screened the collected questions for formatting problems, ambiguity, duplicates, and incorrect answers before final inclusion.
The 11,550 questions are divided into three splits:
| Split | Size | Purpose |
|---|---|---|
| Development (dev) | 150 | Few-shot and in-context learning experiments |
| Validation | 900 | Debugging models, selecting hyperparameters, and quick evaluations |
| Test | 10,500 | Official evaluation (answers withheld; submission via EvalAI server) |
The development set contains 5 questions per subject (150 total across 30 subjects). The validation set contains 30 questions per subject (900 total). The test set holds the remaining 10,500 questions, and its answer labels were kept private until February 2026, when the test set answers were publicly released.
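The split sizes can be verified per subject through the Hugging Face `datasets` library; a minimal sketch, assuming the repository's standard configuration and split names (full loading examples appear later in this article):

```python
from datasets import load_dataset

# Each subject configuration carries all three official splits
physics = load_dataset("MMMU/MMMU", "Physics")
print(physics["dev"].num_rows)         # 5 dev questions per subject
print(physics["validation"].num_rows)  # 30 validation questions per subject
print(physics["test"].num_rows)        # the remainder; answer labels withheld
```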
To establish a meaningful human performance baseline, the MMMU team recruited 90 college senior students, with 3 experts assigned to each of the 30 subjects. Each expert completed 30 questions from the validation set within their discipline. Experts were allowed to consult their textbooks but were prohibited from searching the Internet for answers. Human expert accuracy ranged from 76.2% to 88.6% across subjects, providing a target for AI models to match or exceed.
MMMU covers six core academic disciplines, each containing multiple subjects:
| Discipline | Subjects | Subject Count |
|---|---|---|
| Art & Design | Art, Art Theory, Design, Music | 4 |
| Business | Accounting, Economics, Finance, Management, Marketing | 5 |
| Science | Biology, Chemistry, Geography, Math, Physics | 5 |
| Health & Medicine | Basic Medical Science, Clinical Medicine, Diagnostics & Laboratory Medicine, Pharmacy, Public Health | 5 |
| Humanities & Social Science | History, Literature, Psychology, Sociology | 4 |
| Tech & Engineering | Agriculture, Architecture & Engineering, Computer Science, Electronics, Energy & Power, Materials, Mechanical Engineering | 7 |
The 30 subjects are further broken down into 183 subfields. For example, within Physics, subfields include classical mechanics, thermodynamics, electromagnetism, optics, and quantum physics. This granularity ensures that the benchmark captures a wide spectrum of college-level knowledge.
One of MMMU's distinguishing features is its inclusion of over 30 heterogeneous image types. Most prior benchmarks focused on natural photographs, but MMMU deliberately includes many specialized visual formats that professionals and students encounter in their fields:
| Category | Image Types | Typical Disciplines |
|---|---|---|
| Photographs & artwork | Natural photos, paintings, sculptures, sketches | Art & Design, Humanities |
| Scientific diagrams | Biological diagrams, chemical structures, physics diagrams, molecular models | Science, Health & Medicine |
| Data visualizations | Bar charts, line graphs, pie charts, heatmaps, scatter plots, tables | Business, Science, Engineering |
| Technical drawings | Circuit diagrams, architectural blueprints, flowcharts, engineering schematics | Tech & Engineering |
| Maps & geography | Topographic maps, political maps, climate maps, geological cross-sections | Science, Humanities |
| Specialized notation | Music sheets, mathematical proofs, code snippets | Art & Design, Science, Engineering |
| Medical imagery | X-rays, MRI scans, CT scans, histopathology slides, microscopy images | Health & Medicine |
| 3D representations | 3D models, CAD renderings, crystal structures | Engineering, Science |
Each question can include up to seven images, allowing the benchmark to test reasoning about complex multi-image scenarios such as comparing two X-rays or analyzing a series of related diagrams.
| Question Type | Approximate Percentage | Description |
|---|---|---|
| Multiple choice | ~94% | Select the correct answer from 4 or 5 options |
| Open-ended | ~6% | Provide a short numerical or textual answer |
The heavy emphasis on multiple-choice questions allows for automated and unambiguous evaluation. Open-ended questions are included to test whether models can generate correct answers without the benefit of answer choices.
MMMU employs strict zero-shot evaluation: models are tested without benchmark-specific fine-tuning or few-shot demonstrations, so reported scores reflect out-of-the-box capability. For multiple-choice questions, the predicted option letter is parsed from the model's free-form response and compared against the gold answer.
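A minimal sketch of such a scoring loop for the multiple-choice portion (the regex-based letter extraction below is an illustrative simplification, not the official evaluation script):

```python
import re

def extract_choice(response: str, options: list[str]) -> str | None:
    """Pull the first valid option letter (A, B, ...) out of a free-form response."""
    valid = {chr(ord("A") + i) for i in range(len(options))}
    for token in re.findall(r"\b([A-J])\b", response):
        if token in valid:
            return token
    return None

def accuracy(responses: list[str], records: list[dict]) -> float:
    """Exact-match accuracy over records carrying 'options' and 'answer' fields."""
    correct = sum(
        extract_choice(resp, rec["options"]) == rec["answer"]
        for resp, rec in zip(responses, records)
    )
    return correct / len(records)

# Example: two model responses scored against gold answers
records = [
    {"options": ["2", "4", "6", "8"], "answer": "B"},
    {"options": ["red", "blue", "green", "yellow"], "answer": "D"},
]
print(accuracy(["The answer is B.", "I would choose C."], records))  # 0.5
```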
The benchmark is designed to evaluate three core skill dimensions:
| Skill | Description | What It Tests |
|---|---|---|
| Perception | Accurately interpreting visual information from diverse image types | Can the model correctly read a chart, identify a chemical structure, or parse a circuit diagram? |
| Knowledge | Domain-specific factual understanding at the college level | Does the model know the relevant facts, formulas, definitions, or historical context? |
| Reasoning | Logical inference, problem-solving, and multi-step deduction | Can the model combine visual evidence with domain knowledge to derive the correct answer? |
Many MMMU questions require all three skills simultaneously. For instance, a question about organic chemistry might require recognizing a molecular structure (perception), knowing reaction mechanisms (knowledge), and predicting the product of a specific reaction (reasoning).
Questions in MMMU are categorized by difficulty:
| Difficulty | GPT-4V Accuracy (original paper) | Description |
|---|---|---|
| Easy | 76.1% | Straightforward questions requiring basic recognition and recall |
| Medium | 55.6% | Questions needing moderate domain knowledge and multi-step reasoning |
| Hard | Near random performance | Complex questions requiring deep expertise and sophisticated reasoning |
The sharp drop-off from Easy to Hard questions illustrates that even advanced models struggle significantly once genuine expert-level reasoning is required.
When MMMU was first released, the authors evaluated a range of proprietary and open-source large multimodal models. The results revealed a substantial gap between the best models and human experts:
| Model | Overall Accuracy | Notes |
|---|---|---|
| Human experts | 76.2% to 88.6% | 90 college seniors across 30 subjects |
| Gemini Ultra | 59.4% | Google's top multimodal model at the time |
| GPT-4V | 56.8% | OpenAI's multimodal model |
| BLIP2-FLAN-T5-XXL | ~34% | Leading open-source model at the time |
| LLaVA-1.5 | ~34% | Open-source multimodal model |
| Random guess | 22.3% | Baseline for multiple-choice questions |
The key finding from the original evaluation was the size of the remaining gap: even Gemini Ultra, the best model tested, trailed the weakest human experts by nearly 17 percentage points, while the leading open-source models scored more than 20 points below their proprietary counterparts and barely a dozen points above the random-guess baseline.
Since its release, MMMU has been widely adopted as a standard evaluation benchmark. Top model scores have improved substantially, with the best systems now exceeding 85% accuracy. The following table shows a selection of notable scores from the current leaderboard:
| Rank | Model | Organization | Score |
|---|---|---|---|
| 1 | GPT-5.1 | OpenAI | 85.4% |
| 4 | GPT-5 | OpenAI | 84.2% |
| 5 | Qwen3.5-122B-A10B | Alibaba | 83.9% |
| 6 | o3 | OpenAI | 82.9% |
| 8 | Gemini 2.5 Pro | Google | 82.0% |
| 9 | o4-mini | OpenAI | 81.6% |
| 11 | Gemini 2.5 Flash | Google | 79.7% |
| 14 | Grok-3 | xAI | 78.0% |
| 15 | o1 | OpenAI | 77.6% |
| 18 | Claude 3.7 Sonnet | Anthropic | 75.0% |
| 20 | Claude Sonnet 4 | Anthropic | 74.4% |
| 24 | GPT-4o | OpenAI | 72.2% |
| 27 | Qwen2.5 VL 72B | Alibaba | 70.2% |
| 31 | Claude 3.5 Sonnet | Anthropic | 68.3% |
| 34 | Gemini 1.5 Pro | Google | 65.9% |
| 40 | Llama 3.2 90B | Meta | 60.3% |
Several models now surpass the lower end of human expert performance (76.2%), but the best human experts still outperform all current systems. The top-performing model, GPT-5.1, reaches 85.4%, which falls within the human expert range of 76.2% to 88.6%.
Models consistently show uneven performance across the six disciplines. Humanities and Social Science questions tend to yield the highest scores, while Tech and Engineering questions remain the most challenging:
| Discipline | Typical Top-Model Range | Key Challenge |
|---|---|---|
| Humanities & Social Science | 75% to 85% | Requires cultural and historical knowledge but visual complexity is lower |
| Art & Design | 70% to 80% | Demands aesthetic judgment and art history knowledge |
| Business | 68% to 78% | Financial charts and accounting problems |
| Health & Medicine | 65% to 78% | Complex medical imagery and clinical reasoning |
| Science | 60% to 72% | Diverse scientific diagrams and mathematical reasoning |
| Tech & Engineering | 50% to 65% | Circuit diagrams, engineering schematics, and code |
The type of visual content in a question has a major impact on model accuracy. Models trained primarily on natural images and web content tend to struggle with specialized visual formats:
| Image Type | Best Model Performance | Worst Model Performance | Key Insight |
|---|---|---|---|
| Photos and paintings | 75% to 85% | 40% to 50% | Most familiar image type during training |
| Charts and graphs | 65% to 80% | 35% to 45% | Requires precise numerical reading |
| Chemical structures | 40% to 55% | 15% to 25% | Specialized domain notation |
| Circuit diagrams | 35% to 50% | Near random | Very limited training exposure |
| Music sheets | 25% to 40% | Near random | Extremely rare in training data |
| Geometric shapes | 30% to 45% | Near random | Requires spatial reasoning |
The MMMU authors and subsequent researchers have identified four primary categories of model errors:
| Error Type | Frequency | Description |
|---|---|---|
| Perception errors | ~30% | Misinterpreting visual elements (misreading a chart value, confusing parts of a diagram) |
| Knowledge gaps | ~35% | Lacking the domain-specific information needed to answer correctly |
| Reasoning failures | ~25% | Applying incorrect logical inference or making computational mistakes |
| Integration errors | ~10% | Failing to properly combine visual and textual information |
These error categories are not mutually exclusive. A single incorrect answer may involve both a perception error (misreading part of an image) and a reasoning failure (drawing an incorrect conclusion from the misread data).
MMMU-Pro is a more challenging successor benchmark introduced in September 2024 by a largely overlapping team of researchers, including Xiang Yue, Tianyu Zheng, Yuansheng Ni, and others. The paper was accepted at ACL 2025. MMMU-Pro was designed to address limitations in the original MMMU by filtering out questions that could be solved through shortcuts and introducing harder evaluation conditions.
MMMU-Pro was constructed from the original MMMU dataset through a rigorous three-step process:
Step 1: Text-only filtering. The team used four strong open-source LLMs (Llama3-70B-Instruct, Qwen2-72B-Instruct, Yi-1.5-34B-Chat, and Mixtral-8x22B-Instruct) to identify questions that could be answered correctly without seeing the image. Each model attempted each question text-only across ten trials. Questions that were answered correctly by at least three of the four models more than five times were excluded. This process ensured that every remaining question genuinely requires visual understanding.
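Expressed as code, the exclusion rule might look like the following sketch (an illustration of the rule as stated above, not the authors' actual pipeline):

```python
def keep_question(text_only_results: dict[str, list[bool]]) -> bool:
    """text_only_results maps each of the four LLMs to the outcomes of its
    ten text-only trials on one question (True = answered correctly).
    A question is excluded when at least 3 of the 4 models answer it
    correctly more than 5 times out of 10."""
    models_solving = sum(sum(trials) > 5 for trials in text_only_results.values())
    return models_solving < 3

# A question that three models usually solve without the image gets dropped
results = {"llama3": [True] * 8 + [False] * 2,
           "qwen2": [True] * 7 + [False] * 3,
           "yi": [True] * 6 + [False] * 4,
           "mixtral": [False] * 10}
print(keep_question(results))  # False -> excluded from MMMU-Pro
```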
Step 2: Augmenting candidate options. For the remaining questions, human experts working alongside GPT-4o expanded the multiple-choice options from 4 to 10. This makes random guessing far less effective (10% chance versus 25%) and forces models to discriminate among more plausible distractors. During this phase, 70 additional questions were removed because the image-question relevance was insufficient, leaving 1,730 standard-format questions.
Step 3: Vision-only input setting. Human annotators manually captured screenshots and photographs of the questions displayed on screens, with varying backgrounds, font styles, and font sizes. This created a parallel set of 1,730 "vision-only" questions where the model must extract the question text from the image itself, testing integrated visual and textual processing without separate text input.
The final MMMU-Pro dataset contains 3,460 questions (1,730 standard + 1,730 vision-only), evenly distributed across the same 30 subjects as the original MMMU (approximately 60 questions per subject before vision-only duplication).
Performance on MMMU-Pro is dramatically lower than on the original MMMU. The following table shows results from the original MMMU-Pro paper:
| Model | MMMU (Val) | MMMU-Pro (Overall) | Performance Drop |
|---|---|---|---|
| GPT-4o | 69.1% | 51.9% | -17.2 pp |
| Claude 3.5 Sonnet | 68.3% | 51.5% | -16.8 pp |
| Gemini 1.5 Pro | 65.8% | 46.9% | -18.9 pp |
| Qwen2-VL-72B | 64.5% | 46.2% | -18.3 pp |
| VILA-1.5-40B | 51.9% | 25.0% | -26.9 pp |
The sharp performance drops demonstrate that a significant portion of MMMU accuracy came from shortcuts and guessing strategies rather than genuine multimodal understanding.
As models have improved, MMMU-Pro scores have risen considerably from the original paper's results:
| Rank | Model | Organization | Score |
|---|---|---|---|
| 1 | GPT-5.4 | OpenAI | 81.2% |
| 2 | Gemini 3 Flash | Google | 81.2% |
| 3 | Gemini 3 Pro | Google | 81.0% |
| 7 | GPT-5 | OpenAI | 78.4% |
| 8 | Claude Opus 4.6 | Anthropic | 77.3% |
| 12 | o3 | OpenAI | 76.4% |
| 25 | GPT-4o | OpenAI | 59.9% |
The success of MMMU has led to the development of several related benchmarks, forming a broader "MMMU family" that evaluates different aspects of multimodal understanding:
CMMMU (Chinese Massive Multi-discipline Multimodal Understanding) was released in early 2024 as a Chinese-language counterpart to MMMU. It contains approximately 12,000 manually collected multimodal questions covering the same six disciplines and 30 subjects as MMMU, but sourced from Chinese educational curricula. CMMMU includes 39 heterogeneous image types and tests models on Chinese-specific academic content. Even GPT-4V only achieved approximately 42% accuracy on CMMMU, highlighting the additional challenge of non-English academic evaluation.
Video-MMMU extends the MMMU paradigm to video understanding. Developed by researchers at Nanyang Technological University and Carnegie Mellon University, Video-MMMU contains 300 expert-level college lecture videos and 900 human-annotated questions across the same six disciplines and 30 subjects. The benchmark evaluates knowledge acquisition through three cognitive stages: Perception (identifying key information), Comprehension (understanding underlying concepts), and Adaptation (applying knowledge to novel scenarios). A novel metric called delta-knowledge measures how much a model's performance on related questions improves after it watches an educational video. Human learners achieved a 33.1% knowledge gain, while GPT-4o achieved only 15.6% and Claude 3.5 Sonnet achieved 11.4%, revealing a significant gap in video-based learning capabilities.
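Delta-knowledge is typically described as a normalized gain over the accuracy headroom available before watching the video; a minimal sketch assuming that definition:

```python
def delta_knowledge(acc_before: float, acc_after: float) -> float:
    """Normalized knowledge gain, in percent: the share of the accuracy
    headroom (100 - acc_before) recovered after watching the video.
    Both accuracies are percentages in [0, 100]."""
    return (acc_after - acc_before) / (100.0 - acc_before) * 100.0

# e.g. moving from 40% to 60% accuracy recovers a third of the headroom
print(delta_knowledge(40.0, 60.0))  # 33.33...
```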
Uni-MMMU, published in late 2025, is a unified benchmark that tests bidirectional synergy between generation and understanding across eight reasoning-centric domains including science, coding, mathematics, and puzzles. Each task is bidirectionally coupled, requiring models either to leverage conceptual understanding to guide precise visual synthesis or to use generation as a cognitive scaffold for analytical reasoning.
MMMU has become one of the most widely cited and used benchmarks for multimodal AI evaluation since its release. Its significance stems from its breadth across six disciplines and 30 subjects, its demand for genuine expert-level knowledge rather than surface-level perception, its diversity of more than 30 image types, and its role as a common yardstick on public leaderboards.
Beyond leaderboard comparisons, the capabilities MMMU measures map onto several applied fields:
| Field | Application | MMMU Relevance |
|---|---|---|
| Medicine | Diagnostic assistance and medical education | Medical image interpretation (X-rays, histopathology) |
| Engineering | Design validation and review | Technical drawing and schematic comprehension |
| Finance | Automated report analysis | Chart and data visualization understanding |
| Education | AI tutoring systems and automated assessment | Multi-discipline knowledge evaluation |
| Research | Scientific literature review | Scientific diagram and figure interpretation |
MMMU is publicly available through Hugging Face Datasets:
```python
from datasets import get_dataset_config_names, load_dataset

# MMMU ships one Hugging Face configuration per subject, so the full
# benchmark is loaded by iterating over the 30 subject configs
subjects = get_dataset_config_names("MMMU/MMMU")
dataset = {subject: load_dataset("MMMU/MMMU", subject) for subject in subjects}

# Access specific subjects directly
accounting = load_dataset("MMMU/MMMU", "Accounting")
physics = load_dataset("MMMU/MMMU", "Physics")
```
The dataset is also available for direct download from the Hugging Face repository.
Every question in MMMU includes the following fields:
| Field | Description |
|---|---|
| id | Unique identifier |
| question | Question text |
| options | Multiple-choice answer options (if applicable) |
| answer | Correct answer |
| explanation | Detailed explanation of the correct answer |
| image_1 to image_7 | Up to 7 associated images |
| img_type | Type classification of the primary image |
| topic_difficulty | Difficulty level (Easy, Medium, Hard) |
| question_type | Multiple choice or open-ended |
| subfield | Specific subfield within the subject |
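A short illustrative sketch of reading these fields from a single record (the string-encoded `options` field and the PIL image handling are assumptions about the Hugging Face release, so treat the details as indicative):

```python
import ast
from datasets import load_dataset

# Inspect one validation record from a single subject configuration
val = load_dataset("MMMU/MMMU", "Accounting", split="validation")
record = val[0]

print(record["id"], record["question_type"], record["topic_difficulty"])
print(record["question"])
options = ast.literal_eval(record["options"])  # stored as a string-encoded list
print(options, record["answer"])
record["image_1"].save("example.png")          # PIL image; image_2..image_7 may be None
```

Despite its breadth, MMMU has several acknowledged limitations: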
| Limitation | Description | Impact |
|---|---|---|
| English only | All questions are in English | Does not assess multilingual multimodal capabilities (though CMMMU addresses Chinese) |
| Static dataset | Fixed set of questions that does not change | Models could potentially overfit through repeated evaluation or data contamination |
| US-centric curriculum | Questions drawn primarily from US college materials | May not reflect educational standards in other countries |
| Limited interactivity | Single-turn question answering only | Does not test multi-turn dialogue or iterative problem solving |
| Mostly multiple choice | 94% of questions are multiple choice | May not fully capture depth of understanding; partial credit is not possible |
| No video or audio | Only static images and text | Does not test temporal reasoning or audio understanding (though Video-MMMU addresses this) |
Several research directions build on the foundation laid by MMMU, including harder, shortcut-resistant variants such as MMMU-Pro, extensions to new modalities and languages such as Video-MMMU and CMMMU, and unified generation-understanding evaluations such as Uni-MMMU.