| MMMU | |
|---|---|
| Overview | |
| Full name | Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark |
| Abbreviation | MMMU |
| Description | A massive multi-discipline multimodal benchmark evaluating expert-level understanding and reasoning across college-level subjects |
| Release date | 2023-11 |
| Latest version | 1.0 |
| Benchmark updated | 2023-12-04 |
| Authors | Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, et al. (22 authors) |
| Organizations | Ohio State University, University of Waterloo, IN.AI Research, Carnegie Mellon University |
| Venue | CVPR 2024 (Oral) |
| Technical Details | |
| Type | Multimodal Understanding, Expert Knowledge |
| Modality | Text, Vision (Images) |
| Task format | Multiple choice (94%), Open-ended (6%) |
| Number of questions | 11,550 |
| Data splits | Dev: 150, Validation: 900, Test: 10,500 |
| Subjects | 30 subjects across 183 subfields |
| Image types | 30+ heterogeneous types |
| Evaluation metric | Accuracy (evaluated zero-shot) |
| Domains | Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, Tech & Engineering |
| Languages | English |
| Performance | |
| Human expert range | 76.2% to 88.6% |
| Random guess baseline | 22.3% |
| SOTA score | 85.4% |
| SOTA model | GPT-5.1 |
| SOTA date | 2025 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | arXiv:2311.16502 |
| GitHub | Repository |
| Dataset | Hugging Face |
| Evaluation server | EvalAI |
| Successors | MMMU-Pro, Video-MMMU, CMMMU |
**MMMU** (Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark) is a comprehensive multimodal AI benchmark designed to evaluate models on expert-level understanding and reasoning across college-level academic subjects. Released in November 2023 by a team of 22 researchers led by Xiang Yue at Ohio State University and Wenhu Chen at the University of Waterloo, MMMU contains 11,550 meticulously collected questions sourced from college exams, quizzes, and textbooks. It spans 30 subjects, 183 subfields, and six core academic disciplines, and features over 30 different image types ranging from standard photographs to specialized notations like chemical structures and music sheets. The benchmark was presented as an oral paper at CVPR 2024 and has since become one of the most widely used evaluations for assessing artificial general intelligence capabilities in multimodal contexts.
MMMU addresses a critical gap in AI evaluation by testing models on tasks that require both advanced visual perception and domain-specific knowledge reasoning. Unlike earlier multimodal benchmarks that focused on elementary visual understanding (identifying objects in photos, reading text from signs, or answering simple questions about natural images), MMMU demands college-level subject expertise combined with sophisticated reasoning about diverse visual content. The questions mirror the kind of problems students encounter in university courses across the sciences, humanities, engineering, and professional fields.
The benchmark is named for its defining characteristics: "Massive" refers to the scale of 11,550 questions; "Multi-discipline" indicates coverage across six broad academic disciplines; "Multimodal" highlights that questions combine text and images; and "Understanding and Reasoning" reflects the higher-order cognitive skills required. The full title is the Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI, signaling the authors' goal of measuring progress toward human-expert-level artificial general intelligence.
The development of MMMU was motivated by several observations about the state of multimodal AI evaluation in 2023: existing benchmarks concentrated on elementary visual understanding, rarely required college-level domain expertise, and drew their visual content almost exclusively from natural photographs rather than the specialized formats used in academic and professional work.
The authors explicitly framed MMMU as a tool for the research community to measure progress toward "expert AGI," arguing that the ability to reason across multiple academic disciplines using varied visual inputs represents a meaningful milestone on the path to general intelligence.
MMMU was built through a large-scale, human-driven collection effort. Over 50 college students from diverse academic backgrounds participated in gathering questions from textbooks, online educational resources, college exams, and lecture materials. The collection process followed strict guidelines to ensure quality, diversity, and difficulty.
Each question in the dataset was required to include at least one image that is essential to answering the question correctly. This design choice ensures that models cannot simply rely on text-based reasoning to bypass the visual component. The team collected questions from a wide range of sources, including university-level textbooks published by major academic publishers, past examination papers from accredited institutions, online educational platforms, and lecture slides from college courses.
The dataset then underwent a multi-stage review process in which the core team screened the collected questions for formatting problems, ambiguity, duplicates, and incorrect answers before final inclusion.
The 11,550 questions are divided into three splits:
| Split | Size | Purpose |
|---|---|---|
| Development (dev) | 150 | Few-shot and in-context learning experiments |
| Validation | 900 | Debugging models, selecting hyperparameters, and quick evaluations |
| Test | 10,500 | Official evaluation (answers withheld; submission via EvalAI server) |
The development set contains 5 questions per subject (150 total across 30 subjects). The validation set contains 30 questions per subject (900 total). The test set holds the remaining 10,500 questions, and its answer labels were kept private until February 2026, when the test set answers were publicly released.
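The split sizes can be verified per subject through the Hugging Face `datasets` library; a minimal sketch, assuming the repository's standard configuration and split names (full loading examples appear later in this article):

```python
from datasets import load_dataset

# Each subject configuration carries all three official splits
physics = load_dataset("MMMU/MMMU", "Physics")
print(physics["dev"].num_rows)         # 5 dev questions per subject
print(physics["validation"].num_rows)  # 30 validation questions per subject
print(physics["test"].num_rows)        # the remainder; answer labels withheld
```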
To establish a meaningful human performance baseline, the MMMU team recruited 90 college senior students, with 3 experts assigned to each of the 30 subjects. Each expert completed 30 questions from the validation set within their discipline. Experts were allowed to consult their textbooks but were prohibited from searching the Internet for answers. Human expert accuracy ranged from 76.2% to 88.6% across subjects, providing a target for AI models to match or exceed.
MMMU covers six core academic disciplines, each containing multiple subjects:
| Discipline | Subjects | Subject Count |
|---|---|---|
| Art & Design | Art, Art Theory, Design, Music | 4 |
| Business | Accounting, Economics, Finance, Management, Marketing | 5 |
| Science | Biology, Chemistry, Geography, Math, Physics | 5 |
| Health & Medicine | Basic Medical Science, Clinical Medicine, Diagnostics & Laboratory Medicine, Pharmacy, Public Health | 5 |
| Humanities & Social Science | History, Literature, Psychology, Sociology | 4 |
| Tech & Engineering | Agriculture, Architecture & Engineering, Computer Science, Electronics, Energy & Power, Materials, Mechanical Engineering | 7 |
The 30 subjects are further broken down into 183 subfields. For example, within Physics, subfields include classical mechanics, thermodynamics, electromagnetism, optics, and quantum physics. This granularity ensures that the benchmark captures a wide spectrum of college-level knowledge.
One of MMMU's distinguishing features is its inclusion of over 30 heterogeneous image types. Most prior benchmarks focused on natural photographs, but MMMU deliberately includes many specialized visual formats that professionals and students encounter in their fields:
| Category | Image Types | Typical Disciplines |
|---|---|---|
| Photographs & artwork | Natural photos, paintings, sculptures, sketches | Art & Design, Humanities |
| Scientific diagrams | Biological diagrams, chemical structures, physics diagrams, molecular models | Science, Health & Medicine |
| Data visualizations | Bar charts, line graphs, pie charts, heatmaps, scatter plots, tables | Business, Science, Engineering |
| Technical drawings | Circuit diagrams, architectural blueprints, flowcharts, engineering schematics | Tech & Engineering |
| Maps & geography | Topographic maps, political maps, climate maps, geological cross-sections | Science, Humanities |
| Specialized notation | Music sheets, mathematical proofs, code snippets | Art & Design, Science, Engineering |
| Medical imagery | X-rays, MRI scans, CT scans, histopathology slides, microscopy images | Health & Medicine |
| 3D representations | 3D models, CAD renderings, crystal structures | Engineering, Science |
Each question can include up to seven images, allowing the benchmark to test reasoning about complex multi-image scenarios such as comparing two X-rays or analyzing a series of related diagrams.
| Question Type | Approximate Percentage | Description |
|---|---|---|
| Multiple choice | ~94% | Select the correct answer from 4 or 5 options |
| Open-ended | ~6% | Provide a short numerical or textual answer |
The heavy emphasis on multiple-choice questions allows for automated and unambiguous evaluation. Open-ended questions are included to test whether models can generate correct answers without the benefit of answer choices.
MMMU employs strict zero-shot evaluation: models are tested without benchmark-specific fine-tuning or few-shot demonstrations, so reported scores reflect out-of-the-box capability. For multiple-choice questions, the predicted option letter is parsed from the model's free-form response and compared against the gold answer.
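A minimal sketch of such a scoring loop for the multiple-choice portion (the regex-based letter extraction below is an illustrative simplification, not the official evaluation script):

```python
import re

def extract_choice(response: str, options: list[str]) -> str | None:
    """Pull the first valid option letter (A, B, ...) out of a free-form response."""
    valid = {chr(ord("A") + i) for i in range(len(options))}
    for token in re.findall(r"\b([A-J])\b", response):
        if token in valid:
            return token
    return None

def accuracy(responses: list[str], records: list[dict]) -> float:
    """Exact-match accuracy over records carrying 'options' and 'answer' fields."""
    correct = sum(
        extract_choice(resp, rec["options"]) == rec["answer"]
        for resp, rec in zip(responses, records)
    )
    return correct / len(records)

# Example: two model responses scored against gold answers
records = [
    {"options": ["2", "4", "6", "8"], "answer": "B"},
    {"options": ["red", "blue", "green", "yellow"], "answer": "D"},
]
print(accuracy(["The answer is B.", "I would choose C."], records))  # 0.5
```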
The benchmark is designed to evaluate three core skill dimensions:
| Skill | Description | What It Tests |
|---|---|---|
| Perception | Accurately interpreting visual information from diverse image types | Can the model correctly read a chart, identify a chemical structure, or parse a circuit diagram? |
| Knowledge | Domain-specific factual understanding at the college level | Does the model know the relevant facts, formulas, definitions, or historical context? |
| Reasoning | Logical inference, problem-solving, and multi-step deduction | Can the model combine visual evidence with domain knowledge to derive the correct answer? |
Many MMMU questions require all three skills simultaneously. For instance, a question about organic chemistry might require recognizing a molecular structure (perception), knowing reaction mechanisms (knowledge), and predicting the product of a specific reaction (reasoning).
Questions in MMMU are categorized by difficulty:
| Difficulty | GPT-4V Accuracy (original paper) | Description |
|---|---|---|
| Easy | 76.1% | Straightforward questions requiring basic recognition and recall |
| Medium | 55.6% | Questions needing moderate domain knowledge and multi-step reasoning |
| Hard | Near random performance | Complex questions requiring deep expertise and sophisticated reasoning |
The sharp drop-off from Easy to Hard questions illustrates that even advanced models struggle significantly once genuine expert-level reasoning is required.
When MMMU was first released, the authors evaluated a range of proprietary and open-source large multimodal models. The results revealed a substantial gap between the best models and human experts:
| Model | Overall Accuracy | Notes |
|---|---|---|
| Human experts | 76.2% to 88.6% | 90 college seniors across 30 subjects |
| Gemini Ultra | 59.4% | Google's top multimodal model at the time |
| GPT-4V | 56.8% | OpenAI's multimodal model |
| BLIP2-FLAN-T5-XXL | ~34% | Leading open-source model at the time |
| LLaVA-1.5 | ~34% | Open-source multimodal model |
| Random guess | 22.3% | Baseline for multiple-choice questions |
The key finding from the original evaluation was the size of the remaining gap: even Gemini Ultra, the best model tested, trailed the weakest human experts by nearly 17 percentage points, while the leading open-source models scored more than 20 points below their proprietary counterparts and barely a dozen points above the random-guess baseline.
Since its release, MMMU has been widely adopted as a standard evaluation benchmark. Top model scores have improved substantially, with the best systems now exceeding 85% accuracy. The following table shows a selection of notable scores from the current leaderboard:
| Rank | Model | Organization | Score |
|---|---|---|---|
| 1 | GPT-5.1 | OpenAI | 85.4% |
| 4 | GPT-5 | OpenAI | 84.2% |
| 5 | Qwen3.5-122B-A10B | Alibaba | 83.9% |
| 6 | o3 | OpenAI | 82.9% |
| 8 | Gemini 2.5 Pro | Google | 82.0% |
| 9 | o4-mini | OpenAI | 81.6% |
| 11 | Gemini 2.5 Flash | Google | 79.7% |
| 14 | Grok-3 | xAI | 78.0% |
| 15 | o1 | OpenAI | 77.6% |
| 18 | Claude 3.7 Sonnet | Anthropic | 75.0% |
| 20 | Claude Sonnet 4 | Anthropic | 74.4% |
| 24 | GPT-4o | OpenAI | 72.2% |
| 27 | Qwen2.5 VL 72B | Alibaba | 70.2% |
| 31 | Claude 3.5 Sonnet | Anthropic | 68.3% |
| 34 | Gemini 1.5 Pro | Google | 65.9% |
| 40 | Llama 3.2 90B | Meta | 60.3% |
Several models now surpass the lower end of human expert performance (76.2%), but the best human experts still outperform all current systems. The top-performing model, GPT-5.1, reaches 85.4%, which falls within the human expert range of 76.2% to 88.6%.
Models consistently show uneven performance across the six disciplines. Humanities and Social Science questions tend to yield the highest scores, while Tech and Engineering questions remain the most challenging:
| Discipline | Typical Top-Model Range | Key Challenge |
|---|---|---|
| Humanities & Social Science | 75% to 85% | Requires cultural and historical knowledge but visual complexity is lower |
| Art & Design | 70% to 80% | Demands aesthetic judgment and art history knowledge |
| Business | 68% to 78% | Financial charts and accounting problems |
| Health & Medicine | 65% to 78% | Complex medical imagery and clinical reasoning |
| Science | 60% to 72% | Diverse scientific diagrams and mathematical reasoning |
| Tech & Engineering | 50% to 65% | Circuit diagrams, engineering schematics, and code |
The type of visual content in a question has a major impact on model accuracy. Models trained primarily on natural images and web content tend to struggle with specialized visual formats:
| Image Type | Best Model Performance | Worst Model Performance | Key Insight |
|---|---|---|---|
| Photos and paintings | 75% to 85% | 40% to 50% | Most familiar image type during training |
| Charts and graphs | 65% to 80% | 35% to 45% | Requires precise numerical reading |
| Chemical structures | 40% to 55% | 15% to 25% | Specialized domain notation |
| Circuit diagrams | 35% to 50% | Near random | Very limited training exposure |
| Music sheets | 25% to 40% | Near random | Extremely rare in training data |
| Geometric shapes | 30% to 45% | Near random | Requires spatial reasoning |
The MMMU authors and subsequent researchers have identified four primary categories of model errors:
| Error Type | Frequency | Description |
|---|---|---|
| Perception errors | ~30% | Misinterpreting visual elements (misreading a chart value, confusing parts of a diagram) |
| Knowledge gaps | ~35% | Lacking the domain-specific information needed to answer correctly |
| Reasoning failures | ~25% | Applying incorrect logical inference or making computational mistakes |
| Integration errors | ~10% | Failing to properly combine visual and textual information |
These error categories are not mutually exclusive. A single incorrect answer may involve both a perception error (misreading part of an image) and a reasoning failure (drawing an incorrect conclusion from the misread data).
MMMU-Pro is a more challenging successor benchmark introduced in September 2024 by a largely overlapping team of researchers, including Xiang Yue, Tianyu Zheng, Yuansheng Ni, and others. The paper was accepted at ACL 2025. MMMU-Pro was designed to address limitations in the original MMMU by filtering out questions that could be solved through shortcuts and introducing harder evaluation conditions.
MMMU-Pro was constructed from the original MMMU dataset through a rigorous three-step process:
Step 1: Text-only filtering. The team used four strong open-source LLMs (Llama3-70B-Instruct, Qwen2-72B-Instruct, Yi-1.5-34B-Chat, and Mixtral-8x22B-Instruct) to identify questions that could be answered correctly without seeing the image. Each model attempted each question text-only across ten trials. Questions that were answered correctly by at least three of the four models more than five times were excluded. This process ensured that every remaining question genuinely requires visual understanding.
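Expressed as code, the exclusion rule might look like the following sketch (an illustration of the rule as stated above, not the authors' actual pipeline):

```python
def keep_question(text_only_results: dict[str, list[bool]]) -> bool:
    """text_only_results maps each of the four LLMs to the outcomes of its
    ten text-only trials on one question (True = answered correctly).
    A question is excluded when at least 3 of the 4 models answer it
    correctly more than 5 times out of 10."""
    models_solving = sum(sum(trials) > 5 for trials in text_only_results.values())
    return models_solving < 3

# A question that three models usually solve without the image gets dropped
results = {"llama3": [True] * 8 + [False] * 2,
           "qwen2": [True] * 7 + [False] * 3,
           "yi": [True] * 6 + [False] * 4,
           "mixtral": [False] * 10}
print(keep_question(results))  # False -> excluded from MMMU-Pro
```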
Step 2: Augmenting candidate options. For the remaining questions, human experts working alongside GPT-4o expanded the multiple-choice options from 4 to 10. This makes random guessing far less effective (10% chance versus 25%) and forces models to discriminate among more plausible distractors. During this phase, 70 additional questions were removed because the image-question relevance was insufficient, leaving 1,730 standard-format questions.
Step 3: Vision-only input setting. Human annotators manually captured screenshots and photographs of the questions displayed on screens, with varying backgrounds, font styles, and font sizes. This created a parallel set of 1,730 "vision-only" questions where the model must extract the question text from the image itself, testing integrated visual and textual processing without separate text input.
The final MMMU-Pro dataset contains 3,460 questions (1,730 standard + 1,730 vision-only), evenly distributed across the same 30 subjects as the original MMMU (approximately 60 questions per subject before vision-only duplication).
Performance on MMMU-Pro is dramatically lower than on the original MMMU. The following table shows results from the original MMMU-Pro paper:
| Model | MMMU (Val) | MMMU-Pro (Overall) | Performance Drop |
|---|---|---|---|
| GPT-4o | 69.1% | 51.9% | -17.2 pp |
| Claude 3.5 Sonnet | 68.3% | 51.5% | -16.8 pp |
| Gemini 1.5 Pro | 65.8% | 46.9% | -18.9 pp |
| Qwen2-VL-72B | 64.5% | 46.2% | -18.3 pp |
| VILA-1.5-40B | 51.9% | 25.0% | -26.9 pp |
The sharp performance drops demonstrate that a significant portion of MMMU accuracy came from shortcuts and guessing strategies rather than genuine multimodal understanding.
As models have improved, MMMU-Pro scores have risen considerably from the original paper's results:
| Rank | Model | Organization | Score |
|---|---|---|---|
| 1 | GPT-5.4 | OpenAI | 81.2% |
| 2 | Gemini 3 Flash | Google | 81.2% |
| 3 | Gemini 3 Pro | Google | 81.0% |
| 7 | GPT-5 | OpenAI | 78.4% |
| 8 | Claude Opus 4.6 | Anthropic | 77.3% |
| 12 | o3 | OpenAI | 76.4% |
| 25 | GPT-4o | OpenAI | 59.9% |
The success of MMMU has led to the development of several related benchmarks, forming a broader "MMMU family" that evaluates different aspects of multimodal understanding:
CMMMU (Chinese Massive Multi-discipline Multimodal Understanding) was released in early 2024 as a Chinese-language counterpart to MMMU. It contains approximately 12,000 manually collected multimodal questions covering the same six disciplines and 30 subjects as MMMU, but sourced from Chinese educational curricula. CMMMU includes 39 heterogeneous image types and tests models on Chinese-specific academic content. Even GPT-4V only achieved approximately 42% accuracy on CMMMU, highlighting the additional challenge of non-English academic evaluation.
Video-MMMU extends the MMMU paradigm to video understanding. Developed by researchers at Nanyang Technological University and Carnegie Mellon University, Video-MMMU contains 300 expert-level college lecture videos and 900 human-annotated questions across the same six disciplines and 30 subjects. The benchmark evaluates knowledge acquisition through three cognitive stages: Perception (identifying key information), Comprehension (understanding underlying concepts), and Adaptation (applying knowledge to novel scenarios). A novel metric called delta-knowledge measures how much a model's performance on related questions improves after it watches an educational video. Human learners achieved a 33.1% knowledge gain, while GPT-4o achieved only 15.6% and Claude 3.5 Sonnet achieved 11.4%, revealing a significant gap in video-based learning capabilities.
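Delta-knowledge is typically described as a normalized gain over the accuracy headroom available before watching the video; a minimal sketch assuming that definition:

```python
def delta_knowledge(acc_before: float, acc_after: float) -> float:
    """Normalized knowledge gain, in percent: the share of the accuracy
    headroom (100 - acc_before) recovered after watching the video.
    Both accuracies are percentages in [0, 100]."""
    return (acc_after - acc_before) / (100.0 - acc_before) * 100.0

# e.g. moving from 40% to 60% accuracy recovers a third of the headroom
print(delta_knowledge(40.0, 60.0))  # 33.33...
```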
Uni-MMMU, published in late 2025, is a unified benchmark that tests bidirectional synergy between generation and understanding across eight reasoning-centric domains including science, coding, mathematics, and puzzles. Each task is bidirectionally coupled, requiring models either to leverage conceptual understanding to guide precise visual synthesis or to use generation as a cognitive scaffold for analytical reasoning.
MMMU has become one of the most widely cited and used benchmarks for multimodal AI evaluation since its release. Its significance stems from its breadth across six disciplines and 30 subjects, its demand for genuine expert-level knowledge rather than surface-level perception, its diversity of more than 30 image types, and its role as a common yardstick on public leaderboards.
Beyond leaderboard comparisons, the capabilities MMMU measures map onto several applied fields:
| Field | Application | MMMU Relevance |
|---|---|---|
| Medicine | Diagnostic assistance and medical education | Medical image interpretation (X-rays, histopathology) |
| Engineering | Design validation and review | Technical drawing and schematic comprehension |
| Finance | Automated report analysis | Chart and data visualization understanding |
| Education | AI tutoring systems and automated assessment | Multi-discipline knowledge evaluation |
| Research | Scientific literature review | Scientific diagram and figure interpretation |
MMMU is publicly available through Hugging Face Datasets:
```python
from datasets import get_dataset_config_names, load_dataset

# MMMU ships one Hugging Face configuration per subject, so the full
# benchmark is loaded by iterating over the 30 subject configs
subjects = get_dataset_config_names("MMMU/MMMU")
dataset = {subject: load_dataset("MMMU/MMMU", subject) for subject in subjects}

# Access specific subjects directly
accounting = load_dataset("MMMU/MMMU", "Accounting")
physics = load_dataset("MMMU/MMMU", "Physics")
```
The dataset is also available for direct download from the Hugging Face repository.
Every question in MMMU includes the following fields:
| Field | Description |
|---|---|
| id | Unique identifier |
| question | Question text |
| options | Multiple-choice answer options (if applicable) |
| answer | Correct answer |
| explanation | Detailed explanation of the correct answer |
| image_1 to image_7 | Up to 7 associated images |
| img_type | Type classification of the primary image |
| topic_difficulty | Difficulty level (Easy, Medium, Hard) |
| question_type | Multiple choice or open-ended |
| subfield | Specific subfield within the subject |
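A short illustrative sketch of reading these fields from a single record (the string-encoded `options` field and the PIL image handling are assumptions about the Hugging Face release, so treat the details as indicative):

```python
import ast
from datasets import load_dataset

# Inspect one validation record from a single subject configuration
val = load_dataset("MMMU/MMMU", "Accounting", split="validation")
record = val[0]

print(record["id"], record["question_type"], record["topic_difficulty"])
print(record["question"])
options = ast.literal_eval(record["options"])  # stored as a string-encoded list
print(options, record["answer"])
record["image_1"].save("example.png")          # PIL image; image_2..image_7 may be None
```

Despite its breadth, MMMU has several acknowledged limitations: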
| Limitation | Description | Impact |
|---|---|---|
| English only | All questions are in English | Does not assess multilingual multimodal capabilities (though CMMMU addresses Chinese) |
| Static dataset | Fixed set of questions that does not change | Models could potentially overfit through repeated evaluation or data contamination |
| US-centric curriculum | Questions drawn primarily from US college materials | May not reflect educational standards in other countries |
| Limited interactivity | Single-turn question answering only | Does not test multi-turn dialogue or iterative problem solving |
| Mostly multiple choice | 94% of questions are multiple choice | May not fully capture depth of understanding; partial credit is not possible |
| No video or audio | Only static images and text | Does not test temporal reasoning or audio understanding (though Video-MMMU addresses this) |
Several research directions build on the foundation laid by MMMU, including harder, shortcut-resistant variants such as MMMU-Pro, extensions to new modalities and languages such as Video-MMMU and CMMMU, and unified generation-understanding evaluations such as Uni-MMMU.