| MMMU | |
|---|---|
| Overview | |
| Full name | Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark |
| Abbreviation | MMMU |
| Description | A massive multi-discipline multimodal benchmark evaluating expert-level understanding and reasoning across college-level subjects |
| Release date | 2023-11 |
| Latest version | 1.0 |
| Benchmark updated | 2023-12-04 |
| Authors | Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, et al. |
| Organization | Ohio State University, University of Waterloo |
| Technical Details | |
| Type | Multimodal Understanding, Expert Knowledge |
| Modality | Text, Vision (Images) |
| Task format | Multiple choice, Open-ended |
| Number of subjects | 30 (183 subfields) |
| Total examples | 11,500 |
| Evaluation metric | Accuracy (zero-shot) |
| Domains | Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, Tech & Engineering |
| Languages | English |
| Performance | |
| Human performance | 88.6% (high-performing human experts, validation set) |
| Baseline | 22.3% (Random guess baseline) |
| SOTA score | 69.1% |
| SOTA model | GPT-4o |
| SOTA date | 2024 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| Successor | MMMU-Pro |
MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark) is a comprehensive multimodal AI benchmark designed to evaluate models' expert-level understanding and reasoning capabilities across college-level subjects. Released in November 2023 by researchers from Ohio State University and the University of Waterloo, MMMU contains 11,500 meticulously collected questions from college exams, quizzes, and textbooks, making it one of the most challenging benchmarks for assessing progress toward artificial general intelligence in multimodal contexts.
MMMU addresses a critical gap in AI evaluation by testing models on tasks that require both advanced visual perception and domain-specific knowledge reasoning. Unlike existing multimodal benchmarks that focus on elementary visual understanding, MMMU demands college-level subject expertise combined with sophisticated reasoning about diverse visual content including charts, diagrams, maps, tables, music sheets, and chemical structures.
MMMU was developed to test whether AI systems can reach human expert-level performance across multiple academic disciplines, a milestone widely regarded as a crucial step toward AGI.
MMMU's 11,500 questions are distributed across six core disciplines:
| Discipline | Number of Subjects | Question Count | Example Topics |
|---|---|---|---|
| Art & Design | 4 | ~1,900 | Art History, Design Principles, Music Theory |
| Business | 5 | ~1,900 | Accounting, Economics, Finance, Management |
| Science | 5 | ~1,900 | Biology, Chemistry, Physics, Geography |
| Health & Medicine | 5 | ~1,900 | Clinical Medicine, Basic Medicine, Diagnostics |
| Humanities & Social Science | 4 | ~1,900 | History, Literature, Psychology, Sociology |
| Tech & Engineering | 7 | ~1,900 | Computer Science, Electronics, Materials Science |
The benchmark spans 30 subjects and 183 subfields, ensuring comprehensive coverage of college-level knowledge.
MMMU includes 30 heterogeneous image types, making it uniquely challenging:
| Category | Image Types | Frequency | Challenge Level |
|---|---|---|---|
| Common Visuals | Photos, Paintings, Sketches | High | Low-Medium |
| Scientific Diagrams | Chemical structures, Biological diagrams, Physics diagrams | Medium | High |
| Data Visualizations | Charts, Graphs, Tables, Heatmaps | High | Medium |
| Technical Drawings | Blueprints, Circuit diagrams, Flowcharts | Medium | High |
| Maps & Geography | Topographic maps, Political maps, Weather maps | Medium | Medium |
| Specialized Notation | Music sheets, Mathematical proofs, Code snippets | Low | Very High |
| Medical Imagery | X-rays, MRI scans, Microscopy images | Low | Very High |
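As a quick illustration, image-type metadata can be tallied per subject. The sketch below assumes the per-record `img_type` field described on the Hugging Face dataset card, stored as a stringified Python list; treat it as illustrative rather than official tooling.

```python
import ast
from collections import Counter

from datasets import load_dataset

# Count image types within one subject's validation split. Assumes the
# "img_type" field holds a stringified list, e.g. "['Chemical Structures']".
subset = load_dataset("MMMU/MMMU", "Chemistry", split="validation")
type_counts = Counter(
    img_type
    for example in subset
    for img_type in ast.literal_eval(example["img_type"])
)
print(type_counts.most_common())
```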
The approximate distribution of question formats:

| Question Type | Percentage | Description | Example |
|---|---|---|---|
| Multiple Choice | 70% | Select from 4-5 options | "Which chemical structure represents benzene?" |
| Multiple Response | 15% | Select all correct answers | "Identify all impressionist paintings" |
| Fill-in-the-blank | 10% | Complete missing information | "The GDP formula is: GDP = C + I + G + ___" |
| Open-ended | 5% | Short answer responses | "Explain the mechanism shown in the diagram" |
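For reference, the fill-in-the-blank example above completes to the standard expenditure identity; the missing term is net exports:

$$\text{GDP} = C + I + G + NX, \qquad NX = \text{exports} - \text{imports}$$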
MMMU employs strict zero-shot evaluation: models answer each question without task-specific fine-tuning or in-context exemplars (a minimal scoring sketch follows the skills table below).
The benchmark evaluates three core skills:
| Skill | Description | Weight | Assessment Method |
|---|---|---|---|
| Perception | Ability to accurately interpret visual information | 30% | Image recognition accuracy |
| Knowledge | Domain-specific factual understanding | 35% | Subject matter correctness |
| Reasoning | Logical inference and problem-solving | 35% | Multi-step reasoning tasks |
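As an illustration of this protocol, the sketch below scores multiple-choice items by exact match on the predicted option letter. The `format_prompt` and `query_model` names and the record fields are hypothetical stand-ins, not MMMU's official evaluation harness:

```python
def format_prompt(question: str, options: list[str]) -> str:
    """Render a question and lettered options as a single zero-shot prompt."""
    letters = "ABCDEFGHIJ"
    lines = [question]
    lines += [f"({letters[i]}) {option}" for i, option in enumerate(options)]
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)


def accuracy(examples, query_model) -> float:
    """Exact-match accuracy; `query_model` returns a letter such as 'B'."""
    correct = 0
    for ex in examples:
        prompt = format_prompt(ex["question"], ex["options"])
        prediction = query_model(prompt, ex["images"])
        correct += prediction.strip().upper().startswith(ex["answer"])
    return correct / len(examples)
```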
Representative scores on the MMMU validation set:

| Rank | Model | Overall Score | Art & Design | Business | Science | Health & Medicine | Humanities | Tech & Engineering |
|---|---|---|---|---|---|---|---|---|
| 1 | GPT-4o | 69.1% | 72.3% | 71.5% | 67.2% | 74.8% | 78.9% | 59.8% |
| 2 | Claude 3.5 Sonnet | 68.3% | 70.8% | 69.4% | 66.1% | 73.2% | 77.5% | 58.4% |
| 3 | Gemini 1.5 Pro | 62.2% | 64.5% | 65.3% | 60.8% | 68.9% | 73.1% | 52.7% |
| 4 | Claude 3 Opus | 59.4% | 63.2% | 62.8% | 57.3% | 65.7% | 71.8% | 48.9% |
| 5 | Gemini Ultra | 59.4% | 58.1% | 62.7% | 56.9% | 71.3% | 78.3% | 53.0% |
| 6 | GPT-4V | 56.8% | 65.3% | 64.3% | 54.7% | 63.5% | 76.3% | 41.7% |
| 7 | Qwen-VL-MAX | 46.8% | 48.3% | 47.9% | 45.2% | 50.1% | 55.4% | 38.7% |
Models show significant performance variation across image types:
| Image Type | Best Performance | Worst Performance | Performance Gap |
|---|---|---|---|
| Photos & Paintings | 75-80% | 40-45% | 35% |
| Charts & Graphs | 65-70% | 35-40% | 30% |
| Chemical Structures | 35-40% | 15-20% | 20% |
| Music Sheets | 25-30% | 10-15% | 15% |
| Geometric Shapes | 30-35% | Near random | 30% |
Key findings:

1. **Performance Ceiling**: Even the best models (GPT-4o) achieve only ~69% accuracy
2. **Domain Disparities**: Models excel in Humanities (75-79%) but struggle with Tech & Engineering (41-60%)
3. **Visual Generalization**: Poor performance on uncommon image types indicates limited visual generalization
4. **Open-Source Gap**: Significant performance gap between proprietary and open-source models (20-25%)
MMMU-Pro is a more challenging variant introduced in 2024:
| Model | MMMU Score | MMMU-Pro Score | Performance Drop |
|---|---|---|---|
| GPT-4o | 69.1% | 26.9% | -42.2% |
| Claude 3.5 Sonnet | 68.3% | 25.8% | -42.5% |
| Gemini 1.5 Pro | 62.2% | 22.3% | -39.9% |
| Open-source Best | 46.8% | 16.8% | -30.0% |
Several factors make MMMU especially difficult for current models:

| Challenge | Description | Impact | Example |
|---|---|---|---|
| Domain-Specific Notation | Understanding specialized symbols and conventions | High error rates | Musical notation, chemical formulas |
| Multi-step Reasoning | Complex problems requiring sequential logic | 40-50% failure rate | Physics problem solving |
| Cross-modal Integration | Combining visual and textual information | Inconsistent performance | Diagram-based questions |
| Rare Visual Formats | Processing uncommon image types | Near-random performance | Circuit diagrams, music sheets |
Common failure modes include:

1. **Perception Errors** (30%): Misinterpreting visual elements
2. **Knowledge Gaps** (35%): Lacking domain-specific information
3. **Reasoning Failures** (25%): Incorrect logical inference
4. **Integration Errors** (10%): Failing to combine visual and textual cues
MMMU's breadth supports several research directions and maps onto real-world application areas:
| Field | Application | MMMU Relevance |
|---|---|---|
| Medicine | Diagnostic assistance | Medical image interpretation |
| Engineering | Design validation | Technical drawing comprehension |
| Finance | Report analysis | Chart and data visualization understanding |
| Education | Content creation | Multi-discipline knowledge integration |
| Research | Literature review | Scientific diagram interpretation |
The dataset is hosted on Hugging Face; each of the 30 subjects is exposed as a separate configuration:

```python
from datasets import load_dataset

# Load a single subject's validation split; "Art" is one of the 30 subject configs.
dataset = load_dataset("MMMU/MMMU", "Art", split="validation")
```
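A record can then be inspected directly; the field names below follow the dataset card and may differ across dataset versions:

```python
import ast

example = dataset[0]
print(example["question"])  # question text; may embed "<image 1>"-style placeholders
options = ast.literal_eval(example["options"])  # options ship as a stringified list
print(options)
print(example["answer"])  # gold option letter, e.g. "B"
```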
Despite its breadth, MMMU has several acknowledged limitations:

| Limitation | Description | Impact |
|---|---|---|
| English Only | Questions in English language | Limited global applicability |
| Static Dataset | Fixed set of questions | Potential for overfitting |
| College Focus | US college curriculum bias | May not reflect global standards |
| Limited Interactivity | No multi-turn reasoning | Doesn't test dialogue capabilities |
| Answer Format | Mostly multiple choice | May not capture full understanding |
Proposed future directions include:

1. **Multilingual Extension**: Versions in other languages
2. **Dynamic Generation**: Procedurally generated questions
3. **Interactive Tasks**: Multi-turn problem solving
4. **Video Understanding**: Extension to video content
5. **Real-time Updates**: Incorporating current events and discoveries