VideoMMMU
| VideoMMMU | |
|---|---|
| Overview | |
| Full name | Video Multi-Modal Multi-disciplinary Understanding |
| Abbreviation | VideoMMMU, Video-MMMU |
| Description | A multi-modal benchmark evaluating knowledge acquisition from professional educational videos across six disciplines |
| Release date | 2025-01-23 |
| Latest version | 1.0 |
| Benchmark updated | 2025-01 |
| Authors | Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, Ziwei Liu |
| Organization | EvolvingLMMs-Lab |
| Technical Details | |
| Type | Video Understanding, Knowledge Acquisition, Multi-modal Learning |
| Modality | Video, Image, Text, Audio |
| Task format | Multiple-choice questions from educational videos |
| Number of tasks | 6 disciplines, 30 subjects |
| Total examples | 900 questions (300 videos) |
| Evaluation metric | Accuracy, Δknowledge (normalized learning gain) |
| Domains | Art, Business, Science, Medicine, Humanities, Engineering |
| Languages | English |
| Performance | |
| Human performance | 74.44% (Human Expert, HuggingFace leaderboard) |
| Baseline | ~50% (random guess baseline) |
| SOTA score | 84.6% (GitHub) / 65.78% (HuggingFace) |
| SOTA model | GPT-5-thinking (GitHub) / Claude-3.5-Sonnet (HuggingFace) |
| SOTA date | 2025 |
| Saturated | No |
| Resources | |
| Website | https://videommmu.github.io/ |
| Paper | arXiv:2501.13826 |
| GitHub | https://github.com/EvolvingLMMs-Lab/VideoMMMU |
| Dataset | HuggingFace (agreement to terms required) |
| License | Terms and conditions required |
VideoMMMU (Video Multi-Modal Multi-disciplinary Understanding) is a benchmark designed to evaluate Large Multimodal Models (LMMs) on their ability to acquire and apply knowledge from professional educational videos. Released on January 23, 2025, by EvolvingLMMs-Lab[1], VideoMMMU uniquely focuses on measuring how well AI systems can learn from educational content rather than just comprehend it. The benchmark comprises 300 expert-level lecture-style videos across six professional disciplines with 900 human-annotated questions, introducing the innovative Δknowledge metric to quantify normalized learning gains.
Overview
VideoMMMU represents a paradigm shift in video understanding evaluation by treating videos as knowledge sources rather than mere content to comprehend. Unlike traditional video benchmarks that focus on perception and basic understanding, VideoMMMU systematically assesses whether AI models can actually acquire new knowledge from educational videos and apply it to novel scenarios. This approach mirrors human learning processes where watching educational content leads to knowledge acquisition that can be transferred to new situations[1].
The benchmark addresses a critical gap in evaluating artificial intelligence systems' educational capabilities. As AI increasingly assists in education and professional training, the ability to learn from video content becomes essential. VideoMMMU reveals that even state-of-the-art models like GPT-4o show limited learning gains (15.6% Δknowledge), highlighting significant room for improvement in this crucial capability.
Significance
VideoMMMU's importance stems from several groundbreaking contributions:
- **First Knowledge Acquisition Benchmark**: Pioneering evaluation of learning from videos rather than just understanding
- **Δknowledge Metric**: Innovative measurement of normalized learning gains
- **Multi-disciplinary Coverage**: Spans six professional fields ensuring comprehensive evaluation
- **Three-Stage Assessment**: Systematic evaluation of perception, comprehension, and adaptation
- **Educational AI Development**: Drives progress in educational technology applications
Dataset Composition
Professional Disciplines
VideoMMMU covers six major professional fields with 30 subjects[2]:
| Discipline | Number of Subjects | Example Topics | Video Count |
|---|---|---|---|
| **Art** | 5 | Art history, Music theory, Film studies | 50 |
| **Business** | 5 | Economics, Finance, Marketing, Management | 50 |
| **Science** | 5 | Physics, Chemistry, Biology, Computer Science | 50 |
| **Medicine** | 5 | Anatomy, Pathology, Pharmacology, Clinical Medicine | 50 |
| **Humanities** | 5 | History, Philosophy, Literature, Psychology | 50 |
| **Engineering** | 5 | Mechanical, Electrical, Civil, Software Engineering | 50 |
Video Characteristics
The benchmark's videos are carefully selected for educational quality:
| Characteristic | Specification | Purpose |
|---|---|---|
| **Content Type** | Expert-level lectures | Professional knowledge transfer |
| **Duration Range** | Variable (short clips to full lectures) | Diverse learning scenarios |
| **Production Quality** | High-quality educational content | Clear knowledge presentation |
| **Language** | English with clear narration | Accessibility |
| **Visual Elements** | Slides, diagrams, demonstrations | Multi-modal learning |
| **Audio Quality** | Professional recording | Clear explanation |
Question Design
Each video includes three carefully crafted questions:
| Question Type | Focus | Example |
|---|---|---|
| **Factual** | Direct information extraction | "What is the formula presented at 2:35?" |
| **Conceptual** | Understanding principles | "Why does this phenomenon occur?" |
| **Application** | Knowledge transfer | "How would this apply to a different scenario?" |
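For illustration, a single question record might be represented as in the following sketch; the field names and values are hypothetical, not the benchmark's published schema.

```python
# Hypothetical sketch of a VideoMMMU-style question record.
# Field names are illustrative; consult the official dataset for the real schema.
question_record = {
    "video_id": "engineering_042",        # hypothetical identifier
    "discipline": "Engineering",
    "question_type": "Application",       # Factual / Conceptual / Application
    "question": "How would this principle apply to a different scenario?",
    "options": ["A) ...", "B) ...", "C) ...", "D) ..."],
    "answer": "B",                        # hypothetical
}
```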
Evaluation Framework
Three-Stage Knowledge Assessment
VideoMMMU employs a revolutionary three-stage evaluation framework[1]:
| Stage | Cognitive Level | Assessment Focus | Example Task |
|---|---|---|---|
| **Perception** | Basic Processing | Information identification | "Identify the key terms mentioned" |
| **Comprehension** | Understanding | Concept integration | "Explain the relationship between X and Y" |
| **Adaptation** | Application | Knowledge transfer | "Apply this principle to a new problem" |
The Δknowledge Metric
The benchmark introduces a groundbreaking metric for measuring learning efficiency:
**Formula**:
```
Δknowledge = (Acc_after_video − Acc_before_video) / (100% − Acc_before_video) × 100%
```
| Component | Description | Interpretation |
|---|---|---|
| **Acc_before_video** | Accuracy without watching the video | Prior knowledge baseline |
| **Acc_after_video** | Accuracy after watching the video | Post-learning performance |
| **Normalization** | Accounts for prior knowledge | True learning gain |
| **Result** | Percentage of potential learning achieved | Learning efficiency |
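The metric can be computed directly from the two accuracies. The following minimal Python sketch does so, with accuracies expressed as percentages; the function name `calculate_delta` matches the helper used in the evaluation pipeline sketch later in this article, but it is an illustration, not the benchmark's official implementation.

```python
def calculate_delta(acc_before: float, acc_after: float) -> float:
    """Normalized learning gain (Δknowledge); accuracies are percentages (0-100)."""
    if acc_before >= 100.0:
        return 0.0  # no headroom left to learn; avoids division by zero
    return (acc_after - acc_before) / (100.0 - acc_before) * 100.0

# Worked example: 40% accuracy before the video, 55% after:
# Δknowledge = (55 - 40) / (100 - 40) × 100% = 25%
print(calculate_delta(40.0, 55.0))  # 25.0
```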
Evaluation Protocol
The evaluation follows a structured protocol:
1. **Baseline Assessment**: Model answers questions without video access
2. **Video Presentation**: Model watches the educational video
3. **Post-Video Assessment**: Model answers the same questions after viewing
4. **Δknowledge Calculation**: Compute normalized learning gain
5. **Multi-run Averaging**: Multiple evaluations for statistical reliability
Performance Analysis
Current Leaderboard
Note: Performance scores vary between sources (the GitHub repository and the HuggingFace leaderboard). The scores below are taken from the GitHub repository; for comparison, the HuggingFace dataset page reports Human Expert at 74.44% and Claude-3.5-Sonnet at 65.78%.
| Rank | Model | Overall Accuracy | Δknowledge | Perception | Comprehension | Adaptation |
|---|---|---|---|---|---|---|
| 1 | GPT-5-thinking | 84.6% | 28.3% | 91.2% | 85.5% | 77.1% |
| 2 | Gemini-2.5-Pro | 83.6% | 25.7% | 90.8% | 84.3% | 75.7% |
| 3 | Claude-3.5-Sonnet | 65.78% | 11.4% | 78.3% | 66.2% | 52.8% |
| 4 | GPT-4o | 62.5% | 15.6% | 75.2% | 63.1% | 49.2% |
| 5 | Qwen-VL-Max | 58.3% | 8.2% | 71.5% | 58.7% | 44.7% |
Performance Insights
Analysis reveals critical patterns in model capabilities[1]:
| Finding | Implication | Research Direction |
|---|---|---|
| **Cognitive Decline** | Performance drops with complexity | Better reasoning architectures needed |
| **Limited Learning** | Low Δknowledge scores | Improved knowledge acquisition methods |
| **Discipline Variance** | Science/Medicine harder than Humanities | Domain-specific optimization required |
| **Perception vs Application Gap** | ~40% drop from perception to adaptation | Enhanced knowledge transfer mechanisms |
Discipline-Specific Performance
Models show varying success across disciplines:
| Discipline | Average Accuracy | Δknowledge | Challenge Level |
|---|---|---|---|
| **Business** | 72.3% | 18.5% | Medium |
| **Humanities** | 68.7% | 16.2% | Medium |
| **Art** | 65.4% | 14.8% | Medium-High |
| **Engineering** | 61.2% | 12.3% | High |
| **Science** | 58.9% | 10.7% | High |
| **Medicine** | 55.3% | 9.1% | Very High |
Technical Implementation
Integration with LMMs-Eval
VideoMMMU is built on the LMMs-Eval framework[3]:
| Component | Implementation | Purpose |
|---|---|---|
| **Framework** | LMMs-Eval | Standardized evaluation |
| **Installation** | `pip install lmms-eval` | Easy setup |
| **Execution** | `accelerate launch -m lmms_eval` | Distributed evaluation |
| **Model Support** | Multiple architectures | Broad compatibility |
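Building on the commands in the table above, a typical run might look like the following sketch; the model name and the `videommmu` task identifier are assumptions to verify against the lmms-eval task registry.

```bash
pip install lmms-eval

# Hypothetical invocation; the --model and --tasks values are assumptions.
accelerate launch -m lmms_eval \
    --model llava_onevision \
    --tasks videommmu \
    --batch_size 1 \
    --output_path ./logs/
```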
Dataset Access
The dataset is available through multiple channels:
| Platform | Access Method | Requirements |
|---|---|---|
| **HuggingFace** | Direct download | Agreement to terms |
| **GitHub** | Links to videos | Respect creator rights |
| **Official Website** | Sample browser | Public access |
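For direct inspection, the HuggingFace copy can presumably be loaded with the `datasets` library, as in the sketch below; the repository id `lmms-lab/VideoMMMU` is an assumption to confirm on the dataset page, and gated access may first require `huggingface-cli login`.

```python
from datasets import load_dataset

# Repository id is an assumption; confirm it on the HuggingFace dataset page.
ds = load_dataset("lmms-lab/VideoMMMU")
print(ds)
```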
Evaluation Pipeline
```python
# Simplified evaluation process
def evaluate_videommmu(model, video, questions):
    # Stage 1: Baseline assessment without video access
    baseline_acc = model.answer_questions(questions)
    # Stage 2: Model watches the educational video
    model.process_video(video)
    # Stage 3: Post-video assessment on the same questions
    post_video_acc = model.answer_questions(questions)
    # Calculate the normalized learning gain (Δknowledge)
    delta_knowledge = calculate_delta(baseline_acc, post_video_acc)
    return delta_knowledge
```
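The `calculate_delta` helper referenced here corresponds to the Δknowledge formula; a minimal sketch of it is given in the metric section above.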
Research Impact
Influence on Multi-modal Research
VideoMMMU has catalyzed several research directions:
| Area | Impact | Active Research |
|---|---|---|
| **Educational AI** | New evaluation standards | Learning-optimized architectures |
| **Video Understanding** | Beyond comprehension to learning | Knowledge extraction methods |
| **Multi-modal Learning** | Integration of learning metrics | Cross-modal knowledge transfer |
| **Cognitive Modeling** | Three-stage assessment adoption | Hierarchical reasoning systems |
Related Benchmarks
| Benchmark | Focus | Key Difference from VideoMMMU |
|---|---|---|
| Video-MME | Comprehensive video analysis | Breadth vs knowledge acquisition |
| MVBench | Temporal dynamics | Motion vs learning |
| LVBench | Long video understanding | Duration vs education |
| EgoSchema | Egocentric video understanding | Perspective vs knowledge |
| Video-ChatGPT | Video conversation | Dialogue vs learning assessment |
Applications and Use Cases
Educational Technology
VideoMMMU enables development of:
- **Intelligent Tutoring Systems**: AI that learns from educational content
- **Adaptive Learning Platforms**: Systems that acquire domain knowledge
- **Content Understanding Tools**: Automated lecture summarization with learning
- **Professional Training AI**: Systems for specialized education
Industry Applications
| Industry | Application | Benefit |
|---|---|---|
| **EdTech** | Learning assessment platforms | Better student evaluation |
| **Corporate Training** | Professional development systems | Efficient knowledge transfer |
| **Healthcare Education** | Medical training AI | Specialized learning support |
| **Online Education** | MOOC enhancement | Improved learning outcomes |
Limitations and Future Work
Current Limitations
| Limitation | Description | Impact |
|---|---|---|
| **English Only** | Single language support | Limited global reach |
| **Domain Coverage** | Six disciplines | May miss specialized fields |
| **Question Quantity** | Three per video | Statistical limitations |
| **Video Sources** | Web-based content | Quality variation |
| **Evaluation Cost** | Compute-intensive | Accessibility issues |
Future Directions
Planned improvements include:
1. **Multilingual Expansion**: Support for 10+ languages
2. **Interactive Learning**: Multi-turn educational dialogue
3. **Personalized Assessment**: Adaptive difficulty based on performance
4. **Real-time Learning**: Continuous knowledge acquisition evaluation
5. **Cross-modal Transfer**: Learning from mixed media sources
Conclusion
VideoMMMU represents a crucial advancement in evaluating AI systems' ability to learn from educational content, addressing a fundamental gap in current multi-modal benchmarks. By introducing the Δknowledge metric and three-stage cognitive assessment, it provides the first systematic framework for measuring knowledge acquisition from videos. The benchmark's revelation that even advanced models achieve limited learning gains highlights the significant challenges remaining in developing AI systems capable of human-like learning from educational content.
As educational technology becomes increasingly important and video content dominates online learning, VideoMMMU provides essential infrastructure for developing and evaluating AI systems that can truly learn from educational materials. Its multi-disciplinary approach and focus on knowledge transfer make it an indispensable tool for advancing educational AI and creating more capable learning systems.
See Also
- Video Understanding
- Multi-modal Learning
- Educational AI
- Large Multimodal Models
- Knowledge Acquisition
- Video-MME
- MVBench
- LMMs-Eval
References
1. Hu, K., Wu, P., Pu, F., Xiao, W., Zhang, Y., Yue, X., Li, B., & Liu, Z. (2025). "Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos". arXiv:2501.13826. https://arxiv.org/abs/2501.13826
2. EvolvingLMMs-Lab. (2025). "VideoMMMU: Video Multi-Modal Multi-disciplinary Understanding". Official website. https://videommmu.github.io/
3. EvolvingLMMs-Lab. (2025). "VideoMMMU Repository". GitHub. https://github.com/EvolvingLMMs-Lab/VideoMMMU