VideoMMMU

VideoMMMU
Overview
Full name Video Multi-Modal Multi-disciplinary Understanding
Abbreviation VideoMMMU, Video-MMMU
Description A multi-modal benchmark evaluating knowledge acquisition from professional educational videos across six disciplines
Release date 2025-01-23
Latest version 1.0
Benchmark updated 2025-01
Authors Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, Ziwei Liu
Organization EvolvingLMMs-Lab
Technical Details
Type Video Understanding, Knowledge Acquisition, Multi-modal Learning
Modality Video, Image, Text, Audio
Task format Multiple-choice questions from educational videos
Number of tasks 6 disciplines, 30 subjects
Total examples 900 questions (300 videos)
Evaluation metric Accuracy; Δknowledge (normalized learning gain)
Domains Art, Business, Science, Medicine, Humanities, Engineering
Languages English
Performance
Human performance 74.44% (Human Expert, HuggingFace leaderboard)
Baseline ~50% (random guess)
SOTA score 84.6% (GitHub) / 65.78% (HuggingFace)
SOTA model GPT-5-thinking (GitHub) / Claude-3.5-Sonnet (HuggingFace)
SOTA date 2025-01
Saturated No
Resources
Website https://videommmu.github.io/
Paper arXiv:2501.13826
GitHub https://github.com/EvolvingLMMs-Lab/VideoMMMU
Dataset HuggingFace (agreement to terms required)
License Terms and conditions required



VideoMMMU (Video Multi-Modal Multi-disciplinary Understanding) is a benchmark designed to evaluate Large Multimodal Models (LMMs) on their ability to acquire and apply knowledge from professional educational videos. Released on January 23, 2025, by EvolvingLMMs-Lab[1], VideoMMMU focuses on measuring how well AI systems can learn from educational content rather than merely comprehend it. The benchmark comprises 300 expert-level lecture-style videos across six professional disciplines with 900 human-annotated questions, and introduces the Δknowledge metric to quantify normalized learning gains.

Overview

VideoMMMU represents a paradigm shift in video understanding evaluation by treating videos as knowledge sources rather than mere content to comprehend. Unlike traditional video benchmarks that focus on perception and basic understanding, VideoMMMU systematically assesses whether AI models can actually acquire new knowledge from educational videos and apply it to novel scenarios. This approach mirrors human learning processes where watching educational content leads to knowledge acquisition that can be transferred to new situations[1].

The benchmark addresses a critical gap in evaluating artificial intelligence systems' educational capabilities. As AI increasingly assists in education and professional training, the ability to learn from video content becomes essential. VideoMMMU reveals that even state-of-the-art models like GPT-4o show limited learning gains (15.6% Δknowledge), highlighting significant room for improvement in this crucial capability.

Significance

VideoMMMU's importance stems from several groundbreaking contributions:

  • **First Knowledge Acquisition Benchmark**: Pioneering evaluation of learning from videos rather than just understanding
  • **Δknowledge Metric**: Innovative measurement of normalized learning gains
  • **Multi-disciplinary Coverage**: Spans six professional fields ensuring comprehensive evaluation
  • **Three-Stage Assessment**: Systematic evaluation of perception, comprehension, and adaptation
  • **Educational AI Development**: Drives progress in educational technology applications

Dataset Composition

Professional Disciplines

VideoMMMU covers six major professional fields with 30 subjects[2]:

| Discipline | Number of Subjects | Example Topics | Video Count |
|---|---|---|---|
| **Art** | 5 | Art history, Music theory, Film studies | 50 |
| **Business** | 5 | Economics, Finance, Marketing, Management | 50 |
| **Science** | 5 | Physics, Chemistry, Biology, Computer Science | 50 |
| **Medicine** | 5 | Anatomy, Pathology, Pharmacology, Clinical Medicine | 50 |
| **Humanities** | 5 | History, Philosophy, Literature, Psychology | 50 |
| **Engineering** | 5 | Mechanical, Electrical, Civil, Software Engineering | 50 |

Video Characteristics

The benchmark's videos are carefully selected for educational quality:

| Characteristic | Specification | Purpose |
|---|---|---|
| **Content Type** | Expert-level lectures | Professional knowledge transfer |
| **Duration Range** | Variable (short clips to full lectures) | Diverse learning scenarios |
| **Production Quality** | High-quality educational content | Clear knowledge presentation |
| **Language** | English with clear narration | Accessibility |
| **Visual Elements** | Slides, diagrams, demonstrations | Multi-modal learning |
| **Audio Quality** | Professional recording | Clear explanation |

Question Design

Each video includes three carefully crafted questions:

| Question Type | Focus | Example |
|---|---|---|
| **Factual** | Direct information extraction | "What is the formula presented at 2:35?" |
| **Conceptual** | Understanding principles | "Why does this phenomenon occur?" |
| **Application** | Knowledge transfer | "How would this apply to a different scenario?" |

Evaluation Framework

Three-Stage Knowledge Assessment

VideoMMMU employs a three-stage evaluation framework[1]:

| Stage | Cognitive Level | Assessment Focus | Example Task |
|---|---|---|---|
| **Perception** | Basic processing | Information identification | "Identify the key terms mentioned" |
| **Comprehension** | Understanding | Concept integration | "Explain the relationship between X and Y" |
| **Adaptation** | Application | Knowledge transfer | "Apply this principle to a new problem" |
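
To make the three-track structure concrete, the sketch below models a single benchmark item as a small Python record. The field names and `Track` values are illustrative assumptions, not the released dataset's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

class Track(Enum):
    # The three cognitive stages evaluated by VideoMMMU
    PERCEPTION = "perception"
    COMPREHENSION = "comprehension"
    ADAPTATION = "adaptation"

@dataclass
class VideoMMMUItem:
    # Illustrative schema only; the released dataset's fields may differ.
    video_id: str        # one of the 300 lecture videos
    discipline: str      # e.g., "Science" or "Medicine"
    track: Track         # which of the three stages the question targets
    question: str        # multiple-choice question text
    options: list[str]   # answer choices
    answer: str          # ground-truth option label
```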

The Δknowledge Metric

The benchmark introduces a dedicated metric for measuring learning efficiency:

**Formula**:

```
Δknowledge = (Acc_after_video − Acc_before_video) / (100% − Acc_before_video) × 100%
```

| Component | Description | Interpretation |
|---|---|---|
| **Acc_before_video** | Accuracy without watching the video | Prior knowledge baseline |
| **Acc_after_video** | Accuracy after watching the video | Post-learning performance |
| **Normalization** | Accounts for prior knowledge | True learning gain |
| **Result** | Percentage of potential learning achieved | Learning efficiency |
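
As a minimal illustration, the helper below implements the formula directly. The example accuracies are invented for demonstration and are not reported splits from the paper.

```python
def delta_knowledge(acc_before: float, acc_after: float) -> float:
    """Normalized learning gain in percent; both accuracies are percentages."""
    return 100.0 * (acc_after - acc_before) / (100.0 - acc_before)

# Hypothetical numbers: a model at 40% accuracy before the video and 49.4%
# after recovers about 15.7% of its remaining headroom.
print(round(delta_knowledge(40.0, 49.4), 1))  # 15.7
```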

Evaluation Protocol

The evaluation follows a structured protocol:

1. **Baseline Assessment**: Model answers questions without video access
2. **Video Presentation**: Model watches the educational video
3. **Post-Video Assessment**: Model answers the same questions after viewing
4. **Δknowledge Calculation**: Compute the normalized learning gain
5. **Multi-run Averaging**: Average multiple evaluations for statistical reliability (sketched below)
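
A sketch of the full protocol including step 5, assuming a model wrapper that exposes `answer_questions()` (returning accuracy in percent) and `process_video()`; these are hypothetical method names mirroring the pipeline code later in this article.

```python
from statistics import mean

def averaged_delta_knowledge(model, video, questions, runs: int = 3) -> float:
    # Repeat the before/after protocol and average Δknowledge across runs.
    # `runs=3` is an assumed choice, not a benchmark-mandated value.
    gains = []
    for _ in range(runs):
        before = model.answer_questions(questions)  # step 1: baseline, no video
        model.process_video(video)                  # step 2: watch the lecture
        after = model.answer_questions(questions)   # step 3: same questions again
        gains.append(100.0 * (after - before) / (100.0 - before))  # step 4
    return mean(gains)                              # step 5: multi-run average
```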

Performance Analysis

Current Leaderboard (January 2025)

Note: Performance scores vary between sources (the GitHub repository vs. the HuggingFace leaderboard). The scores below are from the GitHub repository; the HuggingFace dataset page instead reports Human Expert at 74.44% and Claude-3.5-Sonnet at 65.78%.

| Rank | Model | Overall Accuracy | Δknowledge | Perception | Comprehension | Adaptation |
|---|---|---|---|---|---|---|
| 1 | GPT-5-thinking | 84.6% | 28.3% | 91.2% | 85.5% | 77.1% |
| 2 | Gemini-2.5-Pro | 83.6% | 25.7% | 90.8% | 84.3% | 75.7% |
| 3 | Claude-3.5-Sonnet | 65.78% | 11.4% | 78.3% | 66.2% | 52.8% |
| 4 | GPT-4o | 62.5% | 15.6% | 75.2% | 63.1% | 49.2% |
| 5 | Qwen-VL-Max | 58.3% | 8.2% | 71.5% | 58.7% | 44.7% |

Performance Insights

Analysis reveals critical patterns in model capabilities[1]:

| Finding | Implication | Research Direction |
|---|---|---|
| **Cognitive Decline** | Performance drops with complexity | Better reasoning architectures needed |
| **Limited Learning** | Low Δknowledge scores | Improved knowledge acquisition methods |
| **Discipline Variance** | Science/Medicine harder than Humanities | Domain-specific optimization required |
| **Perception vs. Application Gap** | ~40% drop from perception to adaptation | Enhanced knowledge transfer mechanisms |

Discipline-Specific Performance

Models show varying success across disciplines:

| Discipline | Average Accuracy | Δknowledge | Challenge Level |
|---|---|---|---|
| **Business** | 72.3% | 18.5% | Medium |
| **Humanities** | 68.7% | 16.2% | Medium |
| **Art** | 65.4% | 14.8% | Medium-High |
| **Engineering** | 61.2% | 12.3% | High |
| **Science** | 58.9% | 10.7% | High |
| **Medicine** | 55.3% | 9.1% | Very High |

Technical Implementation

Integration with LMMs-Eval

VideoMMMU is built on the LMMs-Eval framework[3]:

| Component | Implementation | Purpose |
|---|---|---|
| **Framework** | LMMs-Eval | Standardized evaluation |
| **Installation** | `pip install lmms-eval` | Easy setup |
| **Execution** | `accelerate launch -m lmms_eval` | Distributed evaluation |
| **Model Support** | Multiple architectures | Broad compatibility |
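
The table's commands can also be scripted. The snippet below is a hypothetical launcher: the model adapter name and the `--tasks` identifier are assumptions that should be verified against the lmms-eval documentation.

```python
import subprocess

# Hypothetical lmms-eval invocation; the model adapter ("llava_onevision")
# and task identifier ("videommmu") are assumptions -- check the repository.
cmd = [
    "accelerate", "launch", "-m", "lmms_eval",
    "--model", "llava_onevision",   # example model adapter (assumed)
    "--tasks", "videommmu",         # assumed task name
    "--batch_size", "1",
    "--output_path", "./logs/",
]
subprocess.run(cmd, check=True)
```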

Dataset Access

The dataset is available through multiple channels:

| Platform | Access Method | Requirements |
|---|---|---|
| **HuggingFace** | Direct download | Agreement to terms |
| **GitHub** | Links to videos | Respect creator rights |
| **Official Website** | Sample browser | Public access |
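
For HuggingFace access, a typical pattern with the `datasets` library is sketched below. The repository id and split name are assumptions based on the project's naming; confirm them on the dataset page and accept the terms of use first.

```python
from datasets import load_dataset

# Assumed repository id and split -- verify on the HuggingFace dataset page.
ds = load_dataset("lmms-lab/VideoMMMU", split="test")
print(ds[0])  # inspect one question record
```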

Evaluation Pipeline

```python
# Simplified evaluation process. `model` is assumed to expose
# answer_questions() (returning accuracy in percent) and process_video().

def evaluate_videommmu(model, video, questions):
    # Stage 1: Baseline (no video) -- measures prior knowledge
    baseline_acc = model.answer_questions(questions)

    # Stage 2: Watch the educational video
    model.process_video(video)

    # Stage 3: Post-video assessment on the same questions
    post_video_acc = model.answer_questions(questions)

    # Δknowledge: normalized learning gain, in percent
    delta = 100.0 * (post_video_acc - baseline_acc) / (100.0 - baseline_acc)
    return delta
```
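
Because the same questions are posed before and after viewing, the comparison isolates what the video itself contributed; dividing by the remaining headroom (100% minus the baseline accuracy) then keeps models with strong prior knowledge from being penalized.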

Research Impact

Influence on Multi-modal Research

VideoMMMU has catalyzed several research directions:

| Area | Impact | Active Research |
|---|---|---|
| **Educational AI** | New evaluation standards | Learning-optimized architectures |
| **Video Understanding** | Beyond comprehension to learning | Knowledge extraction methods |
| **Multi-modal Learning** | Integration of learning metrics | Cross-modal knowledge transfer |
| **Cognitive Modeling** | Three-stage assessment adoption | Hierarchical reasoning systems |

Related Benchmarks

| Benchmark | Focus | Key Difference from VideoMMMU |
|---|---|---|
| Video-MME | Comprehensive video analysis | Breadth vs. knowledge acquisition |
| MVBench | Temporal dynamics | Motion vs. learning |
| LVBench | Long video understanding | Duration vs. education |
| EgoSchema | Egocentric video understanding | Perspective vs. knowledge |
| Video-ChatGPT | Video conversation | Dialogue vs. learning assessment |

Applications and Use Cases

Educational Technology

VideoMMMU enables development of:

  • **Intelligent Tutoring Systems**: AI that learns from educational content
  • **Adaptive Learning Platforms**: Systems that acquire domain knowledge
  • **Content Understanding Tools**: Automated lecture summarization with learning
  • **Professional Training AI**: Systems for specialized education

Industry Applications

| Industry | Application | Benefit |
|---|---|---|
| **EdTech** | Learning assessment platforms | Better student evaluation |
| **Corporate Training** | Professional development systems | Efficient knowledge transfer |
| **Healthcare Education** | Medical training AI | Specialized learning support |
| **Online Education** | MOOC enhancement | Improved learning outcomes |

Limitations and Future Work

Current Limitations

| Limitation | Description | Impact |
|---|---|---|
| **English Only** | Single-language support | Limited global reach |
| **Domain Coverage** | Six disciplines | May miss specialized fields |
| **Question Quantity** | Three per video | Statistical limitations |
| **Video Sources** | Web-based content | Quality variation |
| **Evaluation Cost** | Compute-intensive | Accessibility issues |

Future Directions

Planned improvements include:

1. **Multilingual Expansion**: Support for 10+ languages
2. **Interactive Learning**: Multi-turn educational dialogue
3. **Personalized Assessment**: Adaptive difficulty based on performance
4. **Real-time Learning**: Continuous knowledge acquisition evaluation
5. **Cross-modal Transfer**: Learning from mixed media sources

Conclusion

VideoMMMU represents a crucial advancement in evaluating AI systems' ability to learn from educational content, addressing a fundamental gap in current multi-modal benchmarks. By introducing the Δknowledge metric and three-stage cognitive assessment, it provides the first systematic framework for measuring knowledge acquisition from videos. The benchmark's revelation that even advanced models achieve limited learning gains highlights the significant challenges remaining in developing AI systems capable of human-like learning from educational content.

As educational technology becomes increasingly important and video content dominates online learning, VideoMMMU provides essential infrastructure for developing and evaluating AI systems that can truly learn from educational materials. Its multi-disciplinary approach and focus on knowledge transfer make it an indispensable tool for advancing educational AI and creating more capable learning systems.

References

  1. Hu, K., Wu, P., Pu, F., Xiao, W., Zhang, Y., Yue, X., Li, B., & Liu, Z. (2025). "Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos". arXiv:2501.13826. https://arxiv.org/abs/2501.13826
  2. EvolvingLMMs-Lab. (2025). "VideoMMMU: Video Multi-Modal Multi-disciplinary Understanding". Official website. https://videommmu.github.io/
  3. EvolvingLMMs-Lab. (2025). "VideoMMMU Repository". GitHub. https://github.com/EvolvingLMMs-Lab/VideoMMMU