Video-MMMU
Last reviewed
May 10, 2026
Sources
6 citations
Review status
Source-backed
Revision
v2 · 2,481 words
| Video-MMMU | |
|---|---|
| Overview | |
| Full name | Video Multi-Modal Multi-disciplinary Understanding |
| Abbreviation | Video-MMMU, VideoMMMU |
| Description | A multi-modal benchmark that evaluates how Large Multimodal Models acquire and apply knowledge from professional educational videos across six disciplines |
| Release date | 2025-01-23 (arXiv v1) |
| Latest version | 1.0 |
| Authors | Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, Ziwei Liu |
| Affiliations | S-Lab, Nanyang Technological University; Carnegie Mellon University |
| Maintainer | EvolvingLMMs-Lab |
| Technical Details | |
| Type | Video understanding, knowledge acquisition, multi-modal learning |
| Modality | Video, image, text, audio |
| Task format | Multiple-choice and open-ended questions tied to lecture-style videos |
| Number of disciplines | 6 disciplines, 30 subjects |
| Total examples | 300 videos, 900 questions (3 per video) |
| Average video length | About 506 seconds (roughly 8.4 minutes) |
| Evaluation metrics | Accuracy, Δknowledge (normalized learning gain) |
| Domains | Art, Business, Science, Medicine, Humanities, Engineering |
| Languages | English |
| Performance | |
| Human expert (overall) | 74.44% |
| Random baseline | 14.00% |
| Human Δknowledge | 33.1% |
| Best model in paper | Claude-3.5-Sonnet, 65.78% overall, +11.4% Δknowledge |
| GPT-4o (paper) | 61.22% overall, +15.6% Δknowledge |
| Latest leaderboard top | GPT-5-thinking at 84.6% (GitHub leaderboard, 2026) |
| Saturated | No |
| Resources | |
| Website | videommmu.github.io |
| Paper | arXiv:2501.13826 |
| GitHub | EvolvingLMMs-Lab/VideoMMMU |
| Dataset | HuggingFace |
| Evaluation framework | LMMs-Eval |
Video-MMMU (Video Multi-Modal Multi-disciplinary Understanding, sometimes written VideoMMMU) is a benchmark that measures whether Large Multimodal Models can acquire new knowledge from professional educational videos and apply it to novel problems. It was introduced by Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu in a paper posted to arXiv on January 23, 2025[1]. Seven of the authors are from S-Lab at Nanyang Technological University in Singapore; Xiang Yue is from Carnegie Mellon University. The benchmark is maintained as part of the EvolvingLMMs-Lab project on GitHub[2].
Video-MMMU pairs 300 college-level lecture videos with 900 human-annotated questions across six disciplines (Art, Business, Science, Medicine, Humanities, and Engineering) spread over 30 subjects. Each video has three questions that test progressively harder cognitive abilities: perception, comprehension, and adaptation. The paper introduces Δknowledge, a normalized learning-gain metric: the share of the headroom above the pre-video baseline that a model closes after watching the video[1].
The benchmark builds on the MMMU family of evaluations, which started with a static-image benchmark covering college-level material across many disciplines. Video-MMMU keeps the multi-discipline structure of MMMU but moves the modality from images to lecture videos. The authors argue that prior video benchmarks (such as Video-MME, MVBench, LVBench, and EgoSchema) mostly test perception or temporal reasoning rather than learning[1].
The team frames its design around a simple cognitive model: people perceive information, comprehend underlying concepts, and adapt that knowledge to new problems. Video-MMMU encodes that loop into the question structure for every video.
Video-MMMU has 50 videos per discipline, with five subjects per discipline and ten videos per subject[1].
| Discipline | Videos | Example topics |
|---|---|---|
| Art | 50 | Art history, art theory, design, music, film |
| Business | 50 | Economics, finance, accounting, management, marketing |
| Science | 50 | Physics, chemistry, biology, computer science, mathematics |
| Medicine | 50 | Anatomy, pathology, pharmacology, clinical, public health |
| Humanities | 50 | History, philosophy, literature, psychology, sociology |
| Engineering | 50 | Mechanical, electrical, civil, software, materials |
The authors collected lecture-style videos from publicly available educational sources. The average duration is about 506.2 seconds, roughly 8.4 minutes per video[1]. Videos fall into two broad styles: concept-introduction videos that explain a topic, and problem-solving videos that demonstrate a worked example step by step. Comprehension and adaptation questions look different depending on which style a video uses.
Each video has three questions, one per cognitive stage[1].
| Stage | Question form | Sub-types in the paper |
|---|---|---|
| Perception | Multiple choice | OCR (visual text), ASR (spoken content) |
| Comprehension | Multiple choice (4 to 10 options) | Concept comprehension, problem-solving strategy comprehension |
| Adaptation | Mixed (multiple choice and open-ended) | Case study analysis, problem-solving adaptation |
The average question is 75.7 words long, reflecting that adaptation questions often introduce a new scenario in the prompt[1]. Annotators were instructed to make adaptation questions solvable only with knowledge presented in the video.
A raw post-video accuracy score conflates what the model knew before the video with what it picked up from watching. A frontier model with strong pretraining can answer many questions cold, so its post-video score will look high regardless of whether the video helped. Δknowledge separates those two effects[1].
The formula is:
Δknowledge = (Acc_after - Acc_before) / (100% - Acc_before) × 100%
| Term | Meaning |
|---|---|
| Acc_before | Accuracy on adaptation questions when the model is shown only the question, not the video |
| Acc_after | Accuracy on the same adaptation questions when the model is shown the video plus the question |
| 100% - Acc_before | Headroom for improvement after the baseline |
| Result | Percentage of the available headroom that the model actually captures |
The metric is reported on the adaptation track only, because perception and comprehension questions are unanswerable without the video[1].
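As a minimal sketch of the formula above, the function below computes Δknowledge from two accuracies given in percent. The function name `delta_knowledge` and the example accuracies are illustrative assumptions, not code or figures from the paper.

```python
def delta_knowledge(acc_before: float, acc_after: float) -> float:
    """Normalized learning gain in percent; both inputs are accuracies in percent."""
    if not (0.0 <= acc_before < 100.0):
        # At a 100% baseline there is no headroom, so the ratio is undefined.
        raise ValueError("acc_before must be in [0, 100)")
    if not (0.0 <= acc_after <= 100.0):
        raise ValueError("acc_after must be in [0, 100]")
    return (acc_after - acc_before) / (100.0 - acc_before) * 100.0


# Illustrative (made-up) accuracies, not figures from the paper.
gain = delta_knowledge(acc_before=40.0, acc_after=60.0)  # closes a third of the headroom
loss = delta_knowledge(acc_before=50.0, acc_after=45.0)  # accuracy fell, so the value is negative
print(f"{gain:.1f}% {loss:.1f}%")
```

Because the denominator is the headroom rather than the baseline itself, a model that starts high has less room to gain, and the same absolute improvement yields a larger Δknowledge.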
Human experts averaged 33.1% Δknowledge[1]. After watching the video, they closed about a third of the gap between baseline accuracy and a perfect adaptation score. Model numbers are well below this. GPT-4o reached 15.6%, Claude-3.5-Sonnet 11.4%, VILA-1.5-40B 9.4%, and Gemini 1.5 Pro 8.7%. Two open-source models, LongVA and InternVL2-8B, posted negative Δknowledge values (-7.0% and -8.5%), meaning accuracy dropped after they saw the video[1]. The authors read negative deltas as the extra context distracting the model rather than helping it.
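The protocol defines two passes: a pre-video pass, in which the model sees only the question, and a post-video pass, in which it sees the video together with the question[1].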
For perception and comprehension questions, only the post-video pass is scored. For adaptation questions, both passes are scored and the difference feeds the Δknowledge calculation. The code is integrated with LMMs-Eval[3], so a typical run uses accelerate launch -m lmms_eval with the videommmu task.
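The scoring rule described above can be sketched roughly as follows. The record layout (`stage`, `correct_before`, `correct_after`) is a hypothetical schema chosen for illustration; it is not the LMMs-Eval output format, and the four records are made-up data.

```python
from collections import defaultdict

# Hypothetical per-question results; only adaptation items carry a pre-video result.
records = [
    {"stage": "perception", "correct_after": True},
    {"stage": "comprehension", "correct_after": False},
    {"stage": "adaptation", "correct_before": False, "correct_after": True},
    {"stage": "adaptation", "correct_before": True, "correct_after": True},
]


def accuracy(flags):
    """Percentage of True values in an iterable of booleans."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags)


def score(rows):
    by_stage = defaultdict(list)
    for row in rows:
        by_stage[row["stage"]].append(row)

    # Every track is scored on the post-video pass.
    report = {
        stage: accuracy(r["correct_after"] for r in items)
        for stage, items in by_stage.items()
    }

    # Only the adaptation track also uses the pre-video pass.
    adaptation = by_stage.get("adaptation", [])
    if adaptation:
        before = accuracy(r["correct_before"] for r in adaptation)
        after = report["adaptation"]
        if before < 100.0:
            report["delta_knowledge"] = (after - before) / (100.0 - before) * 100.0
    return report


report = score(records)
print(report)
```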
The v1 paper reported results for around ten models on the full 900-question benchmark[1].
| Model | Overall | Perception | Comprehension | Adaptation |
|---|---|---|---|---|
| Human expert | 74.44 | 84.33 | 78.67 | 60.33 |
| Claude-3.5-Sonnet | 65.78 | 72.00 | 69.67 | 55.67 |
| GPT-4o | 61.22 | 66.00 | 62.00 | 55.67 |
| Gemini 1.5 Pro | 53.89 | 59.00 | 53.33 | 49.33 |
| Aria | 50.78 | 65.67 | 46.67 | 40.00 |
| LLaVA-OneVision-72B | 48.33 | 59.67 | 42.33 | 43.00 |
| Random choice | 14.00 | 12.00 | 14.00 | 16.00 |
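The general pattern is that perception is easier than comprehension, and comprehension is easier than adaptation. LLaVA-OneVision-72B is a small exception in the table above, scoring 43.00 on adaptation against 42.33 on comprehension. The drop from perception to adaptation ranges from about 10 points (GPT-4o, Gemini 1.5 Pro) to about 26 points (Aria), consistent with the authors' claim that knowledge transfer is the hard part.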
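Since each track has one question per video (300 questions each), the overall score should equal the unweighted mean of the three track scores. The short check below, a sketch using only the rows from the table above, confirms this to within rounding and also computes each row's perception-to-adaptation drop.

```python
# (overall, perception, comprehension, adaptation), copied from the table above.
rows = {
    "Human expert": (74.44, 84.33, 78.67, 60.33),
    "Claude-3.5-Sonnet": (65.78, 72.00, 69.67, 55.67),
    "GPT-4o": (61.22, 66.00, 62.00, 55.67),
    "Gemini 1.5 Pro": (53.89, 59.00, 53.33, 49.33),
    "Aria": (50.78, 65.67, 46.67, 40.00),
    "LLaVA-OneVision-72B": (48.33, 59.67, 42.33, 43.00),
    "Random choice": (14.00, 12.00, 14.00, 16.00),
}

# Rows whose reported overall differs from the track mean by more than rounding error.
mismatches = {
    name: (overall, (p + c + a) / 3)
    for name, (overall, p, c, a) in rows.items()
    if abs((p + c + a) / 3 - overall) > 0.01
}

# Perception-to-adaptation drop in percentage points.
drops = {name: p - a for name, (_, p, _, a) in rows.items()}

print(mismatches)
for name, drop in drops.items():
    print(f"{name}: {drop:.2f}")
```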
| Entity | Δknowledge |
|---|---|
| Human expert | +33.1% |
| GPT-4o | +15.6% |
| Claude-3.5-Sonnet | +11.4% |
| VILA-1.5-40B | +9.4% |
| Gemini 1.5 Pro | +8.7% |
| LongVA | -7.0% |
| InternVL2-8B | -8.5% |
The Δknowledge ranking is not the same as the overall-accuracy ranking. Claude-3.5-Sonnet beats GPT-4o on overall accuracy but loses to it on Δknowledge, suggesting Claude has stronger pretrained knowledge while GPT-4o picks up more new information from the video[1].
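To see why two models with the same adaptation score can differ on Δknowledge, the formula can be inverted to recover the pre-video accuracy it implies. The sketch below assumes the adaptation column is the Acc_after used in the metric and works from rounded published figures, so the outputs are approximate implied values, not numbers reported in the paper. The helper name `implied_acc_before` is hypothetical.

```python
def implied_acc_before(acc_after: float, delta_k: float) -> float:
    """Solve delta_k = (after - before) / (100 - before) * 100 for before.

    All quantities are in percent. With d = delta_k / 100:
        before = (after - 100 * d) / (1 - d)
    """
    d = delta_k / 100.0
    if d >= 1.0:
        raise ValueError("delta_k must be below 100%")
    return (acc_after - 100.0 * d) / (1.0 - d)


# Adaptation accuracy and Δknowledge as listed in the tables above.
gpt4o_before = implied_acc_before(acc_after=55.67, delta_k=15.6)
claude_before = implied_acc_before(acc_after=55.67, delta_k=11.4)
print(f"GPT-4o: {gpt4o_before:.2f}  Claude-3.5-Sonnet: {claude_before:.2f}")
```

Under these assumptions Claude-3.5-Sonnet's implied question-only baseline comes out a few points above GPT-4o's, which is the direction the paragraph above describes.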
The paper also tracks how often an adaptation answer flips after the video. Human experts flipped wrong to right 40.4% of the time and right to wrong 10.7% of the time. GPT-4o flipped wrong to right 28.0% of the time and right to wrong 13.3% of the time. LongVA flipped wrong to right only 13.6% of the time but right to wrong 54.0% of the time, which is the strongest evidence that the video confused rather than helped that model[1].
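The flip rates relate pre-video and post-video accuracy through a simple identity, sketched below. This assumes each rate is conditional on the pre-video outcome (wrong-to-right as a share of initially wrong answers, right-to-wrong as a share of initially right answers), which is one reading of the figures above. The 40% starting accuracy is a hypothetical value chosen for illustration.

```python
def post_video_accuracy(acc_before: float, wrong_to_right: float, right_to_wrong: float) -> float:
    """Post-video accuracy in percent, given conditional flip rates in percent."""
    kept_right = acc_before * (1.0 - right_to_wrong / 100.0)
    newly_right = (100.0 - acc_before) * (wrong_to_right / 100.0)
    return kept_right + newly_right


# Same hypothetical 40% baseline, with flip rates quoted in the paragraph above.
human_like = post_video_accuracy(40.0, wrong_to_right=40.4, right_to_wrong=10.7)
longva_like = post_video_accuracy(40.0, wrong_to_right=13.6, right_to_wrong=54.0)
print(f"{human_like:.2f} {longva_like:.2f}")
```

With the human flip rates the hypothetical baseline rises after the video; with LongVA's rates it falls, which is how a negative Δknowledge arises.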
| Discipline | Human | Claude-3.5-Sonnet | GPT-4o | Aria |
|---|---|---|---|---|
| Art | 80.95 | 66.67 | 69.52 | 71.43 |
| Business | 78.79 | 75.00 | 66.88 | 47.73 |
| Science | 74.24 | 56.06 | 51.55 | 44.70 |
| Medicine | 70.54 | 58.14 | 64.76 | 58.92 |
| Humanities | 84.76 | 75.24 | 69.52 | 62.86 |
| Engineering | 69.91 | 66.08 | 57.13 | 43.66 |
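Science and engineering are the hardest disciplines for the models shown: science is the lowest-scoring discipline for Claude-3.5-Sonnet and GPT-4o, and engineering is the lowest for Aria. Humanities is at or near the top for every model, while business varies widely, from 75.00 for Claude-3.5-Sonnet to 47.73 for Aria. The authors attribute the science and engineering gap to numeric and symbolic reasoning in those adaptation questions, a known weak point for video LMMs[1].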
One ablation in the paper looks at what happens when audio transcripts (generated by Whisper) are appended to the prompt[1]. The transcripts improve perception and comprehension scores noticeably, because narration often spells out information shown on screen. On the adaptation track, the transcripts hurt several models. The authors call this a trade-off: audio helps the model understand the lecture but anchors it to the original example rather than the new scenario in the adaptation question. The finding has been cited in later video-LMM work as evidence that audio integration is harder than it looks for transfer tasks.
For Claude-3.5-Sonnet, the authors categorized failure modes on the adaptation track[1].
| Error type | Share |
|---|---|
| Method adaptation error | 64% |
| Question misreading | 15% |
| Method selection error | 8% |
| Other (refusal, annotation issues, extraction failures) | 13% |
Method adaptation errors dominate. The model picks the right strategy from the video but fails to apply it to the new problem, matching the broader pattern that current LMMs can recognize a procedure but struggle to execute it on a fresh input.
The GitHub leaderboard tracks results as new models ship[2]. As of early 2026, the top entries are dominated by reasoning-tuned closed models.
| Rank | Model | Overall | Notes |
|---|---|---|---|
| 1 | GPT-5-thinking | 84.6 | OpenAI reasoning model |
| 2 | Gemini 2.5 Pro | 83.6 | Google DeepMind reasoning model |
| 3 | OpenAI o3 | 83.3 | OpenAI reasoning model |
| 4 | Keye-VL-1.5-8B | 66.00 | +0.0% Δknowledge |
| 5 | Claude-3.5-Sonnet | 65.78 | +11.4% Δknowledge (paper baseline) |
| 6 | Kimi-VL-A3B-Thinking-2506 | 65.22 | +3.5% Δknowledge |
| 7 | GPT-4o | 61.22 | +15.6% Δknowledge |
| 8 | Qwen-2.5-VL-72B | 60.22 | +9.7% Δknowledge |
Newer reasoning models close most of the overall-accuracy gap with humans, but per-stage and Δknowledge numbers for them have been published less consistently. Notably, Keye-VL-1.5-8B reaches 66.0% overall while posting a Δknowledge of +0.0%, meaning that watching the video did not raise its adaptation accuracy above its question-only baseline; its adaptation performance rests on prior knowledge rather than on learning during the run[2]. This split is exactly what the metric was designed to expose.
| Benchmark | Modality | Distinguishing feature vs Video-MMMU |
|---|---|---|
| MMMU | Image, text | Static images, no learning signal |
| MMMU-Pro | Image, text | Harder MMMU, no Δ metric |
| Video-MME | Video, audio, text | Comprehension only, no Δ metric |
| MMVU | Video, text | More videos and subjects, no before/after split |
| MVBench | Video | Action and motion focus |
| LVBench | Long video | Hours-long videos, no learning signal |
| EgoSchema | Egocentric video | First-person activity recognition |
Video-MMMU's distinguishing piece is the before/after protocol on the adaptation track. Most other video benchmarks score a single pass and ignore what the model knew going in.
Video-MMMU is one of the first benchmarks to operationalize knowledge acquisition rather than knowledge recall. Several model release notes from 2025 and early 2026 cite it: Moonshot AI, Alibaba's Qwen team, and DAMO Academy have reported Video-MMMU scores in their model cards[2]. The Δknowledge framing has also pushed subsequent video-LMM work toward reporting a learning-gain number instead of only a final accuracy.
The benchmark also feeds educational AI research. The adaptation track is essentially a small test of whether a model can act as a tutor that watches a lecture and helps a student work through a related problem. Even strong models cluster well below human Δknowledge, suggesting automated tutoring built on current LMMs will not match a human teacher's adaptive ability without further work.
| Limitation | Description |
|---|---|
| English only | All videos and questions are in English |
| Six disciplines | Coverage is broad but not exhaustive; specialized fields like law, agriculture, and architecture are not represented |
| Three questions per video | Statistical resolution per video is limited |
| Public video sources | Sources vary in quality and presentation style |
| Compute-heavy | Running the full benchmark on long-context video models is expensive |
The authors flag these limitations as starting points for follow-up work rather than fundamental flaws[1].