Video-MMMU

Overview
Full name	Video Multi-Modal Multi-disciplinary Understanding
Abbreviation	Video-MMMU, VideoMMMU
Description	A multi-modal benchmark that evaluates how Large Multimodal Models acquire and apply knowledge from professional educational videos across six disciplines
Release date	2025-01-23 (arXiv v1)
Latest version	1.0
Authors	Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, Ziwei Liu
Affiliations	S-Lab, Nanyang Technological University; Carnegie Mellon University
Maintainer	EvolvingLMMs-Lab
Technical Details
Type	Video understanding, knowledge acquisition, multi-modal learning
Modality	Video, image, text, audio
Task format	Multiple-choice and open-ended questions tied to lecture-style videos
Number of disciplines	6 disciplines, 30 subjects
Total examples	300 videos, 900 questions (3 per video)
Average video length	About 506 seconds (roughly 8.4 minutes)
Evaluation metrics	Accuracy, Δknowledge (normalized learning gain)
Domains	Art, Business, Science, Medicine, Humanities, Engineering
Languages	English
Performance
Human expert (overall)	74.44%
Random baseline	14.00%
Human Δknowledge	33.1%
Best model in paper	Claude-3.5-Sonnet, 65.78% overall, +11.4% Δknowledge
GPT-4o (paper)	61.22% overall, +15.6% Δknowledge
Latest leaderboard top	GPT-5-thinking at 84.6% (GitHub leaderboard, 2026)
Saturated	No
Resources
Website	videommmu.github.io
Paper	arXiv:2501.13826
GitHub	EvolvingLMMs-Lab/VideoMMMU
Dataset	HuggingFace
Evaluation framework	LMMs-Eval

Video-MMMU (Video Multi-Modal Multi-disciplinary Understanding, sometimes written VideoMMMU) is a benchmark that measures whether Large Multimodal Models can acquire new knowledge from professional educational videos and apply it to novel problems. It was introduced by Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu in a paper posted to arXiv on January 23, 2025^[1]. Seven of the authors are from S-Lab at Nanyang Technological University in Singapore; Xiang Yue is from Carnegie Mellon University. The benchmark is maintained as part of the EvolvingLMMs-Lab project on GitHub^[2].

Video-MMMU pairs 300 college-level lecture videos with 900 human-annotated questions across six disciplines (Art, Business, Science, Medicine, Humanities, and Engineering) spread over 30 subjects. Each video has three questions that test progressively harder cognitive abilities: perception, comprehension, and adaptation. The paper introduces Δknowledge, a metric that measures what share of the headroom above its pre-video baseline a model closes after watching the video^[1].

Background and motivation

The benchmark builds on the MMMU family of evaluations, which started with a static-image benchmark covering college-level material across many disciplines. Video-MMMU keeps the multi-discipline structure of MMMU but moves the modality from images to lecture videos. The authors argue that prior video benchmarks (such as Video-MME, MVBench, LVBench, and EgoSchema) mostly test perception or temporal reasoning rather than learning^[1].

The team frames its design around a simple cognitive model: people perceive information, comprehend underlying concepts, and adapt that knowledge to new problems. Video-MMMU encodes that loop into the question structure for every video.

Dataset composition

Disciplines and subjects

Video-MMMU has 50 videos per discipline, with five subjects per discipline and ten videos per subject^[1].

Discipline	Videos	Example topics
Art	50	Art history, art theory, design, music, film
Business	50	Economics, finance, accounting, management, marketing
Science	50	Physics, chemistry, biology, computer science, mathematics
Medicine	50	Anatomy, pathology, pharmacology, clinical, public health
Humanities	50	History, philosophy, literature, psychology, sociology
Engineering	50	Mechanical, electrical, civil, software, materials

Video characteristics

The authors collected lecture-style videos from publicly available educational sources. The average duration is about 506.2 seconds, roughly 8.4 minutes per video^[1]. Videos fall into two broad styles: concept-introduction videos that explain a topic, and problem-solving videos that demonstrate a worked example step by step. Comprehension and adaptation questions look different depending on which style a video uses.

Question structure

Each video has three questions, one per cognitive stage^[1].

Stage	Question form	Sub-types in the paper
Perception	Multiple choice	OCR (visual text), ASR (spoken content)
Comprehension	Multiple choice (4 to 10 options)	Concept comprehension, problem-solving strategy comprehension
Adaptation	Mixed (multiple choice and open-ended)	Case study analysis, problem-solving adaptation

The average question is 75.7 words long, reflecting that adaptation questions often introduce a new scenario in the prompt^[1]. Annotators were instructed to make adaptation questions solvable only with knowledge presented in the video.

The Δknowledge metric

Why a delta

A raw post-video accuracy score conflates what the model knew before the video with what it picked up from watching. A frontier model with strong pretraining can answer many questions cold, so its post-video score will look high regardless of whether the video helped. Δknowledge separates those two effects^[1].

The formula is:

Δknowledge = (Acc_after - Acc_before) / (100% - Acc_before) × 100%

Term	Meaning
Acc_before	Accuracy on adaptation questions when the model is shown only the question, not the video
Acc_after	Accuracy on the same adaptation questions when the model is shown the video plus the question
100% - Acc_before	Headroom for improvement after the baseline
Result	Percentage of the available headroom that the model actually captures

The metric is reported on the adaptation track only, because perception and comprehension questions are unanswerable without the video^[1].
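
As a minimal sketch of the formula above, the following Python function computes Δknowledge from two accuracy percentages. The function name delta_knowledge and the input numbers in the usage lines are illustrative and do not come from the paper.

```python
def delta_knowledge(acc_before: float, acc_after: float) -> float:
    """Return the normalized learning gain in percent.

    Both arguments are accuracies in percent (0 to 100) on the adaptation
    questions: acc_before without the video, acc_after with it.
    """
    if not (0.0 <= acc_before <= 100.0 and 0.0 <= acc_after <= 100.0):
        raise ValueError("accuracies must lie between 0 and 100")
    headroom = 100.0 - acc_before
    if headroom == 0.0:
        # A perfect pre-video score leaves no headroom, so the ratio is undefined.
        raise ValueError("acc_before is 100%, so there is no headroom to normalize by")
    return (acc_after - acc_before) / headroom * 100.0


# Illustrative inputs only (not results from the paper).
gain = delta_knowledge(50.0, 60.0)  # closes 10 of the 50 available points
loss = delta_knowledge(40.0, 34.0)  # accuracy fell after the video, so the value is negative
```

Raising an error for a 100% baseline is one possible design choice; the source does not say how that edge case is handled.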

What the human number looks like

Human experts averaged 33.1% Δknowledge^[1]. After watching the video, they closed about a third of the gap between baseline accuracy and a perfect adaptation score. Model numbers are well below this. GPT-4o reached 15.6%, Claude-3.5-Sonnet 11.4%, VILA-1.5-40B 9.4%, and Gemini 1.5 Pro 8.7%. Two open-source models, LongVA and InternVL2-8B, posted negative Δknowledge values (-7.0% and -8.5%), meaning accuracy dropped after they saw the video^[1]. The authors read negative deltas as the extra context distracting the model rather than helping it.

Evaluation protocol

The protocol has two passes per question^[1].

1. Pre-video pass: the model is given the question text alone (plus the answer options for multiple choice) and produces an answer.
2. Post-video pass: the model is given the same question plus the relevant video and produces an answer.

For perception and comprehension questions, only the post-video pass is scored. For adaptation questions, both passes are scored and the difference feeds the Δknowledge calculation. The code is integrated with LMMs-Eval^[3], so a typical run uses accelerate launch -m lmms_eval with the videommmu task.
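
The scoring rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the LMMs-Eval implementation: the Question dataclass, the AnswerFn model interface, and the run_protocol function are hypothetical names, answers are compared by exact string match (open-ended questions would need a more careful scorer), and the pre-video pass is only run where it is scored.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical model interface: (question_text, video_path_or_None) -> answer string.
AnswerFn = Callable[[str, Optional[str]], str]


@dataclass
class Question:
    stage: str       # "perception", "comprehension", or "adaptation"
    text: str        # question text, including any answer options
    video_path: str  # path to the lecture video for this question
    gold: str        # reference answer


def accuracy(flags: Sequence[bool]) -> float:
    """Percentage of True values; 0.0 for an empty sequence."""
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def run_protocol(model: AnswerFn, questions: Sequence[Question]) -> dict:
    post = {"perception": [], "comprehension": [], "adaptation": []}
    pre_adaptation = []
    for q in questions:
        if q.stage == "adaptation":
            # Pre-video pass: question only, scored for adaptation questions.
            pre_adaptation.append(model(q.text, None) == q.gold)
        # Post-video pass: question plus video, scored for every stage.
        post[q.stage].append(model(q.text, q.video_path) == q.gold)

    results = {stage: accuracy(flags) for stage, flags in post.items()}
    before = accuracy(pre_adaptation)
    results["adaptation_before"] = before
    # Δknowledge is undefined when the pre-video score leaves no headroom.
    results["delta_knowledge"] = (
        (results["adaptation"] - before) / (100.0 - before) * 100.0
        if before < 100.0
        else None
    )
    return results


# Toy usage with a stub model that only answers "A" when it sees a video.
def stub_model(text: str, video: Optional[str]) -> str:
    return "A" if video is not None else "B"


questions = [
    Question("perception", "q1", "v1.mp4", "A"),
    Question("comprehension", "q2", "v1.mp4", "B"),
    Question("adaptation", "q3", "v1.mp4", "A"),
    Question("adaptation", "q4", "v2.mp4", "A"),
]
results = run_protocol(stub_model, questions)
```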

Results from the original paper

Overall accuracy across stages

The v1 paper reported results for around ten models on the full 900-question benchmark^[1].

Model	Overall	Perception	Comprehension	Adaptation
Human expert	74.44	84.33	78.67	60.33
Claude-3.5-Sonnet	65.78	72.00	69.67	55.67
GPT-4o	61.22	66.00	62.00	55.67
Gemini 1.5 Pro	53.89	59.00	53.33	49.33
Aria	50.78	65.67	46.67	40.00
LLaVA-OneVision-72B	48.33	59.67	42.33	43.00
Random choice	14.00	12.00	14.00	16.00

The general pattern is that perception is easier than comprehension, and comprehension is easier than adaptation. LLaVA-OneVision-72B is the one exception in the table, scoring marginally higher on adaptation (43.00) than on comprehension (42.33). The drop from perception to adaptation ranges from about 10 to 26 percentage points, consistent with the authors' claim that knowledge transfer is the hard part.
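
Because each stage contributes one question per video (300 questions per stage), the Overall column is the unweighted mean of the three stage columns. The short Python check below recomputes it from the table and also computes the perception-to-adaptation drop for each row. The dictionary layout and variable names are illustrative.

```python
# (perception, comprehension, adaptation) accuracies in percent, copied from the table above.
stage_scores = {
    "Human expert": (84.33, 78.67, 60.33),
    "Claude-3.5-Sonnet": (72.00, 69.67, 55.67),
    "GPT-4o": (66.00, 62.00, 55.67),
    "Gemini 1.5 Pro": (59.00, 53.33, 49.33),
    "Aria": (65.67, 46.67, 40.00),
    "LLaVA-OneVision-72B": (59.67, 42.33, 43.00),
}

# Equal question counts per stage make the overall score a plain mean.
overall = {name: round(sum(s) / 3, 2) for name, s in stage_scores.items()}

# Drop from perception to adaptation, in percentage points.
drop = {name: round(s[0] - s[2], 2) for name, s in stage_scores.items()}

for name in stage_scores:
    print(f"{name:22s} overall={overall[name]:.2f} drop={drop[name]:.2f}")
```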

Δknowledge across the same models

Entity	Δknowledge
Human expert	+33.1%
GPT-4o	+15.6%
Claude-3.5-Sonnet	+11.4%
VILA-1.5-40B	+9.4%
Gemini 1.5 Pro	+8.7%
LongVA	-7.0%
InternVL2-8B	-8.5%

The Δknowledge ranking is not the same as the overall-accuracy ranking. Claude-3.5-Sonnet beats GPT-4o on overall accuracy but loses to it on Δknowledge, suggesting Claude has stronger pretrained knowledge while GPT-4o picks up more new information from the video^[1].
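
Rearranging the Δknowledge formula gives the pre-video adaptation accuracy implied by the reported numbers: Acc_before = (Acc_after - Δknowledge) / (1 - Δknowledge / 100), with all quantities in percent. The Python sketch below applies this to the two models. The resulting values are back-solved here from the rounded figures in the tables above rather than taken from the paper, so they should be read as approximate; the function name is illustrative.

```python
def implied_acc_before(acc_after: float, delta_knowledge: float) -> float:
    """Invert delta = (after - before) / (100 - before) * 100 for `before`.

    All values are percentages. Undefined when delta_knowledge is 100.
    """
    if delta_knowledge == 100.0:
        raise ValueError("a delta of 100% does not determine acc_before")
    return (acc_after - delta_knowledge) / (1.0 - delta_knowledge / 100.0)


# Post-video adaptation accuracy and Δknowledge, both from the tables above.
claude_before = implied_acc_before(55.67, 11.4)
gpt4o_before = implied_acc_before(55.67, 15.6)
print(f"Claude-3.5-Sonnet: about {claude_before:.1f}% before the video")
print(f"GPT-4o:            about {gpt4o_before:.1f}% before the video")
```

Both models end at the same post-video adaptation accuracy, so the model with the lower implied starting point (GPT-4o, by roughly 2.5 points under these assumptions) is the one with the larger Δknowledge.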

Wrong-to-right and right-to-wrong analysis

The paper also tracks how often an adaptation answer flips after the video. Human experts flipped wrong to right 40.4% of the time and right to wrong 10.7% of the time. GPT-4o flipped wrong to right 28.0% of the time and right to wrong 13.3% of the time. LongVA flipped wrong to right only 13.6% of the time but right to wrong 54.0% of the time, which is the strongest evidence that the video confused rather than helped that model^[1].
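
A minimal Python sketch of how such flip rates could be computed from paired per-question outcomes is shown below. It assumes each rate is conditioned on the pre-video outcome (wrong-to-right as a share of initially wrong answers, right-to-wrong as a share of initially right answers); that denominator convention, the function name flip_rates, and the toy inputs are assumptions of this sketch.

```python
from typing import Sequence, Tuple


def flip_rates(before: Sequence[bool], after: Sequence[bool]) -> Tuple[float, float]:
    """Return (wrong_to_right, right_to_wrong) rates in percent.

    before[i] and after[i] say whether question i was answered correctly
    without and with the video. Each rate is conditioned on the pre-video
    outcome, which is an assumption of this sketch.
    """
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    wrong_before = sum(1 for b in before if not b)
    right_before = len(before) - wrong_before
    wrong_to_right = sum(1 for b, a in zip(before, after) if not b and a)
    right_to_wrong = sum(1 for b, a in zip(before, after) if b and not a)
    # Report 0.0 when a rate's denominator is empty.
    w2r = 100.0 * wrong_to_right / wrong_before if wrong_before else 0.0
    r2w = 100.0 * right_to_wrong / right_before if right_before else 0.0
    return w2r, r2w


# Illustrative outcomes for five questions (not data from the paper).
before = [False, False, False, True, True]
after = [True, False, False, True, False]
w2r, r2w = flip_rates(before, after)
```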

Discipline-level performance

Discipline	Human	Claude-3.5-Sonnet	GPT-4o	Aria
Art	80.95	66.67	69.52	71.43
Business	78.79	75.00	66.88	47.73
Science	74.24	56.06	51.55	44.70
Medicine	70.54	58.14	64.76	58.92
Humanities	84.76	75.24	69.52	62.86
Engineering	69.91	66.08	57.13	43.66

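Science and engineering are the hardest disciplines overall: science is the lowest-scoring discipline for Claude-3.5-Sonnet and GPT-4o, and engineering is the lowest for Aria. Humanities is among the two best disciplines for all three models, while business is strong for Claude-3.5-Sonnet and GPT-4o but weak for Aria. The authors attribute the science and engineering gap to numeric and symbolic reasoning in those adaptation questions, a known weak point for video LMMs^[1].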

The audio-track trade-off

One ablation in the paper looks at what happens when audio transcripts (generated by Whisper) are appended to the prompt^[1]. The transcripts improve perception and comprehension scores noticeably, because narration often spells out information shown on screen. On the adaptation track, the transcripts hurt several models. The authors call this a trade-off: audio helps the model understand the lecture but anchors it to the original example rather than the new scenario in the adaptation question. The finding has been cited in later video-LMM work as evidence that audio integration is harder than it looks for transfer tasks.

Error analysis

For Claude-3.5-Sonnet, the authors categorized failure modes on the adaptation track^[1].

Error type	Share
Method adaptation error	64%
Question misreading	15%
Method selection error	8%
Other (refusal, annotation issues, extraction failures)	13%

Method adaptation errors dominate. The model picks the right strategy from the video but fails to apply it to the new problem, matching the broader pattern that current LMMs can recognize a procedure but struggle to execute it on a fresh input.

Updated leaderboard

The GitHub leaderboard tracks results as new models ship^[2]. As of early 2026, the top entries are dominated by reasoning-tuned closed models.

Rank	Model	Overall	Notes
1	GPT-5-thinking	84.6	OpenAI reasoning model
2	Gemini 2.5 Pro	83.6	Google DeepMind reasoning model
3	OpenAI o3	83.3	OpenAI reasoning model
4	Keye-VL-1.5-8B	66.00	+0.0% Δknowledge
5	Claude-3.5-Sonnet	65.78	+11.4% Δknowledge (paper baseline)
6	Kimi-VL-A3B-Thinking-2506	65.22	+3.5% Δknowledge
7	GPT-4o	61.22	+15.6% Δknowledge
8	Qwen-2.5-VL-72B	60.22	+9.7% Δknowledge

The newest reasoning models exceed the paper's human-expert overall accuracy (74.44%), but per-stage and Δknowledge numbers for them have been published less consistently. Notably, Keye-VL-1.5-8B reaches 66.0% overall while posting a Δknowledge of +0.0%, meaning its adaptation accuracy did not improve after watching the video, so its adaptation score reflects prior knowledge rather than anything learned from the lecture^[2]. A split like this is what the metric was designed to expose.

Comparison with related benchmarks

Benchmark	Modality	Distinguishing feature vs Video-MMMU
MMMU	Image, text	Static images, no learning signal
MMMU-Pro	Image, text	Harder MMMU, no Δ metric
Video-MME	Video, audio, text	Comprehension only, no Δ metric
MMVU	Video, text	More videos and subjects, no before/after split
MVBench	Video	Action and motion focus
LVBench	Long video	Hours-long videos, no learning signal
EgoSchema	Egocentric video	First-person activity recognition

Video-MMMU's distinguishing piece is the before/after protocol on the adaptation track. Most other video benchmarks score a single pass and ignore what the model knew going in.

Significance and reception

Video-MMMU is one of the first benchmarks to operationalize knowledge acquisition rather than knowledge recall. Several model release notes from 2025 and early 2026 cite it: Moonshot AI, Alibaba's Qwen team, and DAMO Academy have reported Video-MMMU scores in their model cards^[2]. The Δknowledge framing has also pushed subsequent video-LMM work toward reporting a learning-gain number instead of only a final accuracy.

The benchmark also feeds educational AI research. The adaptation track is essentially a small test of whether a model can act as a tutor that watches a lecture and helps a student work through a related problem. Even strong models cluster well below human Δknowledge, suggesting automated tutoring built on current LMMs will not match a human teacher's adaptive ability without further work.

Limitations

Limitation	Description
English only	All videos and questions are in English
Six disciplines	Coverage is broad but not exhaustive; specialized fields like law, agriculture, and architecture are not represented
Three questions per video	Statistical resolution per video is limited
Public video sources	Sources vary in quality and presentation style
Compute-heavy	Running the full benchmark on long-context video models is expensive

The authors flag these limitations as starting points for follow-up work rather than fundamental flaws^[1].

References

1. Hu, K., Wu, P., Pu, F., Xiao, W., Zhang, Y., Yue, X., Li, B., and Liu, Z. (2025). "Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos." arXiv:2501.13826. https://arxiv.org/abs/2501.13826
2. EvolvingLMMs-Lab. "VideoMMMU GitHub repository." https://github.com/EvolvingLMMs-Lab/VideoMMMU
3. LMMs-Lab. "LMMs-Eval framework." https://github.com/EvolvingLMMs-Lab/lmms-eval
4. Video-MMMU project page. https://videommmu.github.io/
5. LMMs-Lab. "Video-MMMU: Evaluating Knowledge Acquisition from Educational Videos." Blog post. https://www.lmms-lab.com/posts/videommmu/
6. HuggingFace dataset card. "lmms-lab/VideoMMMU." https://huggingface.co/datasets/lmms-lab/VideoMMMU
