Video-MME (Video Multi-Modal Evaluation) is the first comprehensive, full-spectrum benchmark designed to evaluate multimodal large language models (MLLMs) on video analysis tasks. Introduced in a 2024 paper by Chaoyou Fu and collaborators from Nanjing University, the Chinese Academy of Sciences, Peking University, the Chinese University of Hong Kong, and several other institutions, Video-MME addresses a critical gap in the evaluation landscape by testing models across diverse video types, temporal durations, and data modalities. The benchmark was accepted at CVPR 2025 and has since become an industry-standard measure for assessing multimodal video understanding capabilities.
Video-MME consists of 900 manually selected videos spanning 254 total hours, paired with 2,700 expert-annotated question-answer pairs. Its design emphasizes four distinguishing features: diversity in video types across 6 primary visual domains and 30 subfields, temporal breadth from 11-second clips to hour-long recordings, multi-modal input integration (video frames, subtitles, and audio), and rigorous human annotation quality. Major AI companies including OpenAI, Google, and Moonshot AI have adopted Video-MME as a key benchmark for reporting the video understanding performance of their flagship models.
Before Video-MME, the evaluation landscape for video-capable multimodal models suffered from several limitations. Existing benchmarks such as MVBench, EgoSchema, and Video-Bench primarily focused on short video clips, typically under three minutes in length. This narrow temporal scope failed to capture the challenges that arise when models must process and reason over longer video content, where maintaining context, tracking narrative threads, and synthesizing information across extended timelines become essential.
Additionally, many prior benchmarks relied on videos sourced from existing academic datasets rather than open-domain content, limiting the diversity of real-world scenarios tested. Most also evaluated only visual frame understanding, ignoring the role of other modalities like subtitles and audio in comprehensive video comprehension.
The Video-MME authors identified several specific gaps in previous work:

- Limited temporal coverage, with most benchmarks restricted to clips of a few minutes or less
- Reliance on videos drawn from existing academic datasets rather than open-domain content, narrowing the range of real-world scenarios tested
- Evaluation of visual frames alone, without subtitles or audio
- Annotation pipelines that leaned on automatic generation rather than expert human labeling
These limitations motivated the creation of Video-MME as a "full-spectrum" evaluation tool that systematically addresses each shortcoming through careful dataset design and annotation methodology.
Video-MME comprises 900 videos with a combined duration of 254 hours. Each video is paired with exactly 3 expert-crafted questions, yielding 2,700 question-answer pairs in total. All questions follow a multiple-choice format with 4 answer options (A through D). The dataset also includes 744 subtitle files (every long-duration video is accompanied by subtitles) and 900 corresponding audio files.
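For concreteness, one benchmark entry can be pictured as a video paired with three four-option questions. The sketch below uses hypothetical field names chosen to mirror the statistics above; it is not the dataset's actual schema.

```python
# Illustrative sketch of one Video-MME entry (hypothetical field names).
example_entry = {
    "video_id": "demo_0001",
    "duration_category": "medium",   # short / medium / long
    "domain": "Knowledge",
    "has_subtitles": True,
    "questions": [
        {
            "question": "What experiment does the presenter perform first?",
            "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
            "answer": "B",
        },
        # ... two more questions, for a total of 3 per video
    ],
}
```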
A defining feature of Video-MME is its systematic coverage of three temporal categories:
| Duration Category | Time Range | Description |
|---|---|---|
| Short | Less than 2 minutes | Brief clips capturing single events or actions, with certificate lengths averaging around 26 seconds per question |
| Medium | 4 to 15 minutes | Mid-length videos requiring tracking of multiple events, with median certificate lengths around 167 seconds |
| Long | 30 to 60 minutes | Extended recordings demanding sustained context and temporal reasoning, with median certificate lengths of approximately 891 seconds |
The overall duration range spans from 11 seconds to 1 hour. The average video duration across the entire dataset is approximately 1,018 seconds (about 17 minutes). Each duration category contains 300 videos, ensuring balanced representation.
The concept of "certificate length" is central to Video-MME's design. A certificate is defined as the minimum set of video sub-clips necessary and sufficient to convince a human verifier that the marked annotation is correct. The substantially longer certificate lengths for medium and long videos, compared to benchmarks like EgoSchema (which has a certificate length of roughly 100 seconds), confirm that Video-MME genuinely requires deep engagement with extended video content rather than relying on isolated frames.
Video-MME spans 6 primary visual domains, further divided into 30 fine-grained subfields:
| Domain | Example Subfields |
|---|---|
| Knowledge | Astronomy, technology, science education, documentaries |
| Film and Television | News reports, TV dramas, movie clips, interviews |
| Sports Competition | Football, basketball, esports, track and field |
| Artistic Performance | Magic shows, music performances, dance, theater |
| Life Record | Fashion, cooking, travel vlogs, daily activities |
| Multilingual | Videos in languages other than English, covering varied cultural contexts |
All videos were sourced from YouTube across 30 categorized tags, ensuring a wide variety of real-world content. This open-domain approach contrasts with benchmarks that draw from curated academic datasets, providing a more realistic assessment of how models handle the diversity of video content encountered in practical applications.
Beyond video frames, Video-MME integrates two additional modalities:

- Subtitles: 744 subtitle files accompany the videos, letting models draw on dialogue and narration text alongside visual content
- Audio: 900 audio files corresponding to the videos enable evaluation of models that can process the audio track directly
This multi-modal design allows researchers to evaluate models under different input configurations: video frames only, video frames with subtitles, and video frames with subtitles and audio.
Video-MME employs a rigorous three-step annotation pipeline:
Video Collection: Videos are sourced from YouTube across the 30 categorized tags, with deliberate selection across the three duration categories. The collection process prioritizes diversity in content type, visual complexity, and linguistic context.
Question-Answer Annotation: Expert annotators with strong English proficiency and extensive research experience watch each video in its entirety before designing 3 questions per video. Each question includes 4 potential answer options. Annotators are instructed to create questions that genuinely require understanding the video content rather than relying on common knowledge or visual shortcuts.
Quality Review: A multi-stage verification process ensures annotation quality. Cross-annotator verification checks for clarity and logical consistency. Questions are then filtered using Gemini 1.5 Pro in a question-only setting (without access to the video) to confirm that the questions cannot be answered from text alone. In this filtering step, the model achieved less than 15% accuracy, confirming that the questions truly depend on video understanding. Questions that failed quality checks were returned to annotators for revision.
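The question-only filtering step can be sketched as follows; `ask_model` is a placeholder for whichever language-model API is used, and the logic is an illustrative reconstruction of the procedure described above, not the authors' actual script.

```python
def filter_text_only_answerable(qa_pairs, ask_model):
    """Flag questions a model can answer correctly without seeing the video.

    qa_pairs: list of dicts with 'question', 'options' (mapping 'A'-'D' to text),
    and 'answer' (the gold option letter).
    ask_model: callable that takes a prompt string and returns the model's reply.
    Returns the questions answered correctly from text alone, which would be
    sent back to annotators for revision.
    """
    flagged = []
    for qa in qa_pairs:
        # Present only the question and options, with no video context.
        prompt = qa["question"] + "\n" + "\n".join(
            f"{letter}. {text}" for letter, text in sorted(qa["options"].items())
        )
        guess = ask_model(prompt).strip().upper()[:1]
        if guess == qa["answer"]:
            flagged.append(qa)
    return flagged
```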
Video-MME encompasses 12 distinct task types that span multiple cognitive levels:
| Task Category | Examples |
|---|---|
| Perception | Action recognition, object recognition, attribute perception |
| Temporal Understanding | Temporal perception, temporal localization, event sequencing |
| Spatial Reasoning | Spatial relationship identification, scene layout understanding |
| Information Synthesis | Information synopsis, plot summarization, cross-segment reasoning |
| Counting | Object counting, event frequency estimation |
| Complex Reasoning | Causal inference, multi-step logical deduction |
Shorter videos tend to emphasize perception-related tasks (identifying objects, actions, or attributes visible in a few frames), while longer videos predominantly feature tasks related to temporal reasoning, information synthesis, and complex multi-step reasoning that requires integrating information from across the full video.
The complexity of questions and answers scales with video duration:
| Metric | Short Videos | Medium Videos | Long Videos |
|---|---|---|---|
| Average question word count | 11.5 | 12.2 | 14.5 |
| Average option word count | 17.2 | 20.6 | 31.0 |
| Average answer word count | 4.0 | 5.0 | 7.5 |
| Average subtitle word count | 198.6 | 1,425.6 | 6,515.6 |
The substantial increase in option and answer length for long videos reflects the greater complexity and specificity required to distinguish correct answers in extended content scenarios.
Models are evaluated under two primary settings:

- Without subtitles: the model receives only sampled video frames
- With subtitles: the model additionally receives the subtitle text associated with the sampled frames
For frame-based evaluation, the benchmark specifies that if a model extracts N frames per video, the N subtitles corresponding to those frame timestamps should be provided when evaluating with subtitles.
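The following sketch illustrates this frame-subtitle alignment rule, assuming subtitles are available as (start, end, text) tuples and frames are sampled uniformly; it is an illustration of the rule rather than the benchmark's official preprocessing code.

```python
def subtitles_for_frames(subtitles, video_duration, num_frames):
    """Return the subtitle line (if any) covering each sampled frame timestamp.

    subtitles: list of (start_sec, end_sec, text) tuples.
    video_duration: total video length in seconds.
    num_frames: number of frames the model extracts.
    """
    # Uniformly spaced timestamps, one per sampled frame.
    timestamps = [(i + 0.5) * video_duration / num_frames for i in range(num_frames)]
    selected = []
    for t in timestamps:
        # Take the first subtitle whose time span covers this frame, else empty.
        line = next((text for start, end, text in subtitles if start <= t <= end), "")
        selected.append(line)
    return selected
```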
Evaluation uses straightforward accuracy: the percentage of questions for which the model selects the correct answer option. Scores are reported across multiple dimensions:

- By duration category (short, medium, long) and overall across all 2,700 questions
- With and without subtitle input
- By visual domain and task type for finer-grained analysis
The evaluation pipeline uses automated scripts without reliance on third-party AI models for scoring, ensuring reproducibility and consistency.
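Scoring therefore amounts to counting exact matches on the predicted option letter. The sketch below groups accuracy by duration category; the record field names are illustrative rather than the official script's format.

```python
from collections import defaultdict

def accuracy_by_duration(records):
    """Compute accuracy per duration category and overall.

    records: iterable of dicts with 'duration' ('short'/'medium'/'long'),
    'answer' (gold option letter), and 'prediction' (model's option letter).
    """
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        hit = r["prediction"].strip().upper() == r["answer"].strip().upper()
        for key in (r["duration"], "overall"):
            total[key] += 1
            if hit:
                correct[key] += 1
    return {key: correct[key] / total[key] for key in total}
```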
Models receive a standardized prompt that includes the subtitle context (when applicable), the question text, and the four answer options labeled A through D. The model is asked to respond with a single letter corresponding to its chosen answer.
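A minimal sketch of how such a prompt could be assembled is shown below; the benchmark's exact wording may differ, so the template is illustrative only.

```python
def build_prompt(question, options, subtitles=None):
    """Assemble a multiple-choice prompt; options maps 'A'-'D' to option text."""
    parts = []
    if subtitles:
        # Subtitle context is included only in the with-subtitles setting.
        parts.append("Subtitles:\n" + "\n".join(subtitles))
    parts.append("Question: " + question)
    parts.extend(f"{letter}. {text}" for letter, text in sorted(options.items()))
    parts.append("Answer with only the letter (A, B, C, or D) of the correct option.")
    return "\n".join(parts)
```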
The initial Video-MME paper evaluated a range of commercial and open-source models. The results revealed a significant performance hierarchy and several notable patterns.
Commercial Models:
| Model | Short (w/o subs) | Short (w/ subs) | Medium (w/o subs) | Medium (w/ subs) | Long (w/o subs) | Long (w/ subs) | Overall (w/o subs) | Overall (w/ subs) |
|---|---|---|---|---|---|---|---|---|
| Gemini 1.5 Pro | 81.7% | 84.5% | 74.3% | 81.0% | 67.4% | 77.4% | 75.0% | 81.3% |
| GPT-4o | 80.0% | 82.8% | 70.3% | 76.6% | 65.3% | 72.1% | 71.9% | 77.2% |
| Gemini 1.5 Flash | 78.8% | 79.8% | 68.8% | 74.7% | 61.1% | 68.8% | 70.3% | 75.0% |
| GPT-4V | 70.5% | 73.2% | 55.8% | 59.7% | 53.5% | 56.9% | 59.9% | 63.3% |
Open-Source Video Models:
| Model | Short (w/o subs) | Short (w/ subs) | Medium (w/o subs) | Medium (w/ subs) | Long (w/o subs) | Long (w/ subs) | Overall (w/o subs) | Overall (w/ subs) |
|---|---|---|---|---|---|---|---|---|
| VILA-1.5 | 68.1% | 68.9% | 58.1% | 57.4% | 50.8% | 52.0% | 59.0% | 59.4% |
| LLaVA-NeXT-Video | 61.7% | 65.1% | 50.1% | 52.2% | 44.3% | 47.2% | 52.0% | 54.9% |
| VideoChat2-Mistral | 48.3% | 52.8% | 37.0% | 39.4% | 33.2% | 39.2% | 39.5% | 43.8% |
| ShareGPT4Video | 48.3% | 53.6% | 36.3% | 39.3% | 35.0% | 37.9% | 39.9% | 43.6% |
| Chat-UniVi-V1.5 | 45.7% | 51.2% | 40.3% | 44.6% | 35.8% | 41.8% | 40.6% | 45.9% |
| ST-LLM | 45.7% | 48.4% | 36.8% | 41.4% | 31.3% | 36.9% | 37.9% | 42.3% |
| Video-LLaVA | 45.3% | 46.1% | 38.0% | 40.7% | 36.2% | 38.1% | 39.9% | 41.6% |
Image-Based Models (Applied to Video):
| Model | Short (w/o subs) | Short (w/ subs) | Overall (w/o subs) | Overall (w/ subs) |
|---|---|---|---|---|
| InternVL-Chat-V1.5 | 60.2% | 61.7% | 50.7% | 52.4% |
| Qwen-VL-Max | 55.8% | 57.6% | 51.3% | 51.2% |
| Qwen-VL-Chat | 46.9% | 47.3% | 41.1% | 41.9% |
As models have improved rapidly, the Video-MME leaderboard has continued to evolve. Notable recent scores include:
| Model | Organization | Parameters | Overall Score |
|---|---|---|---|
| Kimi K2.5 | Moonshot AI | 1.0T | 87.4% |
| Gemini 2.5 Pro | Google | Not disclosed | 84.8% |
| video-SALMONN 2+ | Tsinghua / ByteDance | 72B | 81.6% (w/ subs) |
| Gemini 1.5 Pro | Google | Not disclosed | 81.3% (w/ subs) |
| Gemini 1.5 Flash | Google | Not disclosed | 76.1% |
| Qwen3 VL 30B A3B Instruct | Alibaba | 31B | 74.5% |
| Qwen3 VL 30B A3B Thinking | Alibaba | 31B | 73.3% |
| Qwen3 VL 8B Thinking | Alibaba | 9B | 71.8% |
| Qwen3 VL 8B Instruct | Alibaba | 9B | 71.4% |
| GPT-4.1 | OpenAI | Not disclosed | 72.0% (long, w/o subs) |
| Gemini 1.5 Flash 8B | Google | 8B | 66.2% |
| Phi-4-multimodal-instruct | Microsoft | 6B | 55.0% |
OpenAI specifically highlighted Video-MME as an "industry standard measure" of multimodal long-context ability when introducing GPT-4.1 in April 2025, reporting that GPT-4.1 scored 72.0% on the long video, no subtitles category, a 6.7 percentage point improvement over GPT-4o's performance on the same subset.
One of Video-MME's most consistent findings is that all models, both commercial and open-source, show declining accuracy as video duration increases. In the original evaluation, Gemini 1.5 Pro dropped from 81.7% on short videos to 67.4% on long videos (without subtitles), a decline of 14.3 percentage points. This pattern held across every model tested, highlighting long-form video understanding as a fundamental challenge for current architectures.
The performance gap between commercial and open-source models also widens with duration. While the gap on short videos is about 14 percentage points (81.7% for Gemini 1.5 Pro versus 68.1% for VILA-1.5, without subtitles), it grows to roughly 17 points on long videos (67.4% versus 50.8%), suggesting that processing and reasoning over extended temporal contexts is where proprietary models hold their greatest advantage.
The integration of subtitle information consistently boosted model performance across all duration categories. For Gemini 1.5 Pro, subtitles improved overall accuracy by 6.3 percentage points (from 75.0% to 81.3%). The benefit was most pronounced for long videos, where subtitle information provided a 10.0 percentage point improvement.
This finding carries practical implications: it suggests that multi-modal integration, particularly the combination of visual and textual inputs, is essential for robust video understanding. Models that can effectively leverage subtitle or transcript information have a meaningful advantage, especially on longer content where visual information alone becomes harder to track.
The multilingual category showed the largest improvements from subtitle integration, with gains of up to 16.7 percentage points on long videos. This makes sense because subtitles bridge language barriers that would otherwise make non-English video content particularly challenging for models trained primarily on English data.
An unexpected finding was that image-based multimodal models (designed for static image understanding and applied to individual video frames) achieved performance comparable to purpose-built video models. InternVL-Chat-V1.5, an image model, scored 50.7% overall without subtitles, outperforming several video-specific models such as VideoChat2-Mistral (39.5%) and Video-LLaVA (39.9%), though still trailing the strongest video models.
This result led the authors to conclude that "image understanding is the foundation of video understanding," suggesting that strong visual perception on individual frames provides a solid baseline for video tasks, even without explicit temporal modeling. However, the gap between image models and the best video models (and especially commercial models with longer context windows) indicates that temporal reasoning remains crucial for achieving top performance.
Across virtually all models evaluated, counting tasks (estimating the number of objects, people, or events in a video) emerged as a common bottleneck. Both commercial and open-source models struggled significantly with questions requiring precise counting, suggesting that this capability demands improvements in both visual processing (tracking individual entities across frames) and numerical reasoning.
Model performance is not uniform across the six visual domains. In the original evaluation, artistic performance tended to yield the highest scores (81.5% for Gemini 1.5 Pro with subtitles), while sports competition produced the lowest (77.7%). This variation reflects the different cognitive demands of each domain: sports videos require tracking fast-moving objects and understanding domain-specific rules, while artistic performances may be more amenable to visual and audio cues.
Video-MME occupies a distinct position in the landscape of video understanding benchmarks:
| Feature | Video-MME | MVBench | EgoSchema | Video-Bench | TempCompass |
|---|---|---|---|---|---|
| Number of videos | 900 | 4,000 (clips) | 5,000 | 7,585 | 410 |
| Maximum video duration | 1 hour | ~16 seconds | 180 seconds | ~60 seconds | ~120 seconds |
| Average video duration | ~17 minutes | ~16 seconds | 180 seconds | ~56 seconds | ~30 seconds |
| Annotation method | Manual (expert) | Automatic | Manual | Mixed | Manual |
| Subtitle support | Yes | No | No | No | No |
| Audio support | Yes | No | No | No | No |
| Number of QA pairs | 2,700 | 4,000 | 5,000 | 17,036 | 7,540 |
| Open domain | Yes | Yes | No (egocentric) | Partial | Yes |
| Duration categories | Short, medium, long | Single | Single | Single | Single |
Video-MME's primary advantages are its coverage of long-form video content, its multi-modal evaluation approach (testing with and without subtitles and audio), its expert manual annotation quality, and its systematic three-tier duration design. While other benchmarks may have more questions or videos, none combine the temporal range, modal breadth, and annotation rigor of Video-MME.
Video-MME has achieved rapid adoption as a standard benchmark across the AI industry. Several major model releases have used Video-MME scores as primary evidence of video understanding capabilities:

- OpenAI cited Video-MME results when introducing GPT-4.1
- Google has reported Video-MME scores for the Gemini 1.5 and 2.5 model families
- Moonshot AI reports Video-MME performance for its Kimi models
- Alibaba reports Video-MME results for the Qwen VL series
The benchmark is integrated into major evaluation frameworks including VLMEvalKit and LMMs-Eval, and its dataset is publicly available through HuggingFace (LMMS-Lab), making it accessible for researchers and developers to evaluate their own models.
The Video-MME dataset, including videos, annotations, subtitles, and audio files, is publicly available through the project's HuggingFace repository under the LMMS-Lab organization. Evaluation scripts are provided through the official GitHub repository (MME-Benchmarks/Video-MME), and the benchmark is compatible with popular evaluation toolkits.
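A minimal sketch of loading the annotations, assuming the dataset is published under the lmms-lab/Video-MME identifier and that its question annotations load through the standard datasets API; the video and audio files are distributed separately and must be obtained in addition.

```python
# Hedged sketch: load the Video-MME question annotations from HuggingFace.
from datasets import load_dataset

videomme = load_dataset("lmms-lab/Video-MME")
print(videomme)  # inspect the available splits and annotation fields
```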
Researchers can evaluate models on Video-MME using:

- The official evaluation scripts in the MME-Benchmarks/Video-MME GitHub repository
- The VLMEvalKit and LMMs-Eval toolkits, which include Video-MME as a supported task
- The publicly released dataset on HuggingFace, for custom evaluation pipelines
Output files follow a standardized JSON structure, ensuring consistent and reproducible evaluation across different models and configurations.
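As a rough illustration, one record in such an output file might contain fields like those below; the field names are hypothetical, and the official repository defines the exact schema.

```python
import json

# Hypothetical per-question output record written by an evaluation run.
record = {
    "video_id": "demo_0001",
    "duration": "long",          # short / medium / long
    "question_id": "demo_0001-2",
    "answer": "C",               # ground-truth option letter
    "response": "C",             # model's predicted option letter
}
print(json.dumps(record, indent=2))
```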
The standard evaluation prompt includes:

- The subtitle context, when evaluating with subtitles
- The question text
- The four answer options labeled A through D
- An instruction to respond with only the letter of the chosen option
This standardized format ensures fair comparison across models with different architectures and input processing pipelines.
Video-MME was created by a team of 21 researchers from multiple leading institutions, including Nanjing University, the Chinese Academy of Sciences, Peking University, and the Chinese University of Hong Kong.
The paper was first posted on arXiv in May 2024 (arXiv:2405.21075) and was subsequently accepted at CVPR 2025.
The Video-MME authors and the broader research community have identified several directions for advancing video understanding based on the benchmark's findings:

- Architectures and context-management strategies that sustain accuracy on long-form video, where all current models degrade
- Tighter integration of subtitles and audio with visual input, given the consistent gains subtitles provide
- Improved fine-grained perception, particularly entity tracking and counting across frames
- Stronger temporal reasoning built on top of the frame-level perception that image models already provide