# Video-MME

> Source: https://aiwiki.ai/wiki/video_mme
> Updated: 2026-06-25
> Categories: AI Benchmarks, Computer Vision, Multimodal AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

Video-MME (Video Multi-Modal Evaluation) is a benchmark for testing how well [multimodal](/wiki/multimodal_ai) [large language models](/wiki/large_language_model) (MLLMs) understand video, built from 900 manually selected videos totaling 254 hours and 2,700 expert-written multiple-choice question-answer pairs [1]. Introduced in May 2024, it was the first full-spectrum evaluation suite for video MLLMs, spanning short, medium, and long clips ranging from 11 seconds to 1 hour, across 6 visual domains and 30 subfields, with subtitles and audio supplied alongside the video frames [1]. The benchmark has since become a standard reference for reporting video understanding, cited by major releases from [OpenAI](/wiki/openai), [Google](/wiki/google_deepmind), and [Moonshot AI](/wiki/moonshot_ai) [3][5].

The paper, "Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis," was written by a team of 21 researchers led by Chaoyou Fu, drawn from Nanjing University, the University of Science and Technology of China, the Chinese Academy of Sciences, Xiamen University, Peking University, the University of Hong Kong, the Chinese University of Hong Kong, East China Normal University, and other institutions [1]. It was accepted at [CVPR](/wiki/cvpr) 2025 [2]. The authors describe Video-MME as "the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis," distinguished by "1) Diversity in video types, spanning 6 primary visual domains with 30 subfields; 2) Duration in temporal dimension, encompassing both short-, medium-, and long-term videos, ranging from 11 seconds to 1 hour; 3) Breadth in data modalities, integrating multi-modal inputs besides video frames, including subtitles and audios; 4) Quality in annotations, utilizing rigorous manual labeling by expert annotators" [1].

## What is Video-MME used for?

Video-MME is used to measure and compare the video understanding of multimodal models under controlled, reproducible conditions. Because every question is a 4-option multiple-choice item scored by exact-match accuracy, results are directly comparable across models, and because the same questions are reused with and without subtitles and audio, researchers can isolate how much each modality contributes [1]. Model developers use it as headline evidence of video and long-context capability: [OpenAI](/wiki/openai) called it an "industry standard measure of multimodal long context ability" when introducing [GPT-4.1](/wiki/gpt_4_1) [5]. Its 900 videos span 254 hours and its longest clips run a full hour, so it doubles as a stress test for long-form temporal reasoning that earlier short-clip [benchmark](/wiki/benchmark)s could not probe [1].

## Why was Video-MME created?

By 2024, image-based evaluation of multimodal models had matured. Suites such as [MME](/wiki/mme) and [MMBench](/wiki/mmbench) provided structured, multiple-choice tests of perception and reasoning over still images, and they had become standard reference points for comparing models [1]. Video understanding, however, lacked a comparably comprehensive yardstick. Existing video benchmarks such as [MVBench](/wiki/mvbench), EgoSchema, and Video-Bench primarily focused on short clips, typically under three minutes in length [1]. This narrow temporal scope failed to capture the challenges that arise when models must process and reason over longer video content, where maintaining context, tracking narrative threads, and synthesizing information across extended timelines become essential.

Video also introduces challenges absent from single images: a model must integrate information across time, track entities and events over long spans, and fuse what it sees with what is spoken or written on screen [1]. Additionally, many prior benchmarks relied on videos sourced from existing academic datasets rather than open-domain content, limiting the diversity of real-world scenarios tested. Most also evaluated only visual frame understanding, ignoring the role of other modalities like subtitles and audio in comprehensive video comprehension [1]. The authors frame the problem directly: "the predominant focus remains on developing their capabilities in static image understanding," leaving "the potential of MLLMs in processing sequential visual data still insufficiently explored" [1].

The Video-MME authors identified several specific gaps in previous work [1]:

- **Limited video duration**: The longest videos in prior benchmarks like EgoSchema reached only 180 seconds, leaving long-form video understanding entirely untested.
- **Narrow domain coverage**: Many benchmarks focused on specific video types (egocentric videos, movie clips, or instructional content) rather than spanning the full range of video content found online.
- **Single-modality evaluation**: Previous benchmarks rarely assessed how additional modalities such as subtitles and audio tracks affect model performance.
- **Automated annotation**: Several prior datasets relied on template-based or model-generated question-answer pairs, which can introduce artifacts and fail to capture nuanced video understanding.

These limitations motivated the creation of Video-MME as a "full-spectrum" evaluation tool that systematically addresses each shortcoming through careful dataset design and annotation methodology, and that lets researchers isolate the contribution of each modality [1].

## How is Video-MME structured?

### Scale and structure

Video-MME comprises 900 videos with a combined duration of 254 hours [1]. Each video is paired with exactly 3 expert-crafted questions, yielding 2,700 question-answer pairs in total. All questions follow a multiple-choice format with 4 answer options (A through D), and models are scored on accuracy [1]. The dataset also includes 744 subtitle files, with all long-duration videos accompanied by subtitles, and 900 corresponding audio files [1]. All videos and annotations were hand-selected and written from scratch by the authors rather than reused from prior datasets, a choice the team highlights as a safeguard against test-set leakage from models' training data [1].

### Video duration categories

A defining feature of Video-MME is its systematic coverage of three temporal categories, with 300 videos in each tier [1]:

| Duration Category | Time Range | Average Length | Description |
|---|---|---|---|
| Short | 11 seconds to 2 minutes | ~81 seconds | Brief clips capturing single events or actions |
| Medium | 4 to 15 minutes | ~520 seconds | Mid-length videos requiring tracking of multiple events |
| Long | 30 to 60 minutes | ~2,471 seconds | Extended recordings demanding sustained context and temporal reasoning |

Together the tiers cover lengths from 11 seconds to 1 hour, and the equal split gives each tier the same weight in the overall score [1]. The long-video tier in particular probes whether a model can reason over content far longer than the short clips common in earlier benchmarks.
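As a small illustration of the tier boundaries in the table, the sketch below maps a video length to its tier. The function name and the choice of inclusive bounds are assumptions made for this example and are not taken from the benchmark's own tooling.

```python
from typing import Optional

# Tier bounds in seconds, taken from the duration table above.
# Treating both ends as inclusive is an assumption of this sketch.
TIERS = {
    "short": (11, 2 * 60),
    "medium": (4 * 60, 15 * 60),
    "long": (30 * 60, 60 * 60),
}


def duration_tier(seconds: float) -> Optional[str]:
    """Return the tier name for a video length, or None if no tier covers it."""
    for name, (low, high) in TIERS.items():
        if low <= seconds <= high:
            return name
    return None
```

Lengths that fall between tiers (for example, a 3-minute video) return `None`, which reflects the gaps between the listed ranges.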

The concept of "certificate length" is central to Video-MME's design. A certificate is defined as the minimum set of video sub-clips necessary and sufficient to convince a human verifier that the marked annotation is correct [1]. The substantially longer certificate lengths for medium and long videos, compared to benchmarks like EgoSchema (which has a certificate length of roughly 100 seconds), confirm that Video-MME genuinely requires deep engagement with extended video content rather than relying on isolated frames [1].

### Visual domains and subfields

Video-MME spans 6 primary visual domains, further divided into 30 fine-grained subfields [1]:

| Domain | Example Subfields |
|---|---|
| Knowledge | Astronomy, technology, science education, documentaries |
| Film and Television | News reports, TV dramas, movie clips, interviews |
| Sports Competition | Football, basketball, esports, track and field |
| Artistic Performance | Magic shows, music performances, dance, theater |
| Life Record | Fashion, cooking, travel vlogs, daily activities |
| Multilingual | Videos in languages other than English, covering varied cultural contexts |

All videos were sourced from YouTube across 30 categorized tags, ensuring a wide variety of real-world content [1]. This open-domain approach contrasts with benchmarks that draw from curated academic datasets, providing a more realistic assessment of how models handle the diversity of video content encountered in practical applications.

### Multi-modal data

Beyond video frames, Video-MME integrates two additional modalities [1]:

- **Subtitles**: 744 of the 900 videos include subtitle files. All long-duration videos have subtitles, reflecting the practical reality that longer videos (documentaries, lectures, TV shows) typically include spoken dialogue or narration. The average subtitle length grows substantially with video duration: approximately 199 words for short videos, 1,426 words for medium videos, and 6,516 words for long videos [1].
- **Audio**: All 900 videos include audio tracks that may contain speech, music, sound effects, and environmental sounds relevant to understanding the video content [1].

This multi-modal design allows researchers to evaluate models under different input configurations: video frames only, video frames with subtitles, and video frames with subtitles and audio. It quantifies how much each non-visual signal contributes and allows comparison between models that can or cannot process audio [1].

## How are Video-MME questions annotated?

### Question design process

Video-MME employs a rigorous three-step annotation pipeline [1]:

1. **Video Collection**: Videos are sourced from YouTube across the 30 categorized tags, with deliberate selection across the three duration categories. The collection process prioritizes diversity in content type, visual complexity, and linguistic context.

2. **Question-Answer Annotation**: Expert annotators with strong English proficiency and extensive research experience watch each video in its entirety before designing 3 questions per video. Each question includes 4 potential answer options. Annotators are instructed to create questions that genuinely require understanding the video content rather than relying on common knowledge or visual shortcuts.

3. **Quality Review**: A multi-stage verification process ensures annotation quality. Cross-annotator verification checks for clarity and logical consistency. Questions are then filtered using [Gemini 1.5](/wiki/gemini_1_5) Pro in a question-only setting (without access to the video) to confirm that the questions cannot be answered from text alone. In this filtering step, the model achieved less than 15% accuracy, confirming that the questions truly depend on video understanding [1]. Questions that failed quality checks were returned to annotators for revision.
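The question-only check in the quality review step can be sketched roughly as follows. The record layout (`question`, `options`, `answer`) and the `answer_without_video` callable are hypothetical stand-ins for whatever text-only model interface is used; this is not the authors' filtering code.

```python
from typing import Callable, Dict, List

# Hypothetical record layout: {"question": str, "options": [str, ...], "answer": "A".."D"}
Question = Dict[str, object]
TextOnlyModel = Callable[[str, List[str]], str]


def _is_correct(q: Question, answer_without_video: TextOnlyModel) -> bool:
    """True if the text-only model picks the labeled answer without seeing the video."""
    return answer_without_video(q["question"], q["options"]) == q["answer"]


def question_only_accuracy(questions: List[Question], answer_without_video: TextOnlyModel) -> float:
    """Fraction of questions answered correctly from the text alone."""
    if not questions:
        raise ValueError("no questions to evaluate")
    return sum(_is_correct(q, answer_without_video) for q in questions) / len(questions)


def answerable_without_video(questions: List[Question], answer_without_video: TextOnlyModel) -> List[Question]:
    """Questions the text-only model gets right: candidates for annotator revision."""
    return [q for q in questions if _is_correct(q, answer_without_video)]
```

A low question-only accuracy suggests the questions depend on the video rather than on prior knowledge or wording cues.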

### Question and task types

Video-MME encompasses 12 distinct task types that span multiple cognitive levels; the table below groups representative examples into broader categories [1]:

| Task Category | Examples |
|---|---|
| Perception | Action recognition, object recognition, attribute perception |
| Temporal Understanding | Temporal perception, temporal localization, event sequencing |
| Spatial Reasoning | Spatial relationship identification, scene layout understanding |
| Information Synthesis | Information synopsis, plot summarization, cross-segment reasoning |
| Counting | Object counting, event frequency estimation |
| Complex Reasoning | Causal inference, multi-step logical deduction |

Shorter videos tend to emphasize perception-related tasks (identifying objects, actions, or attributes visible in a few frames), while longer videos predominantly feature tasks related to temporal reasoning, information synthesis, and complex multi-step reasoning that requires integrating information from across the full video [1].

### Word count statistics

The complexity of questions and answers scales with video duration [1]:

| Metric | Short Videos | Medium Videos | Long Videos |
|---|---|---|---|
| Average question word count | 11.5 | 12.2 | 14.5 |
| Average option word count | 17.2 | 20.6 | 31.0 |
| Average answer word count | 4.0 | 5.0 | 7.5 |
| Average subtitle word count | 198.6 | 1,425.6 | 6,515.6 |

The substantial increase in option and answer length for long videos reflects the greater complexity and specificity required to distinguish correct answers in extended content scenarios.

## How are models scored on Video-MME?

### Input configuration

Models are evaluated under two primary settings [1]:

- **Without subtitles (w/o subs)**: Models receive only video frames (and optionally audio) as input. This tests what the model can extract from the video itself, with no transcript text to lean on.
- **With subtitles (w/ subs)**: Models receive video frames along with corresponding subtitle text. This tests the model's ability to integrate visual and textual information.

For frame-based evaluation, the benchmark specifies that if a model extracts N frames per video, the N subtitles corresponding to those frame timestamps should be provided when evaluating with subtitles [1]. Because the same questions are reused across input settings, the modality ablations are directly comparable [1].
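The frame-aligned subtitle rule can be illustrated with a minimal sketch. It assumes uniform frame sampling at segment midpoints and subtitles already parsed into `(start_s, end_s, text)` tuples. Both are assumptions of this example, and the official scripts may sample frames and match subtitles differently.

```python
from typing import List, Tuple

# Hypothetical parsed subtitle entry: (start_s, end_s, text)
Subtitle = Tuple[float, float, str]


def uniform_timestamps(duration_s: float, n_frames: int) -> List[float]:
    """Midpoints of n equal segments, one common way to sample frames uniformly."""
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    step = duration_s / n_frames
    return [step * (i + 0.5) for i in range(n_frames)]


def subtitles_for_frames(subs: List[Subtitle], timestamps: List[float]) -> List[str]:
    """For each frame timestamp, keep the subtitle shown at that moment, or "" if none."""
    selected = []
    for t in timestamps:
        text = next((s[2] for s in subs if s[0] <= t <= s[1]), "")
        selected.append(text)
    return selected
```

With this scheme a model that samples N frames receives at most N subtitle lines, so the amount of transcript text scales with the number of frames the model actually sees.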

### Scoring

Evaluation uses straightforward accuracy: the percentage of questions for which the model selects the correct answer option. Scores are reported across multiple dimensions [1]:

- Overall accuracy (across all 2,700 questions from the 900 videos)
- Accuracy by duration category (short, medium, long)
- Accuracy by visual domain (knowledge, film and television, sports, artistic performance, life record, multilingual)
- Accuracy by task type (perception, temporal, spatial, reasoning, counting, etc.)

The evaluation pipeline uses automated scripts without reliance on third-party AI models for scoring, ensuring reproducibility and consistency [4].
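A minimal sketch of this kind of breakdown is shown below. The flat record layout with `answer`, `prediction`, and grouping fields such as `duration` is a hypothetical format chosen for the example, not the official output schema.

```python
from collections import defaultdict
from typing import Dict, List, Mapping


def overall_accuracy(records: List[Mapping[str, str]]) -> float:
    """Percentage of records whose predicted letter matches the labeled answer."""
    if not records:
        raise ValueError("no records to score")
    hits = sum(r["prediction"] == r["answer"] for r in records)
    return 100.0 * hits / len(records)


def accuracy_by(records: List[Mapping[str, str]], key: str) -> Dict[str, float]:
    """Accuracy in percent, grouped by a record field such as "duration" or "domain"."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for r in records:
        total[r[key]] += 1
        correct[r[key]] += int(r["prediction"] == r["answer"])
    return {group: 100.0 * correct[group] / total[group] for group in total}
```

Calling `accuracy_by(records, "duration")` and `accuracy_by(records, "domain")` on the same records would give the per-tier and per-domain views described in the list above.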

### Prompt format

Models receive a standardized prompt that includes the subtitle context (when applicable), the question text, and the four answer options labeled A through D. The model is asked to respond with a single letter corresponding to its chosen answer [1].
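Because models do not always reply with a bare letter, an evaluation script needs some rule for turning free text into a choice. The heuristic below is one simple possibility, offered as an assumption rather than the official parsing logic. It takes the first standalone uppercase A-D in the response.

```python
import re
from typing import Optional

# Matches an uppercase A, B, C, or D that stands alone as a word.
_LETTER = re.compile(r"\b([ABCD])\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the first standalone A-D letter in a model response, or None."""
    match = _LETTER.search(response.strip())
    return match.group(1) if match else None
```

A known limitation of this heuristic is that a response beginning with the English article "A" would be read as choice A, so a stricter rule may be preferable in practice.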

## Which models lead Video-MME?

### Original paper results (2024)

The initial Video-MME paper evaluated a range of commercial and open-source models, including the [GPT-4](/wiki/gpt_4) family ([GPT-4V](/wiki/gpt_4v) and [GPT-4o](/wiki/gpt_4o)), Gemini 1.5 Pro and Flash, and open-source video models such as [LLaVA](/wiki/llava)-NeXT-Video, VILA, and Video-LLaVA [1]. Gemini 1.5 Pro was the strongest model evaluated, reaching 75.0% overall accuracy using video frames alone, surpassing GPT-4V by 15.1 percentage points and GPT-4o by 3.1 percentage points in that setting [1]. The paper states that "Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models" [1]. Commercial models in general outperformed the open-source systems, which clustered well below the leaders [1].

**Commercial Models** [1]:

| Model | Short (w/o subs) | Short (w/ subs) | Medium (w/o subs) | Medium (w/ subs) | Long (w/o subs) | Long (w/ subs) | Overall (w/o subs) | Overall (w/ subs) |
|---|---|---|---|---|---|---|---|---|
| [Gemini](/wiki/gemini) 1.5 Pro | 81.7% | 84.5% | 74.3% | 81.0% | 67.4% | 77.4% | 75.0% | 81.3% |
| [GPT-4o](/wiki/gpt_4o) | 80.0% | 82.8% | 70.3% | 76.6% | 65.3% | 72.1% | 71.9% | 77.2% |
| Gemini 1.5 Flash | 78.8% | 79.8% | 68.8% | 74.7% | 61.1% | 68.8% | 70.3% | 75.0% |
| [GPT-4V](/wiki/gpt_4v) | 70.5% | 73.2% | 55.8% | 59.7% | 53.5% | 56.9% | 59.9% | 63.3% |

**Open-Source Video Models** [1]:

| Model | Short (w/o subs) | Short (w/ subs) | Medium (w/o subs) | Medium (w/ subs) | Long (w/o subs) | Long (w/ subs) | Overall (w/o subs) | Overall (w/ subs) |
|---|---|---|---|---|---|---|---|---|
| VILA-1.5 | 68.1% | 68.9% | 58.1% | 57.4% | 50.8% | 52.0% | 59.0% | 59.4% |
| [LLaVA](/wiki/llava)-NeXT-Video | 61.7% | 65.1% | 50.1% | 52.2% | 44.3% | 47.2% | 52.0% | 54.9% |
| VideoChat2-Mistral | 48.3% | 52.8% | 37.0% | 39.4% | 33.2% | 39.2% | 39.5% | 43.8% |
| ShareGPT4Video | 48.3% | 53.6% | 36.3% | 39.3% | 35.0% | 37.9% | 39.9% | 43.6% |
| Chat-UniVi-V1.5 | 45.7% | 51.2% | 40.3% | 44.6% | 35.8% | 41.8% | 40.6% | 45.9% |
| ST-LLM | 45.7% | 48.4% | 36.8% | 41.4% | 31.3% | 36.9% | 37.9% | 42.3% |
| Video-LLaVA | 45.3% | 46.1% | 38.0% | 40.7% | 36.2% | 38.1% | 39.9% | 41.6% |

**Image-Based Models (Applied to Video)** [1]:

| Model | Short (w/o subs) | Short (w/ subs) | Overall (w/o subs) | Overall (w/ subs) |
|---|---|---|---|---|
| InternVL-Chat-V1.5 | 60.2% | 61.7% | 50.7% | 52.4% |
| [Qwen](/wiki/qwen)-VL-Max | 55.8% | 57.6% | 51.3% | 51.2% |
| Qwen-VL-Chat | 46.9% | 47.3% | 41.1% | 41.9% |

### Updated leaderboard results (2025-2026)

Reported scores have risen substantially since the original paper, and the top leaderboard entries now sit well above the 75.0% that Gemini 1.5 Pro reached without subtitles in 2024 [3][6]. Because the benchmark separates the with-subtitles and without-subtitles settings, leaderboard entries are typically reported under both conditions [3]. As of mid-2026, models from Chinese developers lead the leaderboard, with several systems clustered near 87-88% overall accuracy [6]:

| Model | Organization | Parameters | Overall Score |
|---|---|---|---|
| Qwen3.7-Plus | [Alibaba](/wiki/alibaba_ai) | Not disclosed | 88.0% |
| MiMo-V2.5 | [Xiaomi](/wiki/xiaomi_ai) | 311B | 87.7% |
| Kimi K2.5 | [Moonshot AI](/wiki/moonshot_ai) | 1.0T | 87.4% |
| MiniMax M3 | [MiniMax](/wiki/minimax) | Not disclosed | 85.4% |
| [Gemini](/wiki/gemini) 2.5 Pro | [Google](/wiki/google_deepmind) | Not disclosed | 84.8% |
| Qwen3.6 Plus | Alibaba | Not disclosed | 84.2% |
| Gemini 1.5 Pro | Google | Not disclosed | 78.6% |
| Nova 2 Omni | [Amazon](/wiki/amazon) | Not disclosed | 77.9% |
| Gemini 1.5 Flash | Google | Not disclosed | 76.1% |
| [Qwen](/wiki/qwen)3 VL 30B A3B Instruct | Alibaba | 31B | 74.5% |
| Qwen3 VL 8B Instruct | Alibaba | 9B | 71.4% |
| Phi-4-multimodal-instruct | [Microsoft](/wiki/microsoft_ai) | 6B | 55.0% |

These leaderboard figures are largely self-reported by the submitting organizations rather than independently re-run, so cross-model comparisons should account for differences in frame sampling and input configuration [6]. OpenAI highlighted Video-MME as an "industry standard measure" of multimodal long-context ability when introducing GPT-4.1 in April 2025, reporting that GPT-4.1 scored 72.0% on the long-video, no-subtitles subset, a 6.7 percentage point improvement over GPT-4o on the same subset [5].

## What did Video-MME reveal about model capabilities?

### Performance degrades with video duration

One of Video-MME's most consistent findings is that all models, both commercial and open-source, show declining accuracy as video duration increases [1]. In the original evaluation, Gemini 1.5 Pro dropped from 81.7% on short videos to 67.4% on long videos (without subtitles), a decline of 14.3 percentage points. This pattern held across every model tested, highlighting long-form video understanding as a fundamental challenge for current architectures [1].

The performance gap between commercial and open-source models also widens with duration. Without subtitles, Gemini 1.5 Pro leads VILA-1.5, the strongest open-source model tested, by 13.6 percentage points on short videos (81.7% vs. 68.1%) and by 16.6 points on long videos (67.4% vs. 50.8%), suggesting that processing and reasoning over extended temporal contexts is where proprietary models hold their greatest advantage [1].
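The drops and gaps discussed in this section can be recomputed from the result tables. The sketch below hard-codes the without-subtitles scores for two models. The dictionary and function names are illustrative only.

```python
# Accuracy (%) without subtitles, copied from the result tables above.
SCORES = {
    "Gemini 1.5 Pro": {"short": 81.7, "medium": 74.3, "long": 67.4},
    "VILA-1.5": {"short": 68.1, "medium": 58.1, "long": 50.8},
}


def short_to_long_drop(model: str) -> float:
    """Percentage-point decline from the short tier to the long tier."""
    tiers = SCORES[model]
    return round(tiers["short"] - tiers["long"], 1)


def gap(leader: str, other: str, tier: str) -> float:
    """Percentage-point lead of one model over another on a given tier."""
    return round(SCORES[leader][tier] - SCORES[other][tier], 1)
```

Rounding to one decimal place matches the precision of the published scores and avoids floating-point noise in the differences.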

### Subtitles and audio significantly improve performance

The integration of subtitle information consistently boosted model performance across all duration categories [1]. For Gemini 1.5 Pro, subtitles improved overall accuracy by 6.3 percentage points (from 75.0% to 81.3%). The benefit was most pronounced for long videos, where subtitle information provided a 10.0 percentage point improvement [1]. Audio input provided a smaller but real additional gain, confirming that much of the information in real video is carried by speech and on-screen text rather than visuals alone [1].
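The per-tier subtitle gains quoted here follow directly from the commercial-model table. A short sketch under that assumption, with illustrative names:

```python
from typing import Dict, Tuple

# (without subtitles, with subtitles) accuracy in %, from the commercial-model table.
GEMINI_15_PRO: Dict[str, Tuple[float, float]] = {
    "short": (81.7, 84.5),
    "medium": (74.3, 81.0),
    "long": (67.4, 77.4),
    "overall": (75.0, 81.3),
}


def subtitle_gain(scores: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Percentage-point gain from adding subtitles, per tier."""
    return {
        tier: round(with_subs - without_subs, 1)
        for tier, (without_subs, with_subs) in scores.items()
    }
```

For this model the gain grows from the short tier to the long tier, which is the pattern the paragraph describes.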

This finding carries practical implications: it suggests that multi-modal integration, particularly the combination of visual and textual inputs, is essential for robust video understanding. Models that can effectively leverage subtitle or transcript information have a meaningful advantage, especially on longer content where visual information alone becomes harder to track.

The multilingual category showed the largest improvements from subtitle integration, with gains of up to 16.7 percentage points on long videos [1]. This makes sense because subtitles bridge language barriers that would otherwise make non-English video content particularly challenging for models trained primarily on English data.

### Image models perform surprisingly well

An unexpected finding was that image-based multimodal models (designed for static image understanding and applied to individual video frames) achieved performance comparable to purpose-built video models [1]. InternVL-Chat-V1.5, an image model, scored 50.7% overall without subtitles, well ahead of video-specific models such as VideoChat2-Mistral (39.5%) and close to LLaVA-NeXT-Video (52.0%).

This result led the authors to conclude that "image understanding is the foundation of video understanding," suggesting that strong visual perception on individual frames provides a solid baseline for video tasks, even without explicit temporal modeling [1]. However, the gap between image models and the best video models (and especially commercial models with longer context windows) indicates that temporal reasoning remains crucial for achieving top performance.

### Counting as a bottleneck

Across virtually all models evaluated, counting tasks (estimating the number of objects, people, or events in a video) emerged as a joint bottleneck [1]. Both commercial and open-source models struggled significantly with questions requiring precise counting, suggesting that this capability demands improvements in both visual processing (tracking individual entities across frames) and numerical reasoning.

### Performance varies by domain

Model performance is not uniform across the six visual domains. In the original evaluation, artistic performance tended to yield the highest scores (81.5% for Gemini 1.5 Pro with subtitles), while sports competition produced the lowest (77.7%) [1]. This variation reflects the different cognitive demands of each domain: sports videos require tracking fast-moving objects and understanding domain-specific rules, while artistic performances may be more amenable to visual and audio cues.

## How does Video-MME compare to other benchmarks?

Video-MME occupies a distinct position in the landscape of video understanding benchmarks [1]:

| Feature | Video-MME | MVBench | EgoSchema | Video-Bench | TempCompass |
|---|---|---|---|---|---|
| Number of videos | 900 | 4,000 (clips) | 5,000 | 7,585 | 410 |
| Maximum video duration | 1 hour | ~16 seconds | 180 seconds | ~60 seconds | ~120 seconds |
| Average video duration | ~17 minutes | ~16 seconds | 180 seconds | ~56 seconds | ~30 seconds |
| Annotation method | Manual (expert) | Automatic | Manual | Mixed | Manual |
| Subtitle support | Yes | No | No | No | No |
| Audio support | Yes | No | No | No | No |
| Number of QA pairs | 2,700 | 4,000 | 5,000 | 17,036 | 7,540 |
| Open domain | Yes | Yes | No (egocentric) | Partial | Yes |
| Duration categories | Short, medium, long | Single | Single | Single | Single |

Several contemporaneous benchmarks targeted overlapping goals, and Video-MME is often discussed in relation to them [1]:

- MVBench, from the InternVideo group, focuses on temporal reasoning through about 20 task categories, but its clips are short, averaging roughly 16 seconds, and are assembled from existing datasets.
- MLVU (Multi-task Long Video Understanding) and similar long-video suites emphasize extended content but vary in domain coverage and annotation provenance.
- EgoSchema concentrates on long-form egocentric (first-person) video, with multiple-choice questions over clips of 180 seconds.
- The image-only predecessors MME and MMBench established the multiple-choice MLLM evaluation methodology that Video-MME extends into the temporal dimension. The image-focused [MMMU](/wiki/mmmu) benchmark pursues a parallel goal for college-level multimodal reasoning over diagrams and figures.

Video-MME's primary advantages are its coverage of long-form video content, its multi-modal evaluation approach (testing with and without subtitles and audio), its expert manual annotation quality, and its systematic three-tier duration design [1]. While other benchmarks may have more questions or videos, none combine the temporal range, modal breadth, and annotation rigor of Video-MME. These properties made it one of the standard reference evaluations for video MLLMs soon after its mid-2024 release, and the "MME" naming situates it within a broader family of multimodal evaluation benchmarks that share a focus on structured, objective scoring of MLLM capabilities [2][3].

## What is Video-MME-v2?

Video-MME-v2 is a 2026 successor benchmark from much of the original team, led again by Chaoyou Fu, posted to arXiv in April 2026 (arXiv:2604.05015) under the title "Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding" [8]. It responds to leaderboard saturation on the original Video-MME, where top models had pushed past 85% overall accuracy, by raising both annotation rigor and evaluation difficulty [6][8]. The dataset was built through "a rigorously controlled human annotation pipeline, involving 12 annotators and 50 independent reviewers, backed by 3,300 human-hours and up to 5 rounds of quality assurance" [8].

Two design changes set v2 apart from the original [8]:

- **Progressive tri-level hierarchy**: questions are organized into three ascending complexity levels, from multi-point visual information aggregation, to temporal dynamics modeling, to complex multimodal reasoning, so the benchmark can show where a model's reasoning chain breaks down.
- **Group-based non-linear scoring**: instead of conventional per-question accuracy, v2 scores related questions as groups, rewarding consistency and coherent multi-step reasoning while penalizing fragmented or guess-based correctness.
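The exact v2 scoring formula is not given here, so the following is only a hypothetical illustration of what a group-based, non-linear rule could look like. Each group of related questions earns the square of its fraction correct, so partially correct groups earn less than their plain accuracy. The squared form and all names are assumptions of this sketch and should not be read as the Video-MME-v2 metric.

```python
from typing import Dict, List


def group_score(correct_flags: List[bool], exponent: float = 2.0) -> float:
    """Illustrative rule: (fraction correct) ** exponent.

    This is NOT the Video-MME-v2 formula; the exponent is an assumption chosen
    so that inconsistent groups are penalized relative to plain accuracy.
    """
    if not correct_flags:
        raise ValueError("empty group")
    return (sum(correct_flags) / len(correct_flags)) ** exponent


def benchmark_score(groups: Dict[str, List[bool]]) -> float:
    """Mean group score, expressed as a percentage."""
    if not groups:
        raise ValueError("no groups to score")
    return 100.0 * sum(group_score(flags) for flags in groups.values()) / len(groups)
```

Under this toy rule, a model that answers one group fully and another group half correctly scores below its per-question accuracy, which conveys the general idea of rewarding consistency within a group.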

Under this harder protocol the headroom is large: the paper reports that even the strongest model tested, Gemini-3-Pro, reached only 49.4 against a human-expert score of 90.7, and that errors in lower-level visual and temporal tasks propagate upward to cap higher-level reasoning [8]. The original Video-MME remains the more widely reported figure in 2026 model cards, but v2 signals where the harder frontier of video evaluation is moving.

## Industry adoption

Video-MME has achieved rapid adoption as a standard benchmark across the AI industry. Several major model releases have used Video-MME scores as primary evidence of video understanding capabilities:

- **OpenAI** cited Video-MME as an "industry standard measure of multimodal long context ability" when announcing GPT-4.1 in April 2025, highlighting the model's 72.0% score on long videos without subtitles [5].
- **Google** reported [Gemini](/wiki/gemini) 2.5 Pro's 84.8% score on Video-MME as a key benchmark result, and both Gemini 1.5 Pro and [Gemini 3](/wiki/gemini_3) Pro releases have featured Video-MME performance prominently [6].
- **Moonshot AI** showcased Kimi K2.5's top-ranking 87.4% score on Video-MME [6].
- **Alibaba** used Video-MME results to demonstrate the video understanding capabilities of the [Qwen](/wiki/qwen)3 VL model family, with later Qwen3.x systems topping the leaderboard at 88.0% [6].

The benchmark is integrated into major evaluation frameworks including VLMEvalKit and LMMs-Eval, and its dataset is publicly available through HuggingFace (LMMS-Lab), making it accessible for researchers and developers to evaluate their own models [4][7].

## Technical details

### Data availability

The Video-MME dataset, including videos, annotations, subtitles, and audio files, is publicly available through the project's HuggingFace repository under the LMMS-Lab organization [7]. Evaluation scripts are provided through the official GitHub repository (MME-Benchmarks/Video-MME), and the benchmark is compatible with popular evaluation toolkits [4].

### Evaluation tools

Researchers can evaluate models on Video-MME using [4]:

- **VLMEvalKit**: An open-source evaluation toolkit for large multi-modality models maintained by OpenCompass, which includes Video-MME as one of its supported benchmarks.
- **LMMs-Eval**: Another evaluation framework that supports Video-MME evaluation.
- **Custom scripts**: The official repository provides evaluation scripts that compute accuracy across all relevant dimensions (duration, domain, task type) without requiring third-party AI models.

Output files follow a standardized JSON structure, ensuring consistent and reproducible evaluation across different models and configurations [4].

### Prompt template

The standard evaluation prompt includes [1]:

1. Subtitle context (when evaluating with subtitles), providing the textual content corresponding to the selected video frames
2. The question text
3. Four answer options (A, B, C, D)
4. An instruction to respond with only the letter of the best answer

This standardized format ensures fair comparison across models with different architectures and input processing pipelines.
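The four components listed above can be assembled as in the sketch below. The function name and the exact wording of the subtitle header and answer instruction are illustrative assumptions. The official template's phrasing may differ.

```python
from typing import List, Optional


def build_prompt(question: str, options: List[str], subtitles: Optional[str] = None) -> str:
    """Assemble subtitle context (if any), question, four options, and the answer instruction."""
    if len(options) != 4:
        raise ValueError("expected exactly four answer options")
    parts = []
    if subtitles:
        # Illustrative wording; only included in the with-subtitles setting.
        parts.append("This video's subtitles are listed below:\n" + subtitles)
    parts.append("Question: " + question)
    parts.extend(f"{letter}. {text}" for letter, text in zip("ABCD", options))
    parts.append("Respond with only the letter (A, B, C, or D) of the best answer.")
    return "\n".join(parts)
```

Passing `subtitles=None` yields the without-subtitles prompt, so the same function covers both evaluation settings.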

## Who created Video-MME?

Video-MME was created by a team of 21 researchers from multiple leading institutions [1][2]:

- **Nanjing University** (State Key Laboratory for Novel Software Technology): Chaoyou Fu, Caifeng Shan (corresponding author)
- **Chinese Academy of Sciences** (CASIA, Institute of Automation): Mengdan Zhang, Ran He (corresponding author)
- **University of Science and Technology of China** (State Key Laboratory of Cognitive Intelligence): Yuhan Dai, Sirui Zhao, Tong Xu, Enhong Chen
- **University of Hong Kong** (HKU): Lei Li, Xing Sun
- **Peking University** (PKU): Shuhuai Ren
- **Chinese University of Hong Kong** (CUHK): Renrui Zhang, Yanwei Li
- **East China Normal University** (ECNU): Zihan Wang, Chenyu Zhou, Shaohui Lin
- **Xiamen University** and affiliated labs: Yongdong Luo, Yunhang Shen, Peixian Chen, Ke Li, Xiawu Zheng

The paper was first posted on arXiv in May 2024 (arXiv:2405.21075) and was subsequently accepted at CVPR 2025 [1][2].

## Future directions

The Video-MME authors and the broader research community have identified several directions for advancing video understanding based on the benchmark's findings [1]:

- **Long-context architectures**: The consistent performance degradation on long videos points to the need for architectural innovations in handling extended context windows. Approaches like ring attention and adaptive key-frame identification through temporal Q-Formers represent promising research directions.
- **Training data for temporal reasoning**: The gap between short and long video performance suggests that current training datasets may not adequately emphasize complex temporal reasoning scenarios. Constructing focused training data for multi-step temporal reasoning could yield significant improvements.
- **Multi-modal integration**: The substantial performance gains from subtitle integration indicate untapped potential in better fusing visual, textual, and audio modalities. Future work may explore more sophisticated approaches to cross-modal attention and information integration.
- **Counting and tracking**: The identification of counting as a universal bottleneck highlights the need for improved object tracking and numerical reasoning capabilities in video understanding models.
- **Frame selection strategies**: Current evaluation typically uses uniform frame sampling, but the certificate length analysis suggests that adaptive, content-aware frame selection could significantly improve efficiency and accuracy, particularly for long videos.

The arrival of Video-MME-v2 in 2026 takes up several of these threads directly, raising both annotation rigor and the difficulty of the reasoning required [8].

## See Also

- [Multimodal AI](/wiki/multimodal_ai)
- [Computer Vision](/wiki/computer_vision)
- [MME](/wiki/mme)
- [MMBench](/wiki/mmbench)
- [MVBench](/wiki/mvbench)
- [MMMU](/wiki/mmmu)
- [MMLU](/wiki/mmlu)
- [GPT-4o](/wiki/gpt_4o)
- [Gemini](/wiki/gemini)
- [LLaVA](/wiki/llava)

## References

1. Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., Chen, P., Li, Y., Lin, S., Zhao, S., Li, K., Xu, T., Zheng, X., Chen, E., Shan, C., He, R., & Sun, X. (2024). "Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis." arXiv:2405.21075. https://arxiv.org/abs/2405.21075
2. Fu, C. et al. (2025). "Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), pp. 24108-24118.
3. Video-MME Official Project Page and Leaderboard. https://video-mme.github.io/home_page.html
4. Video-MME GitHub Repository. MME-Benchmarks. https://github.com/MME-Benchmarks/Video-MME
5. OpenAI. (2025). "Introducing GPT-4.1 in the API." https://openai.com/index/gpt-4-1/
6. Video-MME Leaderboard. llm-stats.com. https://llm-stats.com/benchmarks/video-mme
7. Video-MME Dataset on HuggingFace (LMMS-Lab). https://huggingface.co/datasets/lmms-lab/Video-MME
8. Fu, C. et al. (2026). "Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding." arXiv:2604.05015. https://arxiv.org/abs/2604.05015