Video-MME (Video Multi-Modal Evaluation) is the first comprehensive, full-spectrum benchmark designed to evaluate multimodal large language models (MLLMs) on video analysis tasks. Introduced in a 2024 paper by Chaoyou Fu and collaborators from Nanjing University, the Chinese Academy of Sciences, Peking University, the Chinese University of Hong Kong, and several other institutions, Video-MME addresses a critical gap in the evaluation landscape by testing models across diverse video types, temporal durations, and data modalities. The benchmark was accepted at CVPR 2025 and has since become an industry-standard measure for assessing multimodal video understanding capabilities.
Video-MME consists of 900 manually selected videos spanning 254 total hours, paired with 2,700 expert-annotated question-answer pairs. Its design emphasizes four distinguishing features: diversity in video types across 6 primary visual domains and 30 subfields, temporal breadth from 11-second clips to hour-long recordings, multi-modal input integration (video frames, subtitles, and audio), and rigorous human annotation quality. Major AI companies including OpenAI, Google, and Moonshot AI have adopted Video-MME as a key benchmark for reporting the video understanding performance of their flagship models.
Before Video-MME, the evaluation landscape for video-capable multimodal models suffered from several limitations. Existing benchmarks such as MVBench, EgoSchema, and Video-Bench primarily focused on short video clips, typically under three minutes in length. This narrow temporal scope failed to capture the challenges that arise when models must process and reason over longer video content, where maintaining context, tracking narrative threads, and synthesizing information across extended timelines become essential.
Additionally, many prior benchmarks relied on videos sourced from existing academic datasets rather than open-domain content, limiting the diversity of real-world scenarios tested. Most also evaluated only visual frame understanding, ignoring the role of other modalities like subtitles and audio in comprehensive video comprehension.
The Video-MME authors identified several specific gaps in previous work:

- Limited temporal coverage, with most benchmarks restricted to clips of a few minutes or less
- Reliance on videos drawn from existing academic datasets rather than open-domain content, narrowing the range of real-world scenarios tested
- Evaluation of visual frames alone, without subtitles or audio
- Annotation pipelines that leaned on automatic generation rather than expert human labeling
These limitations motivated the creation of Video-MME as a "full-spectrum" evaluation tool that systematically addresses each shortcoming through careful dataset design and annotation methodology.
Video-MME comprises 900 videos with a combined duration of 254 hours. Each video is paired with exactly 3 expert-crafted questions, yielding 2,700 question-answer pairs in total. All questions follow a multiple-choice format with 4 answer options (A through D). The dataset also includes 744 subtitle files (every long-duration video is accompanied by subtitles) and 900 corresponding audio files.
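For concreteness, one benchmark entry can be pictured as a video paired with three four-option questions. The sketch below uses hypothetical field names chosen to mirror the statistics above; it is not the dataset's actual schema.

```python
# Illustrative sketch of one Video-MME entry (hypothetical field names).
example_entry = {
    "video_id": "demo_0001",
    "duration_category": "medium",   # short / medium / long
    "domain": "Knowledge",
    "has_subtitles": True,
    "questions": [
        {
            "question": "What experiment does the presenter perform first?",
            "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
            "answer": "B",
        },
        # ... two more questions, for a total of 3 per video
    ],
}
```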
A defining feature of Video-MME is its systematic coverage of three temporal categories:
| Duration Category | Time Range | Description |
|---|---|---|
| Short | Less than 2 minutes | Brief clips capturing single events or actions, with certificate lengths averaging around 26 seconds per question |
| Medium | 4 to 15 minutes | Mid-length videos requiring tracking of multiple events, with median certificate lengths around 167 seconds |
| Long | 30 to 60 minutes | Extended recordings demanding sustained context and temporal reasoning, with median certificate lengths of approximately 891 seconds |
The overall duration range spans from 11 seconds to 1 hour. The average video duration across the entire dataset is approximately 1,018 seconds (about 17 minutes). Each duration category contains 300 videos, ensuring balanced representation.
The concept of "certificate length" is central to Video-MME's design. A certificate is defined as the minimum set of video sub-clips necessary and sufficient to convince a human verifier that the marked annotation is correct. The substantially longer certificate lengths for medium and long videos, compared to benchmarks like EgoSchema (which has a certificate length of roughly 100 seconds), confirm that Video-MME genuinely requires deep engagement with extended video content rather than relying on isolated frames.
Video-MME spans 6 primary visual domains, further divided into 30 fine-grained subfields:
| Domain | Example Subfields |
|---|---|
| Knowledge | Astronomy, technology, science education, documentaries |
| Film and Television | News reports, TV dramas, movie clips, interviews |
| Sports Competition | Football, basketball, esports, track and field |
| Artistic Performance | Magic shows, music performances, dance, theater |
| Life Record | Fashion, cooking, travel vlogs, daily activities |
| Multilingual | Videos in languages other than English, covering varied cultural contexts |
All videos were sourced from YouTube across 30 categorized tags, ensuring a wide variety of real-world content. This open-domain approach contrasts with benchmarks that draw from curated academic datasets, providing a more realistic assessment of how models handle the diversity of video content encountered in practical applications.
Beyond video frames, Video-MME integrates two additional modalities:

- Subtitles: 744 subtitle files accompany the videos, letting models draw on dialogue and narration text alongside visual content
- Audio: 900 audio files corresponding to the videos enable evaluation of models that can process the audio track directly
This multi-modal design allows researchers to evaluate models under different input configurations: video frames only, video frames with subtitles, and video frames with subtitles and audio.
Video-MME employs a rigorous three-step annotation pipeline:
Video Collection: Videos are sourced from YouTube across the 30 categorized tags, with deliberate selection across the three duration categories. The collection process prioritizes diversity in content type, visual complexity, and linguistic context.
Question-Answer Annotation: Expert annotators with strong English proficiency and extensive research experience watch each video in its entirety before designing 3 questions per video. Each question includes 4 potential answer options. Annotators are instructed to create questions that genuinely require understanding the video content rather than relying on common knowledge or visual shortcuts.
Quality Review: A multi-stage verification process ensures annotation quality. Cross-annotator verification checks for clarity and logical consistency. Questions are then filtered using Gemini 1.5 Pro in a question-only setting (without access to the video) to confirm that the questions cannot be answered from text alone. In this filtering step, the model achieved less than 15% accuracy, confirming that the questions truly depend on video understanding. Questions that failed quality checks were returned to annotators for revision.
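The question-only filtering step can be sketched as follows; `ask_model` is a placeholder for whichever language-model API is used, and the logic is an illustrative reconstruction of the procedure described above, not the authors' actual script.

```python
def filter_text_only_answerable(qa_pairs, ask_model):
    """Flag questions a model can answer correctly without seeing the video.

    qa_pairs: list of dicts with 'question', 'options' (mapping 'A'-'D' to text),
    and 'answer' (the gold option letter).
    ask_model: callable that takes a prompt string and returns the model's reply.
    Returns the questions answered correctly from text alone, which would be
    sent back to annotators for revision.
    """
    flagged = []
    for qa in qa_pairs:
        # Present only the question and options, with no video context.
        prompt = qa["question"] + "\n" + "\n".join(
            f"{letter}. {text}" for letter, text in sorted(qa["options"].items())
        )
        guess = ask_model(prompt).strip().upper()[:1]
        if guess == qa["answer"]:
            flagged.append(qa)
    return flagged
```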
Video-MME encompasses 12 distinct task types that span multiple cognitive levels:
| Task Category | Examples |
|---|---|
| Perception | Action recognition, object recognition, attribute perception |
| Temporal Understanding | Temporal perception, temporal localization, event sequencing |
| Spatial Reasoning | Spatial relationship identification, scene layout understanding |
| Information Synthesis | Information synopsis, plot summarization, cross-segment reasoning |
| Counting | Object counting, event frequency estimation |
| Complex Reasoning | Causal inference, multi-step logical deduction |
Shorter videos tend to emphasize perception-related tasks (identifying objects, actions, or attributes visible in a few frames), while longer videos predominantly feature tasks related to temporal reasoning, information synthesis, and complex multi-step reasoning that requires integrating information from across the full video.
The complexity of questions and answers scales with video duration:
| Metric | Short Videos | Medium Videos | Long Videos |
|---|---|---|---|
| Average question word count | 11.5 | 12.2 | 14.5 |
| Average option word count | 17.2 | 20.6 | 31.0 |
| Average answer word count | 4.0 | 5.0 | 7.5 |
| Average subtitle word count | 198.6 | 1,425.6 | 6,515.6 |
The substantial increase in option and answer length for long videos reflects the greater complexity and specificity required to distinguish correct answers in extended content scenarios.
Models are evaluated under two primary settings:

- Without subtitles: the model receives only sampled video frames
- With subtitles: the model additionally receives the subtitle text associated with the sampled frames
For frame-based evaluation, the benchmark specifies that if a model extracts N frames per video, the N subtitles corresponding to those frame timestamps should be provided when evaluating with subtitles.
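The following sketch illustrates this frame-subtitle alignment rule, assuming subtitles are available as (start, end, text) tuples and frames are sampled uniformly; it is an illustration of the rule rather than the benchmark's official preprocessing code.

```python
def subtitles_for_frames(subtitles, video_duration, num_frames):
    """Return the subtitle line (if any) covering each sampled frame timestamp.

    subtitles: list of (start_sec, end_sec, text) tuples.
    video_duration: total video length in seconds.
    num_frames: number of frames the model extracts.
    """
    # Uniformly spaced timestamps, one per sampled frame.
    timestamps = [(i + 0.5) * video_duration / num_frames for i in range(num_frames)]
    selected = []
    for t in timestamps:
        # Take the first subtitle whose time span covers this frame, else empty.
        line = next((text for start, end, text in subtitles if start <= t <= end), "")
        selected.append(line)
    return selected
```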
Evaluation uses straightforward accuracy: the percentage of questions for which the model selects the correct answer option. Scores are reported across multiple dimensions:

- By duration category (short, medium, long) and overall across all 2,700 questions
- With and without subtitle input
- By visual domain and task type for finer-grained analysis
The evaluation pipeline uses automated scripts without reliance on third-party AI models for scoring, ensuring reproducibility and consistency.
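Scoring therefore amounts to counting exact matches on the predicted option letter. The sketch below groups accuracy by duration category; the record field names are illustrative rather than the official script's format.

```python
from collections import defaultdict

def accuracy_by_duration(records):
    """Compute accuracy per duration category and overall.

    records: iterable of dicts with 'duration' ('short'/'medium'/'long'),
    'answer' (gold option letter), and 'prediction' (model's option letter).
    """
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        hit = r["prediction"].strip().upper() == r["answer"].strip().upper()
        for key in (r["duration"], "overall"):
            total[key] += 1
            if hit:
                correct[key] += 1
    return {key: correct[key] / total[key] for key in total}
```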
Models receive a standardized prompt that includes the subtitle context (when applicable), the question text, and the four answer options labeled A through D. The model is asked to respond with a single letter corresponding to its chosen answer.
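A minimal sketch of how such a prompt could be assembled is shown below; the benchmark's exact wording may differ, so the template is illustrative only.

```python
def build_prompt(question, options, subtitles=None):
    """Assemble a multiple-choice prompt; options maps 'A'-'D' to option text."""
    parts = []
    if subtitles:
        # Subtitle context is included only in the with-subtitles setting.
        parts.append("Subtitles:\n" + "\n".join(subtitles))
    parts.append("Question: " + question)
    parts.extend(f"{letter}. {text}" for letter, text in sorted(options.items()))
    parts.append("Answer with only the letter (A, B, C, or D) of the correct option.")
    return "\n".join(parts)
```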
The initial Video-MME paper evaluated a range of commercial and open-source models. The results revealed a significant performance hierarchy and several notable patterns.
Commercial Models:
| Model | Short (w/o subs) | Short (w/ subs) | Medium (w/o subs) | Medium (w/ subs) | Long (w/o subs) | Long (w/ subs) | Overall (w/o subs) | Overall (w/ subs) |
|---|---|---|---|---|---|---|---|---|
| Gemini 1.5 Pro | 81.7% | 84.5% | 74.3% | 81.0% | 67.4% | 77.4% | 75.0% | 81.3% |
| GPT-4o | 80.0% | 82.8% | 70.3% | 76.6% | 65.3% | 72.1% | 71.9% | 77.2% |
| Gemini 1.5 Flash | 78.8% | 79.8% | 68.8% | 74.7% | 61.1% | 68.8% | 70.3% | 75.0% |
| GPT-4V | 70.5% | 73.2% | 55.8% | 59.7% | 53.5% | 56.9% | 59.9% | 63.3% |
Open-Source Video Models:
| Model | Short (w/o subs) | Short (w/ subs) | Medium (w/o subs) | Medium (w/ subs) | Long (w/o subs) | Long (w/ subs) | Overall (w/o subs) | Overall (w/ subs) |
|---|---|---|---|---|---|---|---|---|
| VILA-1.5 | 68.1% | 68.9% | 58.1% | 57.4% | 50.8% | 52.0% | 59.0% | 59.4% |
| LLaVA-NeXT-Video | 61.7% | 65.1% | 50.1% | 52.2% | 44.3% | 47.2% | 52.0% | 54.9% |
| VideoChat2-Mistral | 48.3% | 52.8% | 37.0% | 39.4% | 33.2% | 39.2% | 39.5% | 43.8% |
| ShareGPT4Video | 48.3% | 53.6% | 36.3% | 39.3% | 35.0% | 37.9% | 39.9% | 43.6% |
| Chat-UniVi-V1.5 | 45.7% | 51.2% | 40.3% | 44.6% | 35.8% | 41.8% | 40.6% | 45.9% |
| ST-LLM | 45.7% | 48.4% | 36.8% | 41.4% | 31.3% | 36.9% | 37.9% | 42.3% |
| Video-LLaVA | 45.3% | 46.1% | 38.0% | 40.7% | 36.2% | 38.1% | 39.9% | 41.6% |
Image-Based Models (Applied to Video):
| Model | Short (w/o subs) | Short (w/ subs) | Overall (w/o subs) | Overall (w/ subs) |
|---|---|---|---|---|
| InternVL-Chat-V1.5 | 60.2% | 61.7% | 50.7% | 52.4% |
| Qwen-VL-Max | 55.8% | 57.6% | 51.3% | 51.2% |
| Qwen-VL-Chat | 46.9% | 47.3% | 41.1% | 41.9% |
As models have improved rapidly, the Video-MME leaderboard has continued to evolve. Notable recent scores include:
| Model | Organization | Parameters | Overall Score |
|---|---|---|---|
| Kimi K2.5 | Moonshot AI | 1.0T | 87.4% |
| Gemini 2.5 Pro | Google | Not disclosed | 84.8% |
| video-SALMONN 2+ | Tsinghua / ByteDance | 72B | 81.6% (w/ subs) |
| Gemini 1.5 Pro | Google | Not disclosed | 81.3% (w/ subs) |
| Gemini 1.5 Flash | Google | Not disclosed | 76.1% |
| Qwen3 VL 30B A3B Instruct | Alibaba | 31B | 74.5% |
| Qwen3 VL 30B A3B Thinking | Alibaba | 31B | 73.3% |
| Qwen3 VL 8B Thinking | Alibaba | 9B | 71.8% |
| Qwen3 VL 8B Instruct | Alibaba | 9B | 71.4% |
| GPT-4.1 | OpenAI | Not disclosed | 72.0% (long, w/o subs) |
| Gemini 1.5 Flash 8B | Google | 8B | 66.2% |
| Phi-4-multimodal-instruct | Microsoft | 6B | 55.0% |
OpenAI specifically highlighted Video-MME as an "industry standard measure" of multimodal long-context ability when introducing GPT-4.1 in April 2025, reporting that GPT-4.1 scored 72.0% on the long video, no subtitles category, a 6.7 percentage point improvement over GPT-4o's performance on the same subset.
One of Video-MME's most consistent findings is that all models, both commercial and open-source, show declining accuracy as video duration increases. In the original evaluation, Gemini 1.5 Pro dropped from 81.7% on short videos to 67.4% on long videos (without subtitles), a decline of 14.3 percentage points. This pattern held across every model tested, highlighting long-form video understanding as a fundamental challenge for current architectures.
The performance gap between commercial and open-source models also widens with duration. While the gap on short videos is about 14 percentage points (81.7% for Gemini 1.5 Pro versus 68.1% for VILA-1.5, without subtitles), it grows to roughly 17 points on long videos (67.4% versus 50.8%), suggesting that processing and reasoning over extended temporal contexts is where proprietary models hold their greatest advantage.
The integration of subtitle information consistently boosted model performance across all duration categories. For Gemini 1.5 Pro, subtitles improved overall accuracy by 6.3 percentage points (from 75.0% to 81.3%). The benefit was most pronounced for long videos, where subtitle information provided a 10.0 percentage point improvement.
This finding carries practical implications: it suggests that multi-modal integration, particularly the combination of visual and textual inputs, is essential for robust video understanding. Models that can effectively leverage subtitle or transcript information have a meaningful advantage, especially on longer content where visual information alone becomes harder to track.
The multilingual category showed the largest improvements from subtitle integration, with gains of up to 16.7 percentage points on long videos. This makes sense because subtitles bridge language barriers that would otherwise make non-English video content particularly challenging for models trained primarily on English data.
An unexpected finding was that image-based multimodal models (designed for static image understanding and applied to individual video frames) achieved performance comparable to purpose-built video models. InternVL-Chat-V1.5, an image model, scored 50.7% overall without subtitles, outperforming several video-specific models such as VideoChat2-Mistral (39.5%) and Video-LLaVA (39.9%), though still trailing the strongest video models.
This result led the authors to conclude that "image understanding is the foundation of video understanding," suggesting that strong visual perception on individual frames provides a solid baseline for video tasks, even without explicit temporal modeling. However, the gap between image models and the best video models (and especially commercial models with longer context windows) indicates that temporal reasoning remains crucial for achieving top performance.
Across virtually all models evaluated, counting tasks (estimating the number of objects, people, or events in a video) emerged as a common bottleneck. Both commercial and open-source models struggled significantly with questions requiring precise counting, suggesting that this capability demands improvements in both visual processing (tracking individual entities across frames) and numerical reasoning.
Model performance is not uniform across the six visual domains. In the original evaluation, artistic performance tended to yield the highest scores (81.5% for Gemini 1.5 Pro with subtitles), while sports competition produced the lowest (77.7%). This variation reflects the different cognitive demands of each domain: sports videos require tracking fast-moving objects and understanding domain-specific rules, while artistic performances may be more amenable to visual and audio cues.
Video-MME occupies a distinct position in the landscape of video understanding benchmarks:
| Feature | Video-MME | MVBench | EgoSchema | Video-Bench | TempCompass |
|---|---|---|---|---|---|
| Number of videos | 900 | 4,000 (clips) | 5,000 | 7,585 | 410 |
| Maximum video duration | 1 hour | ~16 seconds | 180 seconds | ~60 seconds | ~120 seconds |
| Average video duration | ~17 minutes | ~16 seconds | 180 seconds | ~56 seconds | ~30 seconds |
| Annotation method | Manual (expert) | Automatic | Manual | Mixed | Manual |
| Subtitle support | Yes | No | No | No | No |
| Audio support | Yes | No | No | No | No |
| Number of QA pairs | 2,700 | 4,000 | 5,000 | 17,036 | 7,540 |
| Open domain | Yes | Yes | No (egocentric) | Partial | Yes |
| Duration categories | Short, medium, long | Single | Single | Single | Single |
Video-MME's primary advantages are its coverage of long-form video content, its multi-modal evaluation approach (testing with and without subtitles and audio), its expert manual annotation quality, and its systematic three-tier duration design. While other benchmarks may have more questions or videos, none combine the temporal range, modal breadth, and annotation rigor of Video-MME.
Video-MME has achieved rapid adoption as a standard benchmark across the AI industry. Several major model releases have used Video-MME scores as primary evidence of video understanding capabilities:

- OpenAI cited Video-MME results when introducing GPT-4.1
- Google has reported Video-MME scores for the Gemini 1.5 and 2.5 model families
- Moonshot AI reports Video-MME performance for its Kimi models
- Alibaba reports Video-MME results for the Qwen VL series
The benchmark is integrated into major evaluation frameworks including VLMEvalKit and LMMs-Eval, and its dataset is publicly available through HuggingFace (LMMS-Lab), making it accessible for researchers and developers to evaluate their own models.
The Video-MME dataset, including videos, annotations, subtitles, and audio files, is publicly available through the project's HuggingFace repository under the LMMS-Lab organization. Evaluation scripts are provided through the official GitHub repository (MME-Benchmarks/Video-MME), and the benchmark is compatible with popular evaluation toolkits.
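A minimal sketch of loading the annotations, assuming the dataset is published under the lmms-lab/Video-MME identifier and that its question annotations load through the standard datasets API; the video and audio files are distributed separately and must be obtained in addition.

```python
# Hedged sketch: load the Video-MME question annotations from HuggingFace.
from datasets import load_dataset

videomme = load_dataset("lmms-lab/Video-MME")
print(videomme)  # inspect the available splits and annotation fields
```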
Researchers can evaluate models on Video-MME using:

- The official evaluation scripts in the MME-Benchmarks/Video-MME GitHub repository
- The VLMEvalKit and LMMs-Eval toolkits, which include Video-MME as a supported task
- The publicly released dataset on HuggingFace, for custom evaluation pipelines
Output files follow a standardized JSON structure, ensuring consistent and reproducible evaluation across different models and configurations.
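As a rough illustration, one record in such an output file might contain fields like those below; the field names are hypothetical, and the official repository defines the exact schema.

```python
import json

# Hypothetical per-question output record written by an evaluation run.
record = {
    "video_id": "demo_0001",
    "duration": "long",          # short / medium / long
    "question_id": "demo_0001-2",
    "answer": "C",               # ground-truth option letter
    "response": "C",             # model's predicted option letter
}
print(json.dumps(record, indent=2))
```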
The standard evaluation prompt includes:

- The subtitle context, when evaluating with subtitles
- The question text
- The four answer options labeled A through D
- An instruction to respond with only the letter of the chosen option
This standardized format ensures fair comparison across models with different architectures and input processing pipelines.
Video-MME was created by a team of 21 researchers from multiple leading institutions, including Nanjing University, the Chinese Academy of Sciences, Peking University, and the Chinese University of Hong Kong.
The paper was first posted on arXiv in May 2024 (arXiv:2405.21075) and was subsequently accepted at CVPR 2025.
The Video-MME authors and the broader research community have identified several directions for advancing video understanding based on the benchmark's findings:

- Architectures and context-management strategies that sustain accuracy on long-form video, where all current models degrade
- Tighter integration of subtitles and audio with visual input, given the consistent gains subtitles provide
- Improved fine-grained perception, particularly entity tracking and counting across frames
- Stronger temporal reasoning built on top of the frame-level perception that image models already provide