EgoSchema is a diagnostic benchmark for evaluating very long-form video language understanding, introduced by Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik at UC Berkeley. Published at NeurIPS 2023 in the Datasets and Benchmarks track, EgoSchema consists of 5,031 human-curated multiple-choice question-answer pairs built on top of three-minute egocentric video clips drawn from the Ego4D dataset. Spanning over 250 hours of real-world first-person video footage, the benchmark covers a wide range of natural human activities and behaviors. EgoSchema was specifically designed to stress-test models on temporal reasoning over extended video durations, exposing a significant gap between current AI systems and human-level comprehension of long-form video content.
Video understanding has been a central research problem in computer vision and multimodal AI for decades. While considerable progress has been made on tasks involving short video clips (typically a few seconds long), understanding videos that span minutes or longer remains an open challenge. Most existing video question-answering (VideoQA) benchmarks at the time of EgoSchema's introduction relied on clips lasting only 5 to 30 seconds, which did not adequately test a model's ability to reason about events unfolding across longer time horizons.
The distinction matters because real-world video understanding often demands integrating information scattered across extended temporal windows. A security camera feed, a cooking tutorial, or a workplace recording all require viewers to track objects, activities, and causal relationships over minutes rather than seconds. Existing benchmarks were poorly suited to measuring this kind of reasoning.
Before EgoSchema, popular video question-answering datasets such as MSRVTT-QA, ActivityNet-QA, and TGIF-QA used relatively short clips. While these benchmarks provided useful evaluation signals for spatial recognition and short-term action understanding, they did not challenge models to perform the kind of sustained temporal reasoning needed for real-world applications. Additionally, many existing benchmarks could be partially solved through superficial cues or static frame analysis without genuinely processing the temporal dynamics of the video.
EgoSchema was designed to fill this gap by requiring models to reason over three-minute video clips, a duration that is 10x to 100x longer (in terms of required temporal reasoning) than most other video understanding datasets at the time of its release.
EgoSchema focuses on egocentric (first-person) video, recorded from the perspective of a camera wearer going about daily activities. Egocentric video introduces unique challenges compared to third-person footage. The camera moves with the wearer, causing frequent motion blur, rapid viewpoint changes, and partial occlusions. Objects enter and leave the frame unpredictably, and the visual context shifts constantly as the wearer moves through different environments.
The choice to build on Ego4D, a massive egocentric video dataset collected by a consortium led by Meta AI, gave EgoSchema access to diverse, unscripted real-world footage. Ego4D contains approximately 3,670 hours of daily-life activity video captured by 931 camera wearers across 74 worldwide locations and 9 countries, covering scenarios including household tasks, outdoor activities, workplace interactions, and social situations.
EgoSchema draws its video content from the Ego4D dataset. For each question in EgoSchema, a three-minute (180-second) clip is extracted from the larger Ego4D footage. These clips capture genuine, unscripted human behavior filmed from a first-person perspective. The diversity of Ego4D's source material ensures that EgoSchema covers a broad spectrum of activities, from cooking and cleaning to crafting, socializing, and navigating different environments.
Ego4D provides temporal narrations for its videos: timestamped natural-language descriptions of what the camera wearer is doing at various points in the footage. These narrations served as a critical input to EgoSchema's question generation pipeline.
Creating over 5,000 high-quality multiple-choice questions for long-form video is a labor-intensive task. EgoSchema employs a scalable dataset creation pipeline that combines the capabilities of large language models (LLMs) with human curation to produce challenging questions efficiently.
The pipeline proceeds through multiple stages:
Stage 1: Video and Narration Filtering. The first stage applies rule-based filtering to select suitable Ego4D RGB videos and their associated temporal narrations. Not all Ego4D clips are equally suited for question generation; the filtering process identifies clips with sufficiently rich narration coverage and diverse activity patterns that can support meaningful comprehension questions.
Stage 2: Question and Answer Generation. Using the filtered narrations as textual descriptions of the video content, the pipeline leverages LLMs to automatically generate questions, correct answers, and plausible distractor options. The LLM receives the concatenated narrations for a three-minute clip and produces multiple-choice questions that require understanding the temporal progression of events described in those narrations. Each question has five answer options (one correct answer and four distractors), making the random baseline 20%.
Stage 3: Human Curation and Verification. The automatically generated question-answer sets undergo human review and curation. Human annotators verify that questions are answerable from the video content, that the correct answer is unambiguous, and that the distractors are plausible but clearly incorrect. This human-in-the-loop step is essential for ensuring the quality and reliability of the benchmark. The "human curated" label in EgoSchema's description reflects this careful verification process.
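The generation stage (Stage 2) amounts to assembling the clip's timestamped narrations into a single prompt for the LLM. The sketch below is illustrative only: the authors' actual prompt wording is not public, and `qa_generation_prompt` is a hypothetical name.

```python
def qa_generation_prompt(narrations, n_questions=3):
    """Sketch of Stage 2: turn a clip's timestamped narrations into an
    LLM prompt requesting multiple-choice QA pairs. Illustrative only;
    the paper's actual prompt is not reproduced here."""
    # Concatenate the narrations, preserving their timestamps.
    text = "\n".join(f"[{t:.0f}s] {d}" for t, d in narrations)
    return (
        "Below are timestamped narrations of a 3-minute first-person video.\n"
        f"{text}\n\n"
        f"Write {n_questions} multiple-choice questions about the video, "
        "each with one correct answer and four plausible distractors. "
        "Questions must require understanding the temporal progression "
        "of events, not a single moment."
    )

prompt = qa_generation_prompt([(12, "C chops onions"), (95, "C stirs the pan")])
```

The requirement in the prompt that questions span the temporal progression of events, rather than a single moment, is what pushes the generated questions toward long certificate lengths.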
| Statistic | Value |
|---|---|
| Total questions | 5,031 |
| Video clip duration | 3 minutes (180 seconds) |
| Answer options per question | 5 (multiple choice) |
| Total video hours | 250+ hours |
| Random baseline accuracy | 20% |
| Human accuracy | ~76% |
| Source dataset | Ego4D |
| Public evaluation subset | 500 questions |
| Blind test set | 5,031 questions (answers withheld) |
EgoSchema is structured as a zero-shot evaluation benchmark: the full answer key for all 5,031 questions is intentionally withheld from the public to maintain evaluation integrity. A public subset of 500 questions with released answers allows researchers to perform offline development and debugging, while final performance must be evaluated through the official submission pipeline.
One of EgoSchema's most important intellectual contributions is the concept of temporal certificate sets, a formal framework for measuring the intrinsic temporal reasoning difficulty of video understanding tasks.
A common assumption in video understanding research is that longer videos are inherently more difficult to understand. However, this is not always the case. A 10-minute video might contain a single short event that answers a given question, meaning the model only needs to identify a few relevant seconds within the full clip. In such cases, the raw duration of the video is misleading as an indicator of temporal complexity.
Temporal certificate sets address this problem by asking: "What is the minimum amount of video that a model must actually process to answer a given question correctly?" Rather than measuring the total length of the video clip, temporal certificate sets measure the total duration of the specific video segments that are necessary and sufficient for answering each question.
For a given question-answer pair, the temporal certificate set is defined as the minimal collection of temporal segments from the video that provides enough information to determine the correct answer. If answering a question about a cooking video requires observing the ingredient preparation at minute 0:30, the mixing step at minute 1:15, and the plating at minute 2:45, then the temporal certificate set includes those three segments, and its total length is the sum of their durations.
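The arithmetic behind certificate length is simple set measure. A minimal sketch, assuming each relevant segment is given as a `(start, end)` pair in seconds and that overlapping segments should not be double-counted; the ~10-second segment durations in the example are invented for illustration:

```python
def certificate_length(segments):
    """Total duration (seconds) of a temporal certificate set.

    `segments` is a list of (start, end) times in seconds; overlapping
    segments are merged first so shared footage is counted once."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous segment: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return sum(end - start for start, end in merged)

# The cooking example from the text: preparation at 0:30, mixing at 1:15,
# plating at 2:45, each assumed to span ~10 seconds for illustration.
print(certificate_length([(30, 40), (75, 85), (165, 175)]))  # → 30
```

Note that the certificate length (30 s here) can be far shorter than the 180-second clip, which is exactly the gap between raw duration and intrinsic temporal difficulty that the metric is designed to expose.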
When measured using temporal certificate sets, EgoSchema demonstrates substantially higher temporal complexity than existing benchmarks:
| Metric | EgoSchema | Next Closest Dataset |
|---|---|---|
| Median temporal certificate length | ~100 seconds | ~18 seconds |
| Ratio to next closest | 5.7x longer | -- |
| Ratio to typical VideoQA datasets | 10x to 100x longer | -- |
This analysis reveals that EgoSchema questions genuinely require sustained temporal reasoning across the full duration of the three-minute clips. Unlike benchmarks where questions can be answered from a single frame or a few seconds of footage, EgoSchema demands that models integrate information from multiple moments spread throughout the video. The median temporal certificate length of approximately 100 seconds means that, on average, a model must process and integrate roughly 100 seconds of distinct video content to answer each question correctly.
EgoSchema questions are designed to test multiple cognitive faculties related to video understanding. Based on analysis of the question set, the benchmark encompasses several categories of reasoning:
Activity summarization. Many questions ask the model to identify the primary activity occurring in the video or to summarize the overall sequence of events. For example, a question might ask: "What is the primary activity that occurs multiple times in the video?" Answering this requires watching enough of the clip to identify recurring patterns.
Process understanding. Questions in this category test whether the model can follow multi-step procedures. An example might ask: "Describe the main process C performs in the video," where C refers to the camera wearer. The model must track a sequence of related actions and understand how they connect into a coherent procedure.
Goal and intent inference. Some questions require inferring the camera wearer's underlying goal or motivation from observed actions. For instance: "What is the overall objective of C's actions in the video?" This goes beyond recognizing individual actions and demands higher-level reasoning about intent.
Temporal ordering. Questions may ask about the ordering or progression of events: "What is the primary sequence of actions that C performs?" This tests whether the model can accurately track and report the temporal order of activities.
Interruption detection. Certain questions focus on deviations from the main activity: "Although the video is predominantly focused on one recurring action, there is an interruption in C's activity. Briefly describe this interruption and its significance." These questions require the model to distinguish routine actions from exceptional events.
Expertise assessment. Some questions ask the model to make judgments about the camera wearer's competence: "What can be deduced about C's level of expertise based on the video?" This requires synthesizing multiple behavioral cues across the full video to form an assessment.
Object and tool understanding. Questions about tools and objects test spatial and functional understanding: "What are the main ingredients and tools used?" The model must track which objects appear, how they are used, and their role in the overall activity.
The initial evaluation of EgoSchema in the original paper revealed striking performance gaps between AI models and humans. The benchmark was tested against several prominent video-language models available at the time.
| Model | Accuracy (%) |
|---|---|
| VIOLET | 19.9 |
| FrozenBiLM | 26.9 |
| mPLUG-Owl | 31.1 |
| InternVideo | 32.1 |
| Human performance | ~76.0 |
| Random baseline | 20.0 |
These results were sobering for the field. Even InternVideo, the best-performing model at the time with its strong video-language pretraining, achieved only 32.1% accuracy on the full test set, roughly 12 points above the random baseline of 20% and far below human performance of approximately 76%. VIOLET performed essentially at chance level (19.9%), suggesting it was unable to leverage temporal information from the long clips at all.
The paper also noted that even web-scale trained closed-source models with over 100 billion parameters achieved less than 40% accuracy, highlighting how severely current systems struggled with long-form temporal reasoning.
Several factors contributed to the poor model performance:
Context window limitations. Most video-language models at the time were designed to process short clips or a small number of sampled frames. Processing a full three-minute video at reasonable resolution exceeded the input capacity of many architectures.
Temporal abstraction failures. Models often relied on recognizing individual objects or actions from isolated frames rather than building a coherent temporal narrative. EgoSchema questions specifically require integrating information across time, which frame-level recognition cannot accomplish.
Distractor quality. The carefully crafted distractor options in EgoSchema's multiple-choice format mean that superficial pattern matching is insufficient. Distractors are designed to be plausible if the model only considers partial information, rewarding genuine comprehension over educated guessing.
Egocentric complexity. The first-person perspective introduces additional challenges such as frequent camera motion, hand-object interactions filmed at close range, and rapid scene transitions that third-person datasets do not typically present.
Since its introduction in 2023, EgoSchema has become a widely adopted benchmark for evaluating multimodal large language models (MLLMs) and video-language systems. Significant progress has been made, though the benchmark remains challenging.
A notable breakthrough came from LLoVi (Long-form Video QA with LLMs), which demonstrated that combining visual captioning models with powerful LLMs could dramatically improve performance. Rather than feeding raw video frames into a single end-to-end model, LLoVi uses a visual captioner (such as LaViLa) to generate textual descriptions of video segments, then feeds these descriptions to an LLM for reasoning.
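The caption-then-reason control flow can be sketched in a few lines. Here `caption_fn` stands in for a visual captioner (such as LaViLa) and `llm_fn` for a large language model; both are hypothetical hooks rather than real APIs, and the 4-second timestamp spacing is an assumption:

```python
def answer_question(video_segments, question, options, caption_fn, llm_fn):
    """Caption-then-reason sketch in the style of LLoVi.

    `caption_fn` and `llm_fn` are hypothetical hooks standing in for a
    visual captioner and an LLM; they are not real library APIs."""
    # 1. Caption each short video segment independently.
    captions = [caption_fn(seg) for seg in video_segments]

    # 2. Concatenate the captions into a textual record of the video
    #    (assumed 4-second spacing between segment start times).
    transcript = "\n".join(f"[{i * 4}s] {c}" for i, c in enumerate(captions))

    # 3. Ask the LLM to pick one of the five options.
    prompt = (
        f"Video captions:\n{transcript}\n\n"
        f"Question: {question}\n"
        + "\n".join(f"{j}. {opt}" for j, opt in enumerate(options))
        + "\nAnswer with the option number (0-4)."
    )
    return llm_fn(prompt)
```

The key design choice is modularity: the captioner and the LLM can be upgraded independently, which is precisely what the ablations below vary.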
| Approach | Accuracy on Full Set (%) |
|---|---|
| LLoVi (zero-shot) | 50.3 |
| LLoVi (few-shot) | 52.5 |
LLoVi's accuracy also depends strongly on the choice of visual captioner:
| Captioner | Accuracy (%) |
|---|---|
| VideoBLIP | 40.0 |
| EgoVLP | 46.6 |
| BLIP-2 | 46.7 |
| LaViLa | 51.8 |
| Oracle captions | 65.8 |
The Oracle result (65.8%) represents an upper bound when using ground-truth narrations instead of model-generated captions, indicating that caption quality is a major bottleneck.
Swapping the reasoning LLM shows a similar dependence:
| LLM | Accuracy (%) |
|---|---|
| Llama 2-7B | 34.0 |
| Llama 2-13B | 40.4 |
| Llama 2-70B | 50.6 |
| GPT-3.5 | 51.8 |
| GPT-4 | 58.3 |
These results show a clear scaling trend: larger and more capable LLMs consistently achieve higher accuracy, suggesting that reasoning capability is a key bottleneck alongside visual perception.
EgoSchema was adopted as an official challenge track within the Ego4D Challenge at CVPR 2024. The winning solution, HCQA (Hierarchical Comprehension for Question Answering), achieved 75% accuracy on the full blind test set, approaching human-level performance.
HCQA employs a three-stage pipeline:
Fine-grained Caption Generation. The system divides each 180-second video into 45 clips at 4-second intervals. LaViLa generates five captions per clip (225 total captions per video) by sampling four frames from each segment.
Context-driven Summarization. GPT-4o uses prompt learning and one-shot in-context learning to associate and aggregate the fine-grained captions into an overall summary of the video.
Inference-guided Answering. Chain-of-thought reasoning with three-shot in-context learning guides the final answer selection. A reflection mechanism prompts the model to reconsider answers when confidence scores fall below a threshold.
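The segmentation arithmetic in the first stage is easy to make concrete. A minimal sketch of computing per-clip frame indices: the 45 clips and 4 frames per clip come from the description above, while the 30 fps rate is an assumption not specified in the text.

```python
def clip_frame_indices(duration_s=180, clip_len_s=4, frames_per_clip=4, fps=30):
    """Frame indices for HCQA-style fine-grained captioning.

    Divides the video into duration_s / clip_len_s clips (45 for a
    180-second EgoSchema clip) and samples frames_per_clip frames
    evenly within each clip. The 30 fps rate is an assumption."""
    n_clips = duration_s // clip_len_s
    frames_per_seg = clip_len_s * fps          # frames in one 4 s clip
    step = frames_per_seg // frames_per_clip   # spacing between samples
    return [
        [c * frames_per_seg + k * step for k in range(frames_per_clip)]
        for c in range(n_clips)
    ]

clips = clip_frame_indices()
print(len(clips), clips[0])  # → 45 [0, 30, 60, 90]
```

With five captions per clip, this yields the 225 captions per video noted above, which the summarization stage must then compress.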
| Rank | Team | Accuracy (%) |
|---|---|---|
| 1 | HCQA | 75 |
| 2 | GMMV | 74 |
| 3 | PaMsEgoAI | 71 |
| 4 | VeryLongVQA | 69 |
| 5 | LifelongMemory (baseline) | 68 |
The 2025 edition of the challenge saw further improvements. HCQA-1.5, an updated version of the winning 2024 approach, introduced a multi-source aggregation strategy and confidence-based filtering to push accuracy to 77%. The top-ranked team, Reality Distortion, achieved 81% accuracy, surpassing human-level performance for the first time.
| Rank | Team | Accuracy (%) |
|---|---|---|
| 1 | Reality Distortion | 81 |
| 2 | L_PCIE (PCIE) | 79 |
| 3 | HCQA-1.5 (iLearn2.0) | 77 |
| 4 | ccego | 76 |
| 5 | Noah's Ark Lab | 75 |
For reference, strong general-purpose multimodal models score in a comparable range:
| Model | Accuracy (%) |
|---|---|
| Qwen2.5-VL-72B | 76 |
| GPT-4o | 72 |
| Gemini 1.5 Pro | 71 |
| GPT-4.1 | 76.1 |
Beyond the challenge tracks, various multimodal models have been evaluated on EgoSchema through independent benchmarking efforts:
| Model | Organization | Accuracy |
|---|---|---|
| Qwen2-VL-72B-Instruct | Alibaba Cloud | 77.9% |
| Qwen2.5-VL-72B-Instruct | Alibaba Cloud | 76.2% |
| Grok 3 | xAI | 74.5% |
| Grok 3 Mini | xAI | 74.3% |
| GPT-4o | OpenAI | 72.2% |
| Amazon Nova Pro | Amazon | 72.1% |
| Gemini 2.0 Pro | 71.9% | |
| Gemini 2.0 Flash | 71.5% | |
| Amazon Nova Lite | Amazon | 71.4% |
| Qwen2.5-Omni-7B | Alibaba Cloud | 68.6% |
| Gemini 2.0 Flash-Lite | 67.2% | |
| Gemini 1.0 Pro | 55.7% |
EgoSchema provides three pathways for model evaluation:
Kaggle Leaderboard (primary). Researchers submit predictions through the Kaggle competition platform ("egoschema-public"), which provides automated scoring against the hidden answer key.
Direct Validation. A validation script accepts JSON-formatted predictions (mapping each question UID to an answer index from 0 to 4) and returns accuracy metrics.
API Endpoint. Predictions can also be submitted via a POST request to the validation server endpoint.
All three methods return accuracy metrics for both the full 5,031-question benchmark and the public 500-question subset.
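The scoring logic behind all three pathways reduces to comparing a predictions object against an answer key. A minimal sketch, assuming predictions arrive as the JSON object described above (question UID to answer index 0-4) and using a local dict as a stand-in for the withheld key:

```python
import json

def score(predictions_json, answer_key):
    """Accuracy of a predictions payload against an answer key.

    `predictions_json` is a JSON string mapping question UIDs to answer
    indices 0-4; `answer_key` is a dict standing in for the withheld key.
    Missing predictions count as incorrect."""
    preds = json.loads(predictions_json)
    correct = sum(preds.get(uid) == ans for uid, ans in answer_key.items())
    return correct / len(answer_key)

print(score('{"q1": 0, "q2": 3}', {"q1": 0, "q2": 1}))  # → 0.5
```

Scoring against `len(answer_key)` rather than the number of predictions means an incomplete submission is penalized, matching how leaderboard-style evaluation typically treats missing answers.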
EgoSchema is designed as a zero-shot evaluation benchmark. There is no official training set, and models are expected to generalize from their pretraining knowledge rather than learning from EgoSchema-specific examples. This design choice ensures that benchmark performance reflects genuine video understanding capability rather than dataset-specific overfitting.
Each evaluation sample contains:
| Field | Description |
|---|---|
| question_idx | Unique question identifier |
| question | Natural language question about the video |
| video_idx | UUID mapping to the source Ego4D video |
| option | Five answer choices (A through E) |
| answer | Correct answer (withheld for blind test set) |
The dataset is available in multiple configurations on Hugging Face: GENERATION (for generative QA evaluation), MC (standard multiple choice), MC_PPL (multiple choice with perplexity scoring), and a 500-question Subset for offline development.
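A sample in the MC configuration therefore looks roughly like the record below. The field names follow the schema above, but the UID, question text (borrowed from the intent-inference example earlier), and options are invented for illustration:

```python
# Hypothetical evaluation sample; only the field names come from the
# EgoSchema schema -- all values are invented for illustration.
sample = {
    "question_idx": "q_000123",
    "question": "What is the overall objective of C's actions in the video?",
    "video_idx": "e2f9c8d4-1a2b-4c3d-9e8f-0a1b2c3d4e5f",
    "option": [
        "C is preparing a meal for guests",
        "C is cleaning the kitchen",
        "C is repairing an appliance",
        "C is organizing groceries",
        "C is baking bread",
    ],
    "answer": 0,  # withheld on the blind test set
}
```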
EgoSchema occupies a distinct position in the landscape of video understanding benchmarks. The following table compares it with other prominent datasets:
| Benchmark | Video Perspective | Typical Clip Length | Question Format | Temporal Certificate Length |
|---|---|---|---|---|
| EgoSchema | Egocentric (first-person) | 3 minutes | 5-way multiple choice | ~100 seconds |
| ActivityNet-QA | Third-person | ~3 minutes | Open-ended | ~18 seconds |
| MSRVTT-QA | Third-person | 10-30 seconds | Open-ended | Short |
| TGIF-QA | Third-person | ~3 seconds | Multiple choice/open | Very short |
| NExT-QA | Third-person | ~44 seconds | Multiple choice | Short-medium |
| Video-MME | Mixed | Variable (up to hours) | Multiple choice | Variable |
EgoSchema's unique combination of egocentric perspective, consistently long clips, and high temporal certificate lengths distinguishes it from other benchmarks. While some newer benchmarks like Video-MME and X-LeBench have extended to even longer videos, EgoSchema remains a foundational reference point for long-form video understanding evaluation.
The EgoSchema dataset is publicly available through several channels:
- Hugging Face: the lmms-lab/egoschema dataset, with 10,000+ monthly downloads
- GitHub: the code repository, which includes a benchmarking/ directory with instructions for reproducing the model results reported in the original paper
EgoSchema is released under the Ego4D license, which governs the use of the underlying video data. Users must accept the Ego4D terms of use to access the videos. The question-answer pairs and evaluation code are open-sourced to facilitate reproducibility.
Evaluating models on EgoSchema requires processing three-minute video clips, which can be computationally demanding. At standard frame rates (e.g., 30 fps), each clip contains approximately 5,400 frames. Most evaluation approaches sample a subset of frames or generate intermediate textual representations (captions) to reduce the computational burden. The choice of frame sampling strategy can significantly impact performance.
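The simplest such strategy is uniform sampling over the 5,400 frames. A minimal sketch; the 32-frame budget is an arbitrary illustrative choice, not a recommendation from the benchmark:

```python
def sample_frames(total_frames=5400, n_samples=32):
    """Uniformly sample frame indices from a 3-minute clip at 30 fps
    (180 s x 30 fps = 5400 frames). Samples are centered within equal
    bins so coverage extends to both ends of the clip."""
    step = total_frames / n_samples
    return [int(step * (i + 0.5)) for i in range(n_samples)]

idx = sample_frames()
print(len(idx), idx[0], idx[-1])  # → 32 84 5315
```

With a median certificate length near 100 seconds, too sparse a budget risks missing certificate segments entirely, which is one reason sampling strategy measurably affects accuracy.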
EgoSchema has played an important role in driving research progress on long-form video understanding. Before its introduction, the field lacked a widely accepted benchmark specifically targeting temporal reasoning over multi-minute videos. By demonstrating that state-of-the-art models performed near chance level on its questions, EgoSchema made a compelling case that long-term temporal understanding was a major unsolved problem.
The benchmark has inspired several new research directions:
Caption-then-reason pipelines. The success of approaches like LLoVi and HCQA has established a new paradigm where visual captioning models first convert video frames to text, and large language models then reason over the textual descriptions. This modular approach has proven more effective than end-to-end video-language models for long-form understanding.
Hierarchical summarization. The challenge of processing 250+ captions per video has motivated research into hierarchical summarization techniques that progressively compress visual information while preserving task-relevant details.
Confidence-based reasoning. HCQA and its successors introduced reflection and confidence-based filtering mechanisms, where the model reconsiders low-confidence answers through additional analysis.
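The reflection mechanism can be sketched as a simple two-pass control loop. `llm_fn` is a hypothetical hook returning an answer index and a self-reported confidence; the 0.7 threshold is an invented placeholder, not a value from HCQA:

```python
def answer_with_reflection(question, options, llm_fn, threshold=0.7):
    """Confidence-based reflection sketch in the spirit of HCQA.

    `llm_fn(question, options, reflect)` is a hypothetical hook that
    returns (answer_index, confidence). If the first pass reports
    confidence below `threshold`, the question is re-asked with the
    reflection flag set, prompting a step-by-step reconsideration."""
    answer, conf = llm_fn(question, options, reflect=False)
    if conf < threshold:
        # Low confidence: ask the model to reconsider its answer.
        answer, conf = llm_fn(question, options, reflect=True)
    return answer
```

The second pass costs an extra LLM call only on uncertain questions, so the mechanism trades a modest amount of compute for accuracy on the hardest items.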
As part of the annual Ego4D Challenge series (held at CVPR workshops since 2024), EgoSchema has become a standard evaluation track alongside other Ego4D tasks like episodic memory, natural language queries, and long-term action anticipation. This integration ensures that EgoSchema remains actively maintained and that research progress is regularly tracked through competitive evaluations.
The concept of temporal certificate sets introduced by EgoSchema has influenced how the research community thinks about video benchmark difficulty. Rather than simply using longer videos, benchmark designers now consider the intrinsic temporal reasoning requirements of their questions. This has led to more principled approaches to evaluating temporal understanding in newer benchmarks.
While EgoSchema represents a significant contribution, it has several acknowledged limitations:
Egocentric-only perspective. The benchmark exclusively uses first-person video, which may not generalize to third-person video understanding tasks. Models that excel on EgoSchema may still struggle with surveillance footage, sports broadcasts, or other third-person video formats.
Text-based shortcuts. The caption-then-reason paradigm that dominates EgoSchema leaderboards raises questions about whether top-performing systems truly "understand" video or merely process textual summaries of visual content. A model that never sees the actual video frames but reasons over captions may miss visual details not captured in text.
Multiple-choice format. The five-way multiple-choice format, while convenient for evaluation, is less naturalistic than open-ended question answering. Models may exploit patterns in answer options without genuinely understanding the video content.
Fixed clip duration. All clips are exactly three minutes long, which does not reflect the variable durations of real-world video understanding tasks. Some tasks may require reasoning over seconds, while others may span hours.
English-only. Both the questions and narrations are in English, limiting the benchmark's applicability to multilingual video understanding research.