EgoSchema is a diagnostic benchmark for evaluating very long-form video language understanding, introduced by Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik at UC Berkeley. Published at NeurIPS 2023 in the Datasets and Benchmarks track, EgoSchema consists of 5,031 human-curated multiple-choice question-answer pairs built on top of three-minute egocentric video clips drawn from the Ego4D dataset. Spanning over 250 hours of real-world first-person video footage, the benchmark covers a wide range of natural human activities and behaviors. EgoSchema was specifically designed to stress-test models on temporal reasoning over extended video durations, exposing a significant gap between current AI systems and human-level comprehension of long-form video content.
Video understanding has been a central research problem in computer vision and multimodal AI for decades. While considerable progress has been made on tasks involving short video clips (typically a few seconds long), understanding videos that span minutes or longer remains an open challenge. Most existing video question-answering (VideoQA) benchmarks at the time of EgoSchema's introduction relied on clips lasting only 5 to 30 seconds, which did not adequately test a model's ability to reason about events unfolding across longer time horizons.
The distinction matters because real-world video understanding often demands integrating information scattered across extended temporal windows. A security camera feed, a cooking tutorial, or a workplace recording all require viewers to track objects, activities, and causal relationships over minutes rather than seconds. Existing benchmarks were poorly suited to measuring this kind of reasoning.
Before EgoSchema, popular video question-answering datasets such as MSRVTT-QA, ActivityNet-QA, and TGIF-QA used relatively short clips. While these benchmarks provided useful evaluation signals for spatial recognition and short-term action understanding, they did not challenge models to perform the kind of sustained temporal reasoning needed for real-world applications. Additionally, many existing benchmarks could be partially solved through superficial cues or static frame analysis without genuinely processing the temporal dynamics of the video.
EgoSchema was designed to fill this gap by requiring models to reason over three-minute video clips, a duration that is 10x to 100x longer (in terms of required temporal reasoning) than most other video understanding datasets at the time of its release.
EgoSchema focuses on egocentric (first-person) video, recorded from the perspective of a camera wearer going about daily activities. Egocentric video introduces unique challenges compared to third-person footage. The camera moves with the wearer, causing frequent motion blur, rapid viewpoint changes, and partial occlusions. Objects enter and leave the frame unpredictably, and the visual context shifts constantly as the wearer moves through different environments.
The choice to build on Ego4D, a massive egocentric video dataset collected by a consortium led by Meta AI, gave EgoSchema access to diverse, unscripted real-world footage. Ego4D contains approximately 3,670 hours of daily-life activity video captured by 931 camera wearers across 74 worldwide locations and 9 countries, covering scenarios including household tasks, outdoor activities, workplace interactions, and social situations.
EgoSchema draws its video content from the Ego4D dataset. For each question in EgoSchema, a three-minute (180-second) clip is extracted from the larger Ego4D footage. These clips capture genuine, unscripted human behavior filmed from a first-person perspective. The diversity of Ego4D's source material ensures that EgoSchema covers a broad spectrum of activities, from cooking and cleaning to crafting, socializing, and navigating different environments.
Ego4D provides temporal narrations for its videos: timestamped natural-language descriptions of what the camera wearer is doing at various points in the footage. These narrations served as a critical input to EgoSchema's question generation pipeline.
Creating over 5,000 high-quality multiple-choice questions for long-form video is a labor-intensive task. EgoSchema employs a scalable dataset creation pipeline that combines the capabilities of large language models (LLMs) with human curation to produce challenging questions efficiently.
The pipeline proceeds through multiple stages:
Stage 1: Video and Narration Filtering. The first stage applies rule-based filtering to select suitable Ego4D RGB videos and their associated temporal narrations. Not all Ego4D clips are equally suited for question generation; the filtering process identifies clips with sufficiently rich narration coverage and diverse activity patterns that can support meaningful comprehension questions.
Stage 2: Question and Answer Generation. Using the filtered narrations as textual descriptions of the video content, the pipeline leverages LLMs to automatically generate questions, correct answers, and plausible distractor options. The LLM receives the concatenated narrations for a three-minute clip and produces multiple-choice questions that require understanding the temporal progression of events described in those narrations. Each question has five answer options (one correct answer and four distractors), making the random baseline 20%.
Stage 3: Human Curation and Verification. The automatically generated question-answer sets undergo human review and curation. Human annotators verify that questions are answerable from the video content, that the correct answer is unambiguous, and that the distractors are plausible but clearly incorrect. This human-in-the-loop step is essential for ensuring the quality and reliability of the benchmark. The "human curated" label in EgoSchema's description reflects this careful verification process.
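The generation stage (Stage 2) amounts to assembling the clip's timestamped narrations into a single prompt for the LLM. The sketch below is illustrative only: the authors' actual prompt wording is not public, and `qa_generation_prompt` is a hypothetical name.

```python
def qa_generation_prompt(narrations, n_questions=3):
    """Sketch of Stage 2: turn a clip's timestamped narrations into an
    LLM prompt requesting multiple-choice QA pairs. Illustrative only;
    the paper's actual prompt is not reproduced here."""
    # Concatenate the narrations, preserving their timestamps.
    text = "\n".join(f"[{t:.0f}s] {d}" for t, d in narrations)
    return (
        "Below are timestamped narrations of a 3-minute first-person video.\n"
        f"{text}\n\n"
        f"Write {n_questions} multiple-choice questions about the video, "
        "each with one correct answer and four plausible distractors. "
        "Questions must require understanding the temporal progression "
        "of events, not a single moment."
    )

prompt = qa_generation_prompt([(12, "C chops onions"), (95, "C stirs the pan")])
```

The requirement in the prompt that questions span the temporal progression of events, rather than a single moment, is what pushes the generated questions toward long certificate lengths.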
| Statistic | Value |
|---|---|
| Total questions | 5,031 |
| Video clip duration | 3 minutes (180 seconds) |
| Answer options per question | 5 (multiple choice) |
| Total video hours | 250+ hours |
| Random baseline accuracy | 20% |
| Human accuracy | ~76% |
| Source dataset | Ego4D |
| Public evaluation subset | 500 questions |
| Blind test set | 5,031 questions (answers withheld) |
EgoSchema is structured as a zero-shot evaluation benchmark: the full answer key for all 5,031 questions is intentionally withheld from the public to maintain evaluation integrity. A public subset of 500 questions with released answers allows researchers to perform offline development and debugging, while final performance must be evaluated through the official submission pipeline.
One of EgoSchema's most important intellectual contributions is the concept of temporal certificate sets, a formal framework for measuring the intrinsic temporal reasoning difficulty of video understanding tasks.
A common assumption in video understanding research is that longer videos are inherently more difficult to understand. However, this is not always the case. A 10-minute video might contain a single short event that answers a given question, meaning the model only needs to identify a few relevant seconds within the full clip. In such cases, the raw duration of the video is misleading as an indicator of temporal complexity.
Temporal certificate sets address this problem by asking: "What is the minimum amount of video that a model must actually process to answer a given question correctly?" Rather than measuring the total length of the video clip, temporal certificate sets measure the total duration of the specific video segments that are necessary and sufficient for answering each question.
For a given question-answer pair, the temporal certificate set is defined as the minimal collection of temporal segments from the video that provides enough information to determine the correct answer. If answering a question about a cooking video requires observing the ingredient preparation at minute 0:30, the mixing step at minute 1:15, and the plating at minute 2:45, then the temporal certificate set includes those three segments, and its total length is the sum of their durations.
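The arithmetic behind certificate length is simple set measure. A minimal sketch, assuming each relevant segment is given as a `(start, end)` pair in seconds and that overlapping segments should not be double-counted; the ~10-second segment durations in the example are invented for illustration:

```python
def certificate_length(segments):
    """Total duration (seconds) of a temporal certificate set.

    `segments` is a list of (start, end) times in seconds; overlapping
    segments are merged first so shared footage is counted once."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous segment: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return sum(end - start for start, end in merged)

# The cooking example from the text: preparation at 0:30, mixing at 1:15,
# plating at 2:45, each assumed to span ~10 seconds for illustration.
print(certificate_length([(30, 40), (75, 85), (165, 175)]))  # → 30
```

Note that the certificate length (30 s here) can be far shorter than the 180-second clip, which is exactly the gap between raw duration and intrinsic temporal difficulty that the metric is designed to expose.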
When measured using temporal certificate sets, EgoSchema demonstrates substantially higher temporal complexity than existing benchmarks:
| Metric | EgoSchema | Next Closest Dataset |
|---|---|---|
| Median temporal certificate length | ~100 seconds | ~18 seconds |
| Ratio to next closest | 5.7x longer | -- |
| Ratio to typical VideoQA datasets | 10x to 100x longer | -- |
This analysis reveals that EgoSchema questions genuinely require sustained temporal reasoning across the full duration of the three-minute clips. Unlike benchmarks where questions can be answered from a single frame or a few seconds of footage, EgoSchema demands that models integrate information from multiple moments spread throughout the video. The median temporal certificate length of approximately 100 seconds means that, on average, a model must process and integrate roughly 100 seconds of distinct video content to answer each question correctly.
EgoSchema questions are designed to test multiple cognitive faculties related to video understanding. Based on analysis of the question set, the benchmark encompasses several categories of reasoning:
Activity summarization. Many questions ask the model to identify the primary activity occurring in the video or to summarize the overall sequence of events. For example, a question might ask: "What is the primary activity that occurs multiple times in the video?" Answering this requires watching enough of the clip to identify recurring patterns.
Process understanding. Questions in this category test whether the model can follow multi-step procedures. An example might ask: "Describe the main process C performs in the video," where C refers to the camera wearer. The model must track a sequence of related actions and understand how they connect into a coherent procedure.
Goal and intent inference. Some questions require inferring the camera wearer's underlying goal or motivation from observed actions. For instance: "What is the overall objective of C's actions in the video?" This goes beyond recognizing individual actions and demands higher-level reasoning about intent.
Temporal ordering. Questions may ask about the ordering or progression of events: "What is the primary sequence of actions that C performs?" This tests whether the model can accurately track and report the temporal order of activities.
Interruption detection. Certain questions focus on deviations from the main activity: "Although the video is predominantly focused on one recurring action, there is an interruption in C's activity. Briefly describe this interruption and its significance." These questions require the model to distinguish routine actions from exceptional events.
Expertise assessment. Some questions ask the model to make judgments about the camera wearer's competence: "What can be deduced about C's level of expertise based on the video?" This requires synthesizing multiple behavioral cues across the full video to form an assessment.
Object and tool understanding. Questions about tools and objects test spatial and functional understanding: "What are the main ingredients and tools used?" The model must track which objects appear, how they are used, and their role in the overall activity.
The initial evaluation of EgoSchema in the original paper revealed striking performance gaps between AI models and humans. The benchmark was tested against several prominent video-language models available at the time.
| Model | Accuracy (%) |
|---|---|
| VIOLET | 19.9 |
| FrozenBiLM | 26.9 |
| mPLUG-Owl | 31.1 |
| InternVideo | 32.1 |
| Human performance | ~76.0 |
| Random baseline | 20.0 |
These results were sobering for the field. Even InternVideo, the best-performing model at the time with its strong video-language pretraining, achieved only 32.1% accuracy on the full test set, roughly 12 points above the random baseline of 20% and far below human performance of approximately 76%. VIOLET performed essentially at chance level (19.9%), suggesting it was unable to leverage temporal information from the long clips at all.
The paper also noted that even web-scale trained closed-source models with over 100 billion parameters achieved less than 40% accuracy, highlighting how severely current systems struggled with long-form temporal reasoning.
Several factors contributed to the poor model performance:
Context window limitations. Most video-language models at the time were designed to process short clips or a small number of sampled frames. Processing a full three-minute video at reasonable resolution exceeded the input capacity of many architectures.
Temporal abstraction failures. Models often relied on recognizing individual objects or actions from isolated frames rather than building a coherent temporal narrative. EgoSchema questions specifically require integrating information across time, which frame-level recognition cannot accomplish.
Distractor quality. The carefully crafted distractor options in EgoSchema's multiple-choice format mean that superficial pattern matching is insufficient. Distractors are designed to be plausible if the model only considers partial information, rewarding genuine comprehension over educated guessing.
Egocentric complexity. The first-person perspective introduces additional challenges such as frequent camera motion, hand-object interactions filmed at close range, and rapid scene transitions that third-person datasets do not typically present.
Since its introduction in 2023, EgoSchema has become a widely adopted benchmark for evaluating multimodal large language models (MLLMs) and video-language systems. Significant progress has been made, though the benchmark remains challenging.
A notable breakthrough came from LLoVi (Long-form Video QA with LLMs), which demonstrated that combining visual captioning models with powerful LLMs could dramatically improve performance. Rather than feeding raw video frames into a single end-to-end model, LLoVi uses a visual captioner (such as LaViLa) to generate textual descriptions of video segments, then feeds these descriptions to an LLM for reasoning.
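The caption-then-reason control flow can be sketched in a few lines. Here `caption_fn` stands in for a visual captioner (such as LaViLa) and `llm_fn` for a large language model; both are hypothetical hooks rather than real APIs, and the 4-second timestamp spacing is an assumption:

```python
def answer_question(video_segments, question, options, caption_fn, llm_fn):
    """Caption-then-reason sketch in the style of LLoVi.

    `caption_fn` and `llm_fn` are hypothetical hooks standing in for a
    visual captioner and an LLM; they are not real library APIs."""
    # 1. Caption each short video segment independently.
    captions = [caption_fn(seg) for seg in video_segments]

    # 2. Concatenate the captions into a textual record of the video
    #    (assumed 4-second spacing between segment start times).
    transcript = "\n".join(f"[{i * 4}s] {c}" for i, c in enumerate(captions))

    # 3. Ask the LLM to pick one of the five options.
    prompt = (
        f"Video captions:\n{transcript}\n\n"
        f"Question: {question}\n"
        + "\n".join(f"{j}. {opt}" for j, opt in enumerate(options))
        + "\nAnswer with the option number (0-4)."
    )
    return llm_fn(prompt)
```

The key design choice is modularity: the captioner and the LLM can be upgraded independently, which is precisely what the ablations below vary.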
| Approach | Accuracy on Full Set (%) |
|---|---|
| LLoVi (zero-shot) | 50.3 |
| LLoVi (few-shot) | 52.5 |
LLoVi's accuracy also depends strongly on the choice of visual captioner:
| Captioner | Accuracy (%) |
|---|---|
| VideoBLIP | 40.0 |
| EgoVLP | 46.6 |
| BLIP-2 | 46.7 |
| LaViLa | 51.8 |
| Oracle captions | 65.8 |
The Oracle result (65.8%) represents an upper bound when using ground-truth narrations instead of model-generated captions, indicating that caption quality is a major bottleneck.
Swapping the reasoning LLM shows a similar dependence:
| LLM | Accuracy (%) |
|---|---|
| Llama 2-7B | 34.0 |
| Llama 2-13B | 40.4 |
| Llama 2-70B | 50.6 |
| GPT-3.5 | 51.8 |
| GPT-4 | 58.3 |
These results show a clear scaling trend: larger and more capable LLMs consistently achieve higher accuracy, suggesting that reasoning capability is a key bottleneck alongside visual perception.
EgoSchema was adopted as an official challenge track within the Ego4D Challenge at CVPR 2024. The winning solution, HCQA (Hierarchical Comprehension for Question Answering), achieved 75% accuracy on the full blind test set, approaching human-level performance.
HCQA employs a three-stage pipeline:
Fine-grained Caption Generation. The system divides each 180-second video into 45 clips at 4-second intervals. LaViLa generates five captions per clip (225 total captions per video) by sampling four frames from each segment.
Context-driven Summarization. GPT-4o uses prompt learning and one-shot in-context learning to associate and aggregate the fine-grained captions into an overall summary of the video.
Inference-guided Answering. Chain-of-thought reasoning with three-shot in-context learning guides the final answer selection. A reflection mechanism prompts the model to reconsider answers when confidence scores fall below a threshold.
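The segmentation arithmetic in the first stage is easy to make concrete. A minimal sketch of computing per-clip frame indices: the 45 clips and 4 frames per clip come from the description above, while the 30 fps rate is an assumption not specified in the text.

```python
def clip_frame_indices(duration_s=180, clip_len_s=4, frames_per_clip=4, fps=30):
    """Frame indices for HCQA-style fine-grained captioning.

    Divides the video into duration_s / clip_len_s clips (45 for a
    180-second EgoSchema clip) and samples frames_per_clip frames
    evenly within each clip. The 30 fps rate is an assumption."""
    n_clips = duration_s // clip_len_s
    frames_per_seg = clip_len_s * fps          # frames in one 4 s clip
    step = frames_per_seg // frames_per_clip   # spacing between samples
    return [
        [c * frames_per_seg + k * step for k in range(frames_per_clip)]
        for c in range(n_clips)
    ]

clips = clip_frame_indices()
print(len(clips), clips[0])  # → 45 [0, 30, 60, 90]
```

With five captions per clip, this yields the 225 captions per video noted above, which the summarization stage must then compress.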
| Rank | Team | Accuracy (%) |
|---|---|---|
| 1 | HCQA | 75 |
| 2 | GMMV | 74 |
| 3 | PaMsEgoAI | 71 |
| 4 | VeryLongVQA | 69 |
| 5 | LifelongMemory (baseline) | 68 |
The 2025 edition of the challenge saw further improvements. HCQA-1.5, an updated version of the winning 2024 approach, introduced a multi-source aggregation strategy and confidence-based filtering to push accuracy to 77%. The top-ranked team, Reality Distortion, achieved 81% accuracy, surpassing human-level performance for the first time.
| Rank | Team | Accuracy (%) |
|---|---|---|
| 1 | Reality Distortion | 81 |
| 2 | L_PCIE (PCIE) | 79 |
| 3 | HCQA-1.5 (iLearn2.0) | 77 |
| 4 | ccego | 76 |
| 5 | Noah's Ark Lab | 75 |
For reference, strong general-purpose multimodal models score in a comparable range:
| Model | Accuracy (%) |
|---|---|
| Qwen2.5-VL-72B | 76 |
| GPT-4o | 72 |
| Gemini 1.5 Pro | 71 |
| GPT-4.1 | 76.1 |
Beyond the challenge tracks, various multimodal models have been evaluated on EgoSchema through independent benchmarking efforts:
| Model | Organization | Accuracy |
|---|---|---|
| Qwen2-VL-72B-Instruct | Alibaba Cloud | 77.9% |
| Qwen2.5-VL-72B-Instruct | Alibaba Cloud | 76.2% |
| Grok 3 | xAI | 74.5% |
| Grok 3 Mini | xAI | 74.3% |
| GPT-4o | OpenAI | 72.2% |
| Amazon Nova Pro | Amazon | 72.1% |
| Gemini 2.0 Pro | 71.9% | |
| Gemini 2.0 Flash | 71.5% | |
| Amazon Nova Lite | Amazon | 71.4% |
| Qwen2.5-Omni-7B | Alibaba Cloud | 68.6% |
| Gemini 2.0 Flash-Lite | 67.2% | |
| Gemini 1.0 Pro | 55.7% |
EgoSchema provides three pathways for model evaluation:
Kaggle Leaderboard (primary). Researchers submit predictions through the Kaggle competition platform ("egoschema-public"), which provides automated scoring against the hidden answer key.
Direct Validation. A validation script accepts JSON-formatted predictions (mapping each question UID to an answer index from 0 to 4) and returns accuracy metrics.
API Endpoint. Predictions can also be submitted via a POST request to the validation server endpoint.
All three methods return accuracy metrics for both the full 5,031-question benchmark and the public 500-question subset.
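The scoring logic behind all three pathways reduces to comparing a predictions object against an answer key. A minimal sketch, assuming predictions arrive as the JSON object described above (question UID to answer index 0-4) and using a local dict as a stand-in for the withheld key:

```python
import json

def score(predictions_json, answer_key):
    """Accuracy of a predictions payload against an answer key.

    `predictions_json` is a JSON string mapping question UIDs to answer
    indices 0-4; `answer_key` is a dict standing in for the withheld key.
    Missing predictions count as incorrect."""
    preds = json.loads(predictions_json)
    correct = sum(preds.get(uid) == ans for uid, ans in answer_key.items())
    return correct / len(answer_key)

print(score('{"q1": 0, "q2": 3}', {"q1": 0, "q2": 1}))  # → 0.5
```

Scoring against `len(answer_key)` rather than the number of predictions means an incomplete submission is penalized, matching how leaderboard-style evaluation typically treats missing answers.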
EgoSchema is designed as a zero-shot evaluation benchmark. There is no official training set, and models are expected to generalize from their pretraining knowledge rather than learning from EgoSchema-specific examples. This design choice ensures that benchmark performance reflects genuine video understanding capability rather than dataset-specific overfitting.
Each evaluation sample contains:
| Field | Description |
|---|---|
| question_idx | Unique question identifier |
| question | Natural language question about the video |
| video_idx | UUID mapping to the source Ego4D video |
| option | Five answer choices (A through E) |
| answer | Correct answer (withheld for blind test set) |
The dataset is available in multiple configurations on Hugging Face: GENERATION (for generative QA evaluation), MC (standard multiple choice), MC_PPL (multiple choice with perplexity scoring), and a 500-question Subset for offline development.
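A sample in the MC configuration therefore looks roughly like the record below. The field names follow the schema above, but the UID, question text (borrowed from the intent-inference example earlier), and options are invented for illustration:

```python
# Hypothetical evaluation sample; only the field names come from the
# EgoSchema schema -- all values are invented for illustration.
sample = {
    "question_idx": "q_000123",
    "question": "What is the overall objective of C's actions in the video?",
    "video_idx": "e2f9c8d4-1a2b-4c3d-9e8f-0a1b2c3d4e5f",
    "option": [
        "C is preparing a meal for guests",
        "C is cleaning the kitchen",
        "C is repairing an appliance",
        "C is organizing groceries",
        "C is baking bread",
    ],
    "answer": 0,  # withheld on the blind test set
}
```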
EgoSchema occupies a distinct position in the landscape of video understanding benchmarks. The following table compares it with other prominent datasets:
| Benchmark | Video Perspective | Typical Clip Length | Question Format | Temporal Certificate Length |
|---|---|---|---|---|
| EgoSchema | Egocentric (first-person) | 3 minutes | 5-way multiple choice | ~100 seconds |
| ActivityNet-QA | Third-person | ~3 minutes | Open-ended | ~18 seconds |
| MSRVTT-QA | Third-person | 10-30 seconds | Open-ended | Short |
| TGIF-QA | Third-person | ~3 seconds | Multiple choice/open | Very short |
| NExT-QA | Third-person | ~44 seconds | Multiple choice | Short-medium |
| Video-MME | Mixed | Variable (up to hours) | Multiple choice | Variable |
EgoSchema's unique combination of egocentric perspective, consistently long clips, and high temporal certificate lengths distinguishes it from other benchmarks. While some newer benchmarks like Video-MME and X-LeBench have extended to even longer videos, EgoSchema remains a foundational reference point for long-form video understanding evaluation.
The EgoSchema dataset is publicly available through several channels:
- Hugging Face: the lmms-lab/egoschema dataset, with 10,000+ monthly downloads
- GitHub: the code repository, which includes a benchmarking/ directory with instructions for reproducing the model results reported in the original paper
EgoSchema is released under the Ego4D license, which governs the use of the underlying video data. Users must accept the Ego4D terms of use to access the videos. The question-answer pairs and evaluation code are open-sourced to facilitate reproducibility.
Evaluating models on EgoSchema requires processing three-minute video clips, which can be computationally demanding. At standard frame rates (e.g., 30 fps), each clip contains approximately 5,400 frames. Most evaluation approaches sample a subset of frames or generate intermediate textual representations (captions) to reduce the computational burden. The choice of frame sampling strategy can significantly impact performance.
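The simplest such strategy is uniform sampling over the 5,400 frames. A minimal sketch; the 32-frame budget is an arbitrary illustrative choice, not a recommendation from the benchmark:

```python
def sample_frames(total_frames=5400, n_samples=32):
    """Uniformly sample frame indices from a 3-minute clip at 30 fps
    (180 s x 30 fps = 5400 frames). Samples are centered within equal
    bins so coverage extends to both ends of the clip."""
    step = total_frames / n_samples
    return [int(step * (i + 0.5)) for i in range(n_samples)]

idx = sample_frames()
print(len(idx), idx[0], idx[-1])  # → 32 84 5315
```

With a median certificate length near 100 seconds, too sparse a budget risks missing certificate segments entirely, which is one reason sampling strategy measurably affects accuracy.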
EgoSchema has played an important role in driving research progress on long-form video understanding. Before its introduction, the field lacked a widely accepted benchmark specifically targeting temporal reasoning over multi-minute videos. By demonstrating that state-of-the-art models performed near chance level on its questions, EgoSchema made a compelling case that long-term temporal understanding was a major unsolved problem.
The benchmark has inspired several new research directions:
Caption-then-reason pipelines. The success of approaches like LLoVi and HCQA has established a new paradigm where visual captioning models first convert video frames to text, and large language models then reason over the textual descriptions. This modular approach has proven more effective than end-to-end video-language models for long-form understanding.
Hierarchical summarization. The challenge of processing 250+ captions per video has motivated research into hierarchical summarization techniques that progressively compress visual information while preserving task-relevant details.
Confidence-based reasoning. HCQA and its successors introduced reflection and confidence-based filtering mechanisms, where the model reconsiders low-confidence answers through additional analysis.
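The reflection mechanism can be sketched as a simple two-pass control loop. `llm_fn` is a hypothetical hook returning an answer index and a self-reported confidence; the 0.7 threshold is an invented placeholder, not a value from HCQA:

```python
def answer_with_reflection(question, options, llm_fn, threshold=0.7):
    """Confidence-based reflection sketch in the spirit of HCQA.

    `llm_fn(question, options, reflect)` is a hypothetical hook that
    returns (answer_index, confidence). If the first pass reports
    confidence below `threshold`, the question is re-asked with the
    reflection flag set, prompting a step-by-step reconsideration."""
    answer, conf = llm_fn(question, options, reflect=False)
    if conf < threshold:
        # Low confidence: ask the model to reconsider its answer.
        answer, conf = llm_fn(question, options, reflect=True)
    return answer
```

The second pass costs an extra LLM call only on uncertain questions, so the mechanism trades a modest amount of compute for accuracy on the hardest items.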
As part of the annual Ego4D Challenge series (held at CVPR workshops since 2024), EgoSchema has become a standard evaluation track alongside other Ego4D tasks like episodic memory, natural language queries, and long-term action anticipation. This integration ensures that EgoSchema remains actively maintained and that research progress is regularly tracked through competitive evaluations.
The concept of temporal certificate sets introduced by EgoSchema has influenced how the research community thinks about video benchmark difficulty. Rather than simply using longer videos, benchmark designers now consider the intrinsic temporal reasoning requirements of their questions. This has led to more principled approaches to evaluating temporal understanding in newer benchmarks.
While EgoSchema represents a significant contribution, it has several acknowledged limitations:
Egocentric-only perspective. The benchmark exclusively uses first-person video, which may not generalize to third-person video understanding tasks. Models that excel on EgoSchema may still struggle with surveillance footage, sports broadcasts, or other third-person video formats.
Text-based shortcuts. The caption-then-reason paradigm that dominates EgoSchema leaderboards raises questions about whether top-performing systems truly "understand" video or merely process textual summaries of visual content. A model that never sees the actual video frames but reasons over captions may miss visual details not captured in text.
Multiple-choice format. The five-way multiple-choice format, while convenient for evaluation, is less naturalistic than open-ended question answering. Models may exploit patterns in answer options without genuinely understanding the video content.
Fixed clip duration. All clips are exactly three minutes long, which does not reflect the variable durations of real-world video understanding tasks. Some tasks may require reasoning over seconds, while others may span hours.
English-only. Both the questions and narrations are in English, limiting the benchmark's applicability to multilingual video understanding research.