Ego4D
Last reviewed
Jun 3, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 1,577 words
Ego4D is a large-scale egocentric (first-person) video dataset and benchmark suite for computer vision, assembled by Meta AI (then Facebook AI Research) together with a consortium of universities and research labs around the world. It contains 3,670 hours of unscripted daily-life video recorded by 931 camera wearers in 9 countries, paired with dense text narrations and a set of five benchmark tasks for understanding what a person sees and does from their own point of view.[1][2] The project was announced in October 2021, made available to researchers later that year, and described in the paper "Ego4D: Around the World in 3,000 Hours of Egocentric Video," published at the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).[1][3] The lead author is Kristen Grauman of Meta AI and the University of Texas at Austin, with roughly 80 co-authors.[3]
The name reflects the project's framing: egocentric video understood across four dimensions ("4D"), meaning the spatial scene plus time. The motivation, as Grauman put it in the launch announcement, was that next-generation AI "will need to learn from an entirely different kind of data, videos that show the world from the center of the action, rather than the sidelines."[2] Most large video datasets at the time consisted of third-person footage from television, film, or web clips, which is poorly matched to the wearable cameras and augmented-reality devices that motivate first-person perception.
Egocentric video had been studied for years through smaller datasets such as EPIC-Kitchens, EGTEA Gaze+, and Charades-Ego, but these were limited in hours, geography, or scenario diversity. Ego4D was conceived to provide a single open resource that was large enough to train modern models, broad enough to cover many everyday activities and cultures, and rich enough in metadata and annotations to support a family of well-defined tasks. The data was collected over roughly two years, with each partner institution recruiting local camera wearers and capturing video in the settings of their own daily lives, from cooking and cleaning to crafts, sports, shopping, and outdoor work.[1][2]
The released dataset comprises 3,670 hours of video captured by 931 unique camera wearers at 74 worldwide locations across 9 countries.[1][3] (The paper's title rounds the figure to "3,000 hours"; the headline statistics on the project's site have been revised slightly over time as the corpus was finalized.) The earliest October 2021 announcement cited interim totals of more than 2,200 hours and over 700 participants, which grew as collection completed.[2]
Camera wearers were recruited to reflect a range of backgrounds. About 45 percent of participants were female, two identified as non-binary, ages spanned from teens through people over 70, and occupations included bakers, carpenters, landscapers, mechanics, and many others rather than only students or researchers.[4] Footage was recorded on seven different head-mounted or wearable camera models: GoPro, Vuzix Blade smart glasses, Pupil Labs eye-tracking glasses, ZShades, ORDRO EP6, iVue Rincon 1080, and Weeview. Together these produced a mix of monocular, stereo, and gaze-tracked video.[4]
Privacy and ethics were treated as a core part of the collection design. Camera wearers gave informed consent, each institution followed its own research-ethics policy, and the team applied de-identification procedures to redact personally identifiable information such as faces and license plates, using a combination of commercial software, open-source tools, and manual review.[1][3]
| Attribute | Value |
|---|---|
| Total video | 3,670 hours |
| Unique camera wearers | 931 |
| Countries | 9 |
| Collection locations | 74 |
| Consortium institutions | 13 universities and labs, plus Meta (FAIR) |
| Narration sentences | About 3.85 million |
| Narration density | About 13.2 sentences per minute |
| Distinct verbs / nouns in narrations | About 1,772 verbs / 4,336 nouns |
| Camera devices used | 7 wearable models |
| Released | November 2021 (research access); paper presented at CVPR, June 2022 |
A defining feature of Ego4D is that all of the footage is densely narrated in free-form text. Annotators watched short clips, summarized them, then re-watched while pausing repeatedly to write a sentence about each thing the camera wearer did. This produced roughly 3.85 million timestamped sentences, an average density near 13.2 sentences per minute, drawing on about 1,772 distinct verbs and 4,336 distinct nouns.[4] These time-aligned narrations make the data usable for multimodal and language-grounded video understanding, and they later supported large-scale video-language pretraining work such as EgoVLP.
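To make "timestamped" concrete, the sketch below models narrations as time-stamped sentences and retrieves the ones that fall inside a clip window. The `Narration` class, its field names, and the sample sentences are hypothetical illustrations, not the schema or content of the released annotation files.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Narration:
    """One timestamped sentence (hypothetical schema, not the official one)."""
    time_sec: float
    text: str


def narrations_in_window(narrations, start_sec, end_sec):
    """Return narrations with start_sec <= time_sec <= end_sec.

    Assumes `narrations` is already sorted by time_sec.
    """
    times = [n.time_sec for n in narrations]
    lo = bisect_left(times, start_sec)   # first index at or after start_sec
    hi = bisect_right(times, end_sec)    # first index strictly after end_sec
    return narrations[lo:hi]


# Made-up sample data, for illustration only.
sample = [
    Narration(2.0, "The camera wearer picks up a knife."),
    Narration(6.5, "The camera wearer slices an onion."),
    Narration(11.0, "The camera wearer puts the knife down."),
    Narration(15.2, "The camera wearer opens the fridge."),
]

clip = narrations_in_window(sample, 5.0, 12.0)
print([n.text for n in clip])
```

Pairing each retrieved sentence with the frames around its timestamp is one simple way such data could feed a video-language model; actual pretraining pipelines may align text and video differently.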
Beyond video and text, portions of the corpus carry additional synchronized modalities, which is what makes the dataset suitable for tasks like 3D localization, gaze estimation, and audio-visual analysis. Not every hour has every modality; coverage varies by capture site and device.
| Modality | Approximate hours |
|---|---|
| Audio | 2,535 |
| 3D environment scans (meshes) | 612 |
| Stereo video | 491 |
| Multiple synchronized cameras (same event) | 224 |
| Eye gaze | 80 |
| Inertial measurement unit (IMU) | 45 |
The 3D scans capture the geometry of the environments, the stereo and gaze streams come from specialized glasses, and the synchronized multi-camera segments record the same event from several first-person viewpoints at once, which is useful for studying social settings.[3][4]
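As a rough illustration of how uneven this coverage is, the following sketch divides each modality's approximate hours from the table above by the 3,670-hour total. The variable and function names are illustrative, and the percentages are only as precise as the rounded figures they are computed from.

```python
TOTAL_HOURS = 3670  # total video hours reported for the dataset

# Approximate hours per modality, copied from the table above.
MODALITY_HOURS = {
    "audio": 2535,
    "3d_scans": 612,
    "stereo": 491,
    "multi_camera": 224,
    "eye_gaze": 80,
    "imu": 45,
}


def coverage_percent(hours, total=TOTAL_HOURS):
    """Share of the full corpus, in percent, that carries a given modality."""
    return 100.0 * hours / total


for name, hours in MODALITY_HOURS.items():
    print(f"{name:>12}: {coverage_percent(hours):5.1f}% of total hours")
```

By this arithmetic audio accompanies roughly 69 percent of the footage, while eye gaze and IMU each cover only a few percent, so tasks that depend on those streams draw on a much smaller subset.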
Ego4D ships with five benchmark tasks, each with its own annotations, train/validation/test splits, and evaluation metrics. The organizers group them by temporal perspective: understanding the past (episodic memory), the present (hands and objects, audio-visual diarization, social interactions), and the future (forecasting).[3][5]
| Benchmark | Goal | Representative sub-tasks |
|---|---|---|
| Episodic Memory | Make a person's past video queryable by localizing where an answer can be found | Natural-language queries, visual queries (2D and 3D), moment queries |
| Hands and Objects | Understand how the wearer changes an object's state through manipulation | Point-of-no-return temporal localization, state-change detection, state-change object detection |
| Audio-Visual Diarization | Determine who spoke, when, and what was said | Speaker localization and tracking, active-speaker detection, diarization, speech transcription |
| Social Interactions | Model conversational attention and address in multi-person scenes | "Looking at me" and "talking to me" detection |
| Forecasting | Anticipate the wearer's future motion and interactions | Locomotion (future trajectory) prediction, hand-movement prediction, short-term object-interaction anticipation, long-term action anticipation |
The benchmarks were released with baseline models and public leaderboards, and they have anchored recurring challenge events at major computer-vision workshops. Episodic Memory in particular framed a then-novel problem: rather than classifying or captioning a clip, a system must search hours of a wearer's own footage to find a specific moment, object, or answer.[5]
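Localizing a moment in long footage is commonly scored by temporal intersection over union (tIoU) between a predicted time window and the annotated one. The function below is a generic sketch of that measure, not the official Ego4D evaluation code; the function name and the example windows are illustrative assumptions.

```python
def temporal_iou(pred, truth):
    """Intersection over union of two (start_sec, end_sec) windows."""
    pred_start, pred_end = pred
    true_start, true_end = truth
    # Overlap length is zero when the windows do not intersect.
    intersection = max(0.0, min(pred_end, true_end) - max(pred_start, true_start))
    union = (pred_end - pred_start) + (true_end - true_start) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical prediction and annotation, in seconds from the start of the video.
predicted = (10.0, 20.0)
annotated = (15.0, 25.0)
print(temporal_iou(predicted, annotated))
```

Here the windows overlap for 5 seconds out of a combined 15, giving a tIoU of one third. A benchmark would typically count a prediction as correct only if its tIoU exceeds a chosen threshold.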
Meta describes Ego4D as the product of "a consortium of 13 universities and labs across nine countries," working alongside FAIR.[2] Partner institutions named across the project and paper include the University of Bristol, the University of Catania, the University of Tokyo, the National University of Singapore, the King Abdullah University of Science and Technology (KAUST), Carnegie Mellon University and CMU Africa, Georgia Tech, the Massachusetts Institute of Technology, the University of Minnesota, the University of Pennsylvania, Indiana University, the International Institute of Information Technology Hyderabad, and the Universidad de los Andes, with coordination by Meta AI and the University of Texas at Austin.[1][2] Each partner ran its own local recruitment and capture, which is the main reason the geographic and cultural spread is wider than in earlier egocentric datasets.
The dataset is distributed through ego4d-data.org under a data use agreement that restricts use to research and educational purposes, requires safeguarding of de-identified personal data, and asks users to cite the work. Researchers must sign this license to gain access; the data is not available as an unrestricted open download.[6]
Ego4D was widely covered as the largest egocentric video dataset of its kind on release, and it quickly became a standard resource for first-person perception research, wearable computing, and augmented-reality applications.[2] The annotations and benchmarks have been reused well beyond their original tasks, including for video-language model pretraining and for robot-learning research that treats human first-person video as a source of manipulation demonstrations.
The effort was extended in two notable ways. An expanded journal version, "Ego4D: Around the World in 3,600 Hours of Egocentric Video," appeared in IEEE Transactions on Pattern Analysis and Machine Intelligence in 2024.[7] Separately, the same broad consortium produced Ego-Exo4D, a follow-on dataset that pairs egocentric video with synchronized third-person (exocentric) views of skilled human activities, addressing a limitation of Ego4D's purely first-person capture.