Ego4D

Updated Jun 28, 2026

Ego4D is a large-scale egocentric (first-person) video dataset and benchmark suite for computer vision, assembled by Meta AI (then Facebook AI Research) together with a consortium of 13 universities and labs across 9 countries.^[1]^[2] It contains 3,670 hours of unscripted daily-life video recorded by 931 camera wearers at 74 locations worldwide, paired with dense text narrations and five benchmark tasks for understanding what a person sees and does from their own point of view.^[1]^[3] Introduced by Meta in October 2021 and described in the CVPR 2022 paper "Ego4D: Around the World in 3,000 Hours of Egocentric Video" (Grauman et al.), it was the largest egocentric video dataset of its kind at release and remains a standard resource for first-person perception, wearable computing, and augmented-reality research.^[1]^[2]^[3]

After the October 2021 announcement, the dataset was made available to researchers later that year, and the accompanying paper was published at the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).^[1]^[3] The lead author is Kristen Grauman of Meta AI and the University of Texas at Austin, with roughly 80 co-authors.^[3]

The name reflects the project's framing: egocentric video understood across four dimensions ("4D"), meaning the three spatial dimensions of the scene plus time. The motivation, as Grauman put it in the launch announcement, was that next-generation AI "will need to learn from an entirely different kind of data, videos that show the world from the center of the action, rather than the sidelines."^[2] Most large video datasets at the time consisted of third-person footage from television, film, or web clips, which is poorly matched to the wearable cameras and augmented-reality devices that motivate first-person perception.

What is Ego4D?

Ego4D is an open research dataset of first-person video plus a suite of five machine-learning benchmarks built on top of it. Where most prior video datasets show events from the outside (third-person, or exocentric), Ego4D is recorded by people wearing head-mounted cameras as they go about ordinary activities, so every frame shows the world from the camera wearer's own viewpoint. The goal is to train and evaluate AI systems that can perceive, remember, anticipate, and reason about daily life the way a wearable assistant or augmented-reality device would have to.^[1]^[2]

Why was Ego4D created?

Egocentric video had been studied for years through smaller datasets such as EPIC-Kitchens, EGTEA Gaze+, and Charades-Ego, but these were limited in hours, geography, or scenario diversity. Ego4D was conceived to provide a single open resource that was large enough to train modern models, broad enough to cover many everyday activities and cultures, and rich enough in metadata and annotations to support a family of well-defined tasks. The data was collected over roughly two years, with each partner institution recruiting local camera wearers and capturing video in the settings of their own daily lives, from cooking and cleaning to crafts, sports, shopping, and outdoor work.^[1]^[2]

How big is the Ego4D dataset?

The released dataset comprises 3,670 hours of video captured by 931 unique camera wearers at 74 worldwide locations across 9 countries.^[1]^[3] (The paper's title rounds the figure to "3,000 hours"; the headline statistics on the project's site have been revised slightly over time as the corpus was finalized.) The earliest October 2021 announcement cited interim totals of more than 2,200 hours and over 700 participants, which grew as collection completed.^[2]

Camera wearers were recruited to reflect a range of backgrounds. About 45 percent of participants were female, two identified as non-binary, ages ranged from the teens to over 70, and occupations included bakers, carpenters, landscapers, mechanics, and many others rather than only students or researchers.^[4] Footage was recorded on seven different head-mounted or wearable camera models: GoPro, Vuzix Blade smart glasses, Pupil Labs eye-tracking glasses, ZShades, ORDRO EP6, iVue Rincon 1080, and Weeview. Together these produced a mix of monocular, stereo, and gaze-tracked video.^[4]

Privacy and ethics were treated as a core part of the collection design. Camera wearers gave informed consent, each institution followed its own research-ethics policy, and the team applied de-identification procedures to redact personally identifiable information such as faces and license plates, using a combination of commercial software, open-source tools, and manual review.^[1]^[3]

Dataset statistics

Attribute	Value
Total video	3,670 hours
Unique camera wearers	931
Countries	9
Collection locations	74
Consortium institutions	13 universities and labs, plus Meta (FAIR)
Narration sentences	About 3.85 million
Narration density	About 13.2 sentences per minute
Distinct verbs / nouns in narrations	About 1,772 verbs / 4,336 nouns
Camera devices used	7 wearable models
Released	November 2021 (research access); CVPR June 2022

What annotations and modalities does Ego4D include?

A defining feature of Ego4D is that all of the footage is densely narrated in free-form text. Annotators watched short clips, summarized them, then re-watched while pausing repeatedly to write a sentence about each thing the camera wearer did. This produced roughly 3.85 million timestamped sentences, an average density near 13.2 sentences per minute, drawing on about 1,772 distinct verbs and 4,336 distinct nouns.^[4] These time-aligned narrations make the data usable for multimodal and language-grounded video understanding, and they later supported large-scale video-language pretraining work such as EgoVLP.
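To make the idea of time-aligned narrations concrete, here is a minimal sketch of how timestamped sentences might be held in memory and how a per-minute density could be computed for a single clip. The `Narration` class, its field names, the helper functions, and the sample sentences are all illustrative assumptions; they are not the dataset's actual annotation schema or tooling.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Narration:
    """One timestamped sentence (hypothetical schema, not Ego4D's real format)."""
    timestamp_sec: float  # point in the clip that the sentence describes
    text: str


def narration_density(narrations: Sequence[Narration], clip_duration_sec: float) -> float:
    """Return the number of narration sentences per minute for one clip."""
    if clip_duration_sec <= 0:
        raise ValueError("clip_duration_sec must be positive")
    return len(narrations) / (clip_duration_sec / 60.0)


def narrations_between(
    narrations: Sequence[Narration], start_sec: float, end_sec: float
) -> List[Narration]:
    """Select narrations whose timestamp falls in the window [start_sec, end_sec)."""
    return [n for n in narrations if start_sec <= n.timestamp_sec < end_sec]


# Made-up example: four sentences over a 30-second clip.
clip = [
    Narration(2.0, "The camera wearer picks up a knife."),
    Narration(6.5, "The camera wearer slices an onion."),
    Narration(14.0, "The camera wearer puts the knife down."),
    Narration(22.0, "The camera wearer opens a drawer."),
]

density = narration_density(clip, clip_duration_sec=30.0)
first_ten_seconds = narrations_between(clip, 0.0, 10.0)
print(f"{density:.1f} sentences per minute")
print([n.text for n in first_ten_seconds])
```

Keeping the timestamp on every sentence is what allows a text query to be mapped back to a position in the video, which is the property that language-grounded tasks rely on.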

Beyond video and text, portions of the corpus carry additional synchronized modalities, which is what makes the dataset suitable for tasks like 3D localization, gaze estimation, and audio-visual analysis. Not every hour has every modality; coverage varies by capture site and device.

Modality	Approximate hours
Audio	2,535
3D environment scans (meshes)	612
Stereo video	491
Multiple synchronized cameras (same event)	224
Eye gaze	80
Inertial measurement unit (IMU)	45

The 3D scans capture the geometry of the environments, the stereo and gaze streams come from specialized glasses, and the synchronized multi-camera segments record the same event from several first-person viewpoints at once, which is useful for studying social settings.^[3]^[4]
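The uneven coverage is easier to see as a share of the whole corpus. The short sketch below divides each approximate figure from the table by the 3,670-hour total. It assumes every modality's hours are a subset of that total, which the table implies but does not state outright, and the dictionary and function names are chosen here for illustration only.

```python
TOTAL_HOURS = 3670  # total video hours reported for the released dataset

# Approximate hours per modality, copied from the table above.
MODALITY_HOURS = {
    "Audio": 2535,
    "3D environment scans (meshes)": 612,
    "Stereo video": 491,
    "Multiple synchronized cameras": 224,
    "Eye gaze": 80,
    "IMU": 45,
}


def coverage_percent(hours: float, total: float = TOTAL_HOURS) -> float:
    """Return the share of the corpus (in percent) that carries a modality."""
    return 100.0 * hours / total


coverage = {name: round(coverage_percent(h), 1) for name, h in MODALITY_HOURS.items()}

for name, pct in coverage.items():
    print(f"{name:<32} {pct:5.1f}%")
```

Because a single hour of video can carry several modalities at once, these shares overlap and are not expected to sum to 100 percent.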

What are the Ego4D benchmark tasks?

Ego4D ships with five benchmark tasks, each with its own annotations, train/validation/test splits, and evaluation metrics. The organizers group them by temporal perspective: understanding the past (episodic memory), the present (hands and objects, audio-visual diarization, social interactions), and the future (forecasting).^[3]^[5]

Benchmark	Goal	Representative sub-tasks
Episodic Memory	Make a person's past video queryable by localizing where an answer can be found	Natural-language queries, visual queries (2D and 3D), moment queries
Hands and Objects	Understand how the wearer changes an object's state through manipulation	Point-of-no-return temporal localization, state-change detection, state-change object detection
Audio-Visual Diarization	Determine who spoke, when, and what was said	Speaker localization and tracking, active-speaker detection, diarization, speech transcription
Social Interactions	Model conversational attention and address in multi-person scenes	"Looking at me" and "talking to me" detection
Forecasting	Anticipate the wearer's future motion and interactions	Locomotion (future trajectory) prediction, hand-movement prediction, short-term object-interaction anticipation, long-term action anticipation

The benchmarks were released with baseline models and public leaderboards, and they have anchored recurring challenge events at major computer-vision workshops. Episodic Memory in particular framed a then-novel problem: rather than classifying or captioning a clip, a system must search hours of a wearer's own footage to find a specific moment, object, or answer.^[5]
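Finding a moment in long footage is commonly scored by how well a predicted time window overlaps the annotated one. The sketch below computes temporal intersection-over-union (tIoU) between two windows and a simple recall at a chosen threshold. It illustrates this family of metrics under assumed names and made-up numbers; the benchmark documentation, not this sketch, defines the official evaluation protocol.

```python
from typing import Sequence, Tuple

Window = Tuple[float, float]  # (start_sec, end_sec), hypothetical representation


def temporal_iou(pred: Window, truth: Window) -> float:
    """Intersection-over-union of two time windows, in the range [0, 1]."""
    intersection = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_iou(
    preds: Sequence[Window], truths: Sequence[Window], threshold: float = 0.5
) -> float:
    """Fraction of queries whose single predicted window reaches the tIoU threshold."""
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")
    if not truths:
        return 0.0
    hits = sum(temporal_iou(p, t) >= threshold for p, t in zip(preds, truths))
    return hits / len(truths)


# Made-up predictions and annotations for three queries.
predicted = [(10.0, 20.0), (0.0, 10.0), (30.0, 40.0)]
annotated = [(15.0, 25.0), (0.0, 10.0), (50.0, 60.0)]

print(temporal_iou(predicted[0], annotated[0]))  # partial overlap
print(recall_at_iou(predicted, annotated, threshold=0.3))
print(recall_at_iou(predicted, annotated, threshold=0.5))
```

A looser threshold credits predictions that only roughly cover the right moment, while a stricter one rewards precise boundaries, so reporting more than one threshold gives a fuller picture of a system's localization quality.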

Who built Ego4D?

Meta describes Ego4D as the product of "a consortium of 13 universities and labs across nine countries," working alongside FAIR.^[2] Partner institutions named across the project and paper include the University of Bristol, the University of Catania, the University of Tokyo, the National University of Singapore, the King Abdullah University of Science and Technology (KAUST), Carnegie Mellon University and CMU Africa, Georgia Tech, the Massachusetts Institute of Technology, the University of Minnesota, the University of Pennsylvania, Indiana University, the International Institute of Information Technology Hyderabad, and the Universidad de los Andes, with coordination by Meta AI and the University of Texas at Austin.^[1]^[2] Each partner ran its own local recruitment and capture, which is the main reason the geographic and cultural spread is wider than in earlier egocentric datasets.

How do you access Ego4D, and is it open?

The dataset is distributed through ego4d-data.org under a data use agreement that restricts use to research and educational purposes, requires safeguarding of de-identified personal data, and asks users to cite the work. Access requires signing this agreement; there is no unrestricted open download. Ego4D is therefore openly available for research but is not released under a permissive open-source or public-domain license.^[6]

What impact has Ego4D had?

Ego4D was widely covered as the largest egocentric video dataset of its kind on release, and it quickly became a standard resource for first-person perception research, wearable computing, and augmented-reality applications.^[2] The annotations and benchmarks have been reused well beyond their original tasks, including for video-language model pretraining and for robot-learning research that treats human first-person video as a source of manipulation demonstrations.

The effort was extended in two notable ways. An expanded journal version, "Ego4D: Around the World in 3,600 Hours of Egocentric Video," appeared in IEEE Transactions on Pattern Analysis and Machine Intelligence in 2024.^[7] Separately, the same broad consortium produced Ego-Exo4D, a follow-on dataset that pairs egocentric video with synchronized third-person (exocentric) views of skilled human activities, addressing a limitation of Ego4D's purely first-person capture.

How is Ego4D different from Ego-Exo4D?

Ego4D records only the first-person (egocentric) viewpoint of everyday activities, whereas Ego-Exo4D captures the same skilled activity simultaneously from a head-mounted egocentric camera and several external (exocentric) cameras. Ego-Exo4D is therefore aimed at learning correspondences between first-person and third-person views of the same action, for example to study how a task is performed by an expert, while Ego4D focuses on perceiving, remembering, and forecasting daily life from the wearer's own perspective.

References

1. Grauman, K., et al. "Ego4D: Around the World in 3,000 Hours of Egocentric Video." arXiv preprint arXiv:2110.07058, 2021. https://arxiv.org/abs/2110.07058
2. Meta AI. "Teaching AI to perceive the world through your eyes." Meta AI Blog, October 14, 2021. https://ai.meta.com/blog/teaching-ai-to-perceive-the-world-through-your-eyes/
3. Grauman, K., et al. "Ego4D: Around the World in 3,000 Hours of Egocentric Video." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 18995-19012. https://openaccess.thecvf.com/content/CVPR2022/html/Grauman_Ego4D_Around_the_World_in_3000_Hours_of_Egocentric_Video_CVPR_2022_paper.html
4. "Ego4D: A Large-Scale Egocentric Video Benchmark." Emergent Mind topic summary (citing Grauman et al.). https://www.emergentmind.com/topics/ego4d-dataset
5. Ego4D Consortium. "Benchmarks Overview." Ego4D documentation. https://ego4d-data.org/docs/benchmarks/overview/
6. Ego4D Consortium. "Egocentric 4D Perception (Ego4D)" project site and data use agreement. https://ego4d-data.org/
7. Grauman, K., et al. "Ego4D: Around the World in 3,600 Hours of Egocentric Video." IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. https://ieeexplore.ieee.org/document/10611736/
