Ego-Exo4D

Updated Jul 16, 2026

Ego-Exo4D is a large-scale, multimodal, multiview video dataset and benchmark suite for computer vision research on skilled human activity. Its defining feature is that each activity is recorded simultaneously from an egocentric (first-person, head-mounted) camera and several exocentric (third-person) cameras, so every action is captured from both points of view in time-synchronized video. The project was produced by a consortium of Meta AI's Fundamental AI Research (FAIR) lab, Meta's Project Aria team, and 15 university partners. It was announced on November 30, 2023, and the associated paper was presented as an oral at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024 ^[1]^[2]^[3].

The dataset is positioned as a successor to Ego4D, the 2021 first-person video dataset built by an overlapping consortium. Where Ego4D captured a broad span of everyday life seen only from the wearer's perspective, Ego-Exo4D narrows the activities to skill-based domains and adds synchronized third-person views and a richer sensor stack ^[3]^[4].

Concept and motivation

A long-standing problem in video understanding is that systems trained on third-person footage transfer poorly to first-person footage, and vice versa, because the two viewpoints differ sharply in framing, motion, and what is visible. An egocentric camera often cannot see the wearer's own body, while an exocentric camera cannot show exactly where the person is looking. By filming each take from both perspectives at once, Ego-Exo4D gives researchers paired ego and exo observations of identical events, which supports learning the correspondence between the two views. The authors argue that skilled activities, where there is a clear notion of doing something well or poorly, are a useful testbed because they expose fine-grained differences in technique that a model must perceive ^[3]^[4].

Scale and contents

The published CVPR 2024 figures describe 740 participants in 13 cities worldwide, performing activities in 123 different natural scene contexts, yielding 1,286 hours of video combined across 5,035 takes, with individual takes running from 1 to 42 minutes ^[1]^[3]. (The original November 2023 announcement reported a larger raw collection of roughly 1,422 hours from about 839 camera wearers across 131 scenes; the 1,286-hour, 740-participant figures correspond to the released and benchmarked version of the dataset, and are the numbers used in the peer-reviewed paper and on the official data site ^[2]^[5].)

Activities span eight skill domains, grouped loosely into procedural tasks (cooking, bike repair, health) and physical/performance skills (music, dance, soccer, basketball, rock climbing/bouldering). The per-domain breakdown from the official data site is below ^[5].

Domain	Takes	Participants	Sites	Hours
Cooking	678	173	60	564.1
Health	397	122	24	114.5
Bike repair	363	32	8	82.2
Music	276	59	8	180.1
Dance	728	93	7	106.6
Soccer	282	78	14	67.0
Basketball	910	113	5	78.0
Rock climbing	1,401	98	2	93.9
Total	5,035	740	123	1,286.3
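As a quick consistency check on the table, the per-domain rows can be summed and compared with the Total row. The sketch below copies the figures from the table; the variable and function names are illustrative only. Takes sum exactly and hours agree to within rounding, whereas the participant and site rows add up to more than the stated totals. That would be consistent with some participants and sites contributing to more than one domain, although the sources cited here do not state this explicitly.

```python
# Per-domain rows copied from the table: (takes, participants, sites, hours).
DOMAINS = {
    "Cooking": (678, 173, 60, 564.1),
    "Health": (397, 122, 24, 114.5),
    "Bike repair": (363, 32, 8, 82.2),
    "Music": (276, 59, 8, 180.1),
    "Dance": (728, 93, 7, 106.6),
    "Soccer": (282, 78, 14, 67.0),
    "Basketball": (910, 113, 5, 78.0),
    "Rock climbing": (1401, 98, 2, 93.9),
}

# The Total row as reported in the table.
REPORTED_TOTAL = (5035, 740, 123, 1286.3)
COLUMNS = ("takes", "participants", "sites", "hours")


def column_sums(rows):
    """Sum each column over all domains, rounding to one decimal place."""
    return tuple(round(sum(col), 1) for col in zip(*rows.values()))


sums = column_sums(DOMAINS)
for name, summed, reported in zip(COLUMNS, sums, REPORTED_TOTAL):
    diff = round(summed - reported, 1)
    print(f"{name}: sum of rows = {summed}, reported = {reported}, diff = {diff}")
```

A non-zero difference in the participant or site column is therefore not by itself a sign of an error in the table.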

The health domain consists of clinical and care scenarios such as COVID-19 testing and CPR on a manikin; like the recreational domains, these are skills with a clear standard of correct performance ^[3]^[4].

Modalities

The dataset is described by its authors as having an unprecedented degree of multimodal coverage. Egocentric capture used Meta's Project Aria glasses, and exocentric capture used four to five stationary GoPro cameras on tripods. The recorded streams and derived data include the following ^[3]^[5]^[6].

Modality	Source / detail
Egocentric video	Aria glasses: one RGB camera plus two wide-angle grayscale (SLAM) cameras
Exocentric video	4 to 5 stationary GoPro cameras
Multichannel audio	7-microphone array on Aria; stereo audio on GoPros
Eye gaze	3D gaze vectors estimated on the Aria device
Inertial sensing	Two IMUs on Aria, plus magnetometer and barometer
Camera poses	Calibrated 6-DoF localization for all cameras
3D structure	Sparse 3D point clouds of each scene
Language	Narrate-and-act, expert commentary, atomic action descriptions

The three language annotation types are a central contribution. Narrate-and-act descriptions are first-person, tutorial-style narrations recorded by participants for a subset of takes. Expert commentary was produced by 52 people with domain expertise, such as coaches and teachers, who reviewed takes and added spoken critiques explaining what was done well or badly; this yielded 117,812 time-stamped, video-aligned comments, and each take also received a single proficiency score from 1 to 10. Atomic action descriptions are roughly 432,000 short, one-sentence labels of individual actions that support search, mining, and video-language tasks ^[3]^[7].
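Because expert comments are time-stamped and aligned to the video, a common operation is to retrieve the comments that fall inside a given segment of a take. The following is a minimal sketch of that lookup. The `Comment` record, its field names, and the sample comments are hypothetical and invented for illustration; they are not the dataset's actual annotation schema, which is described in the official documentation ^[7].

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Comment:
    """Hypothetical record for one time-stamped expert comment."""
    time_s: float   # position in the take, in seconds (assumed unit)
    expert_id: str  # illustrative identifier, not a real dataset field
    text: str


def comments_in_window(comments: List[Comment], start_s: float, end_s: float) -> List[Comment]:
    """Return comments with start_s <= time_s <= end_s.

    Assumes `comments` is already sorted by time_s.
    """
    times = [c.time_s for c in comments]
    lo = bisect_left(times, start_s)
    hi = bisect_right(times, end_s)
    return comments[lo:hi]


# Made-up sample comments, sorted by time.
sample = [
    Comment(4.0, "expert_a", "Good stance before the shot."),
    Comment(12.5, "expert_a", "Elbow drifts outward on release."),
    Comment(18.0, "expert_b", "Follow-through is cut short."),
    Comment(31.2, "expert_b", "Better balance on this attempt."),
]

for c in comments_in_window(sample, 10.0, 20.0):
    print(f"{c.time_s:>5.1f}s  {c.expert_id}: {c.text}")
```

Binary search keeps each lookup logarithmic in the number of comments, which matters little for a single take but helps when scanning many segments.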

Benchmark tasks

Alongside the data, the consortium released a benchmark suite. The paper groups the challenges into four families: fine-grained activity understanding (keystep recognition), proficiency estimation, ego-exo relation (which includes correspondence between the views), and ego pose estimation ^[1]^[3].

Benchmark family	Goal
Keystep recognition	Recognize and segment the fine-grained procedural steps of an activity, in both ego and exo video
Proficiency estimation	Infer how skillfully a person is performing, from per-step feedback to an overall demonstration score
Ego-exo relation	Relate or align content across the two viewpoints, including correspondence and translation between ego and exo views
Ego pose	Recover 3D body and hand pose of the skilled movement from monocular egocentric video

These tasks share the premise that a model must reason jointly about the two viewpoints. For example, the relation benchmark asks a system to connect a teacher's demonstration seen from the outside (exo) with a learner's own first-person view (ego), a setup the authors motivate by analogy to how people learn skills by watching others and then attempting the action themselves ^[3]^[4].
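One geometric way to see how the two viewpoints relate is through the calibrated camera poses: a 3D point in the scene can be projected into both the ego and an exo camera to find where it appears in each image. The sketch below illustrates this under strong simplifying assumptions. It uses an ideal pinhole model with no lens distortion, whereas real wide-angle lenses need a distortion model that is omitted here; the `Camera` class, its fields, and all numeric values are hypothetical and do not reflect the dataset's calibration format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Camera:
    """Hypothetical pinhole camera with a world-to-camera 6-DoF pose."""
    rotation: Tuple[Vec3, Vec3, Vec3]  # 3x3 rotation matrix, row-major
    translation: Vec3                  # translation applied after rotation
    fx: float                          # focal lengths in pixels
    fy: float
    cx: float                          # principal point in pixels
    cy: float


def project(cam: Camera, point_world: Vec3) -> Optional[Tuple[float, float]]:
    """Project a world point to pixel coordinates, or None if behind the camera."""
    # Camera-frame coordinates: X_cam = R * X_world + t
    x, y, z = (
        sum(r * p for r, p in zip(row, point_world)) + t
        for row, t in zip(cam.rotation, cam.translation)
    )
    if z <= 0:
        return None
    return (cam.fx * x / z + cam.cx, cam.fy * y / z + cam.cy)


IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))

# Illustrative cameras: the "ego" camera sits at the world origin; the "exo"
# camera has the same orientation but its centre is at (0.5, 0, -1) in world
# coordinates, so its world-to-camera translation is (-0.5, 0, 1).
ego = Camera(IDENTITY, (0.0, 0.0, 0.0), 500.0, 500.0, 320.0, 240.0)
exo = Camera(IDENTITY, (-0.5, 0.0, 1.0), 500.0, 500.0, 320.0, 240.0)

point = (0.1, -0.05, 1.0)  # a made-up 3D point, e.g. a fingertip
print("ego pixel:", project(ego, point))
print("exo pixel:", project(exo, point))
```

The same point lands at different pixel locations in the two views, and that pairing is the kind of cross-view relationship the relation benchmark asks models to learn from video rather than from known geometry.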

Relationship to Ego4D

Ego-Exo4D builds directly on the Ego4D effort and reuses much of its consortium structure and tooling, while differing in three main ways: it pairs every egocentric recording with synchronized exocentric video rather than capturing first-person footage alone; it restricts the activities to skilled, evaluable domains rather than open-ended daily life; and it expands the sensor and annotation stack, most visibly with the expert-commentary language track and the proficiency benchmark. Both datasets were led by Kristen Grauman of the University of Texas at Austin and FAIR, with senior contributors including Lorenzo Torresani, Kris Kitani, Jitendra Malik, and Triantafyllos Afouras ^[1]^[3]^[4].

Participating institutions

The two-year collection effort involved 15 research institutions in addition to Meta. Named partners include Carnegie Mellon University, Georgia Tech, the International Institute of Information Technology, Hyderabad, Indiana University, the National University of Singapore, Simon Fraser University, the University of the Andes (Universidad de los Andes), the University of Minnesota, the University of North Carolina at Chapel Hill, the University of Pennsylvania, and the University of Tokyo, among others. Recording sites spanned several countries and multiple U.S. states ^[2]^[3].

Availability

The dataset, annotations, pre-extracted features, documentation, and benchmark code are openly released through the official site and the Ego4D download tooling, subject to a data-use license agreement. Egocentric capture relies on Project Aria, Meta's research glasses platform that provides synchronized multi-sensor recording and machine-perception services such as gaze and 6-DoF localization ^[2]^[5]^[6].

References

[1] Grauman, K., Westbury, A., Torresani, L., Kitani, K., Malik, J., Afouras, T., et al. "Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives." CVPR 2024 (CVF Open Access). https://openaccess.thecvf.com/content/CVPR2024/html/Grauman_Ego-Exo4D_Understanding_Skilled_Human_Activity_from_First-_and_Third-Person_Perspectives_CVPR_2024_paper.html
[2] Meta AI. "Introducing Ego-Exo4D: A foundational dataset for research on video learning and multimodal perception." Meta AI Blog, November 30, 2023. https://ai.meta.com/blog/ego-exo4d-video-learning-perception/
[3] Grauman, K., et al. "Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives." arXiv:2311.18259, v3. https://arxiv.org/abs/2311.18259
[4] "Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives." CVPR 2024 oral page. https://cvpr.thecvf.com/virtual/2024/oral/32013
[5] Ego-Exo4D official dataset site. https://ego-exo4d-data.org/
[6] "Ego-Exo4D Documentation: Overview." https://docs.ego-exo4d-data.org/overview/
[7] "Ego-Exo4D Documentation: Expert Commentary." https://docs.ego-exo4d-data.org/annotations/expert_commentary/