Ego-Exo4D
Last reviewed
Jun 3, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 1,336 words
Ego-Exo4D is a large-scale, multimodal, multiview video dataset and benchmark suite for computer vision research on skilled human activity. Its defining feature is that the same activity is recorded at the same time from an egocentric (first-person, head-mounted) camera and several exocentric (third-person) cameras, so every action is observed simultaneously from both points of view. The project was produced by a consortium of Meta AI's Fundamental AI Research (FAIR) lab, Meta's Project Aria team, and 15 university partners. It was announced on November 30, 2023, and an associated paper was presented as an oral at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024 [1][2][3].
The dataset is positioned as a successor to Ego4D, the 2021 first-person video dataset built by an overlapping consortium. Where Ego4D captured a broad span of everyday life seen only from the wearer's perspective, Ego-Exo4D narrows the activities to skill-based domains and adds synchronized third-person views along with a much richer sensor stack [3][4].
A long-standing problem in video understanding is that systems trained on third-person footage transfer poorly to first-person footage, and vice versa, because the two viewpoints differ sharply in framing, motion, and what is visible. An egocentric camera often cannot see the wearer's own body, while an exocentric camera cannot show exactly where the person is looking. By filming the same take from both perspectives at once, Ego-Exo4D gives researchers paired ego and exo observations of identical events, which supports learning the correspondence between the two views. The authors argue that skilled activities, where there is a clear notion of doing something well or poorly, are a useful testbed because they expose fine-grained differences in technique that a model must perceive [3][4].
The published CVPR 2024 figures describe 740 participants in 13 cities worldwide, performing activities in 123 different natural scene contexts, yielding 1,286 hours of video combined across 5,035 takes, with individual takes running from 1 to 42 minutes [1][3]. (The original November 2023 announcement reported a larger raw collection of roughly 1,422 hours from about 839 camera wearers across 131 scenes; the 1,286-hour, 740-participant figures correspond to the released and benchmarked version of the dataset, and are the numbers used in the peer-reviewed paper and on the official data site [2][5].)
Activities span eight skill domains, grouped loosely into procedural tasks (cooking, bike repair, health) and physical/performance skills (music, dance, soccer, basketball, rock climbing/bouldering). The per-domain breakdown from the official data site is below [5].
| Domain | Takes | Participants | Sites | Hours |
|---|---|---|---|---|
| Cooking | 678 | 173 | 60 | 564.1 |
| Health | 397 | 122 | 24 | 114.5 |
| Bike repair | 363 | 32 | 8 | 82.2 |
| Music | 276 | 59 | 8 | 180.1 |
| Dance | 728 | 93 | 7 | 106.6 |
| Soccer | 282 | 78 | 14 | 67.0 |
| Basketball | 910 | 113 | 5 | 78.0 |
| Rock climbing | 1,401 | 98 | 2 | 93.9 |
| Total | 5,035 | 740 | 123 | 1,286.3 |
The health domain covers clinical and care scenarios such as COVID-19 testing and CPR on a manikin, skilled procedures that place it alongside the more recreational domains [3][4].
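The totals in the per-domain table can be checked with a few lines of arithmetic. The Python sketch below copies the takes and hours columns from the table, sums them, and derives the recorded video minutes per take in each domain. Because the hours are described as video combined, apparently pooled over the cameras in a take, this ratio should not be read as a take's wall-clock duration, which the text gives as 1 to 42 minutes. The names `DOMAINS` and `video_minutes_per_take` are illustrative and are not part of any released tooling.

```python
# (takes, hours) per domain, copied from the per-domain table above.
DOMAINS = {
    "Cooking": (678, 564.1),
    "Health": (397, 114.5),
    "Bike repair": (363, 82.2),
    "Music": (276, 180.1),
    "Dance": (728, 106.6),
    "Soccer": (282, 67.0),
    "Basketball": (910, 78.0),
    "Rock climbing": (1401, 93.9),
}


def video_minutes_per_take(takes: int, hours: float) -> float:
    """Recorded video minutes per take (not wall-clock take duration)."""
    return hours * 60 / takes


# Column sums, to compare against the table's Total row.
total_takes = sum(takes for takes, _ in DOMAINS.values())
total_hours = sum(hours for _, hours in DOMAINS.values())

if __name__ == "__main__":
    print(f"takes: {total_takes}, hours: {total_hours:.1f}")
    for name, (takes, hours) in DOMAINS.items():
        minutes = video_minutes_per_take(takes, hours)
        print(f"{name:14s} {minutes:6.1f} video min/take")
```

The take counts sum exactly to 5,035, and the hours column sums to about 1,286.4, within rounding of the listed 1,286.3. The participant and site columns sum to 768 and 128 rather than the listed 740 and 123, presumably because some participants and sites contribute to more than one domain.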
The dataset is described by its authors as having an unprecedented degree of multimodal coverage. Egocentric capture used Meta's Project Aria glasses, and exocentric capture used four to five stationary GoPro cameras on tripods. The recorded streams and derived data include the following [3][5][6].
| Modality | Source / detail |
|---|---|
| Egocentric video | Aria glasses: one RGB camera plus two wide-angle grayscale (SLAM) cameras |
| Exocentric video | 4 to 5 stationary GoPro cameras |
| Multichannel audio | 7-microphone array on Aria; stereo audio on GoPros |
| Eye gaze | 3D gaze vectors estimated on the Aria device |
| Inertial sensing | Two IMUs on Aria, plus magnetometer and barometer |
| Camera poses | Calibrated 6-DoF localization for all cameras |
| 3D structure | Sparse 3D point clouds of each scene |
| Language | Narrate-and-act, expert commentary, atomic action descriptions |
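As a rough mental model of the table above, a single take can be pictured as one egocentric stream bundled with several exocentric streams and their side-channel data. The Python sketch below is a hypothetical container, not the dataset's released schema: the class name `Take`, every field name, and the file paths are assumptions made for illustration. The only constraint it encodes from the text is the count of four to five GoPro cameras.

```python
from dataclasses import dataclass, field


@dataclass
class Take:
    """One recording: a single ego stream plus several exo streams.

    Hypothetical structure for illustration; field names are assumptions.
    """

    take_id: str
    domain: str                      # e.g. "Cooking"
    ego_video: str                   # path to the Aria RGB video
    exo_videos: list[str]            # paths to the stationary GoPro videos
    has_gaze: bool = True            # 3D gaze vectors from the Aria device
    has_point_cloud: bool = True     # sparse 3D point cloud of the scene
    narrations: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The text describes four to five stationary GoPro cameras per take.
        if not 4 <= len(self.exo_videos) <= 5:
            raise ValueError("expected 4 or 5 exocentric videos")

    def views(self) -> list[str]:
        """All video views of the take, ego first."""
        return [self.ego_video, *self.exo_videos]


if __name__ == "__main__":
    take = Take(
        take_id="example_take",      # placeholder identifier
        domain="Cooking",
        ego_video="ego/rgb.mp4",     # placeholder paths
        exo_videos=[f"exo/cam{i}.mp4" for i in range(1, 5)],
    )
    print(len(take.views()), "synchronized views")
```

Keeping the ego view first in `views()` is a design convenience: downstream code can treat index 0 as the wearer's perspective without inspecting names.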
The three language annotation types are a central contribution. Narrate-and-act descriptions are first-person, tutorial-style narrations recorded by participants for a subset of takes. Expert commentary was produced by 52 people with domain expertise, such as coaches and teachers, who reviewed takes and added spoken critiques explaining what was done well or badly; this yielded 117,812 time-stamped, video-aligned comments, and each take also received a single proficiency score from 1 to 10. Atomic action descriptions are roughly 432,000 short, one-sentence labels of individual actions that support search, mining, and video-language tasks [3][7].
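Because the expert comments are time-stamped and aligned to the video, a common operation is retrieving the comments that fall within a given stretch of a take. The Python sketch below shows one way to do that with a binary search over sorted timestamps. The `Comment` class, its field names, and the sample comments are invented for illustration and do not reflect the released annotation format.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    time_s: float   # seconds from the start of the take (assumed unit)
    text: str


def comments_in_window(
    comments: list[Comment], start_s: float, end_s: float
) -> list[Comment]:
    """Return comments with start_s <= time_s <= end_s.

    `comments` must already be sorted by `time_s`.
    """
    times = [c.time_s for c in comments]
    lo = bisect_left(times, start_s)    # first index at or after start_s
    hi = bisect_right(times, end_s)     # one past the last index at or before end_s
    return comments[lo:hi]


if __name__ == "__main__":
    # Invented sample data, not taken from the dataset.
    sample = [
        Comment(2.0, "Good stance before the shot."),
        Comment(7.5, "Elbow drifts outward here."),
        Comment(12.0, "Follow-through is cut short."),
    ]
    for c in comments_in_window(sample, 5.0, 12.0):
        print(f"{c.time_s:5.1f}s  {c.text}")
```

Rebuilding the `times` list on every call keeps the sketch short; for repeated queries over a long take it would be cheaper to compute it once and reuse it.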
Alongside the data, the consortium released a benchmark suite. The paper groups the challenges into four families: fine-grained activity understanding (keystep recognition), proficiency estimation, ego-exo relation (also called correspondence), and ego pose estimation [1][3].
| Benchmark family | Goal |
|---|---|
| Keystep recognition | Recognize and segment the fine-grained procedural steps of an activity, in both ego and exo video |
| Proficiency estimation | Infer how skillfully a person is performing, from per-step feedback to an overall demonstration score |
| Ego-exo relation | Relate or align content across the two viewpoints, including correspondence and translation between ego and exo views |
| Ego pose | Recover 3D body and hand pose of the skilled movement from monocular egocentric video |
These tasks share the premise that a model must reason jointly about the two viewpoints. For example, the relation benchmark asks a system to connect a teacher's demonstration seen from the outside (exo) with a learner's own first-person view (ego), a setup the authors motivate by analogy to how people learn skills by watching others and then attempting the action themselves [3][4].
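One reason calibrated 6-DoF camera poses matter for relating the two viewpoints is that a single 3D point in the scene can be projected into every camera, giving a geometric link between ego and exo pixels. The Python sketch below illustrates that idea under an idealized pinhole model. It is not the benchmark's method, and real wide-angle lenses would need a proper distortion model. The class `PinholeCamera`, its fields, and all numeric values are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

Vec3 = tuple[float, float, float]


@dataclass(frozen=True)
class PinholeCamera:
    """Idealized camera: world-to-camera pose plus simple intrinsics."""

    rotation: tuple[Vec3, Vec3, Vec3]   # row-major 3x3, world -> camera
    translation: Vec3                   # world -> camera offset
    focal: float                        # focal length in pixels
    center: tuple[float, float]         # principal point in pixels

    def project(self, point: Vec3) -> Optional[tuple[float, float]]:
        """Project a world point to pixel coordinates, or None if behind the camera."""
        # Apply the rigid transform: p_cam = R @ p_world + t.
        xc, yc, zc = (
            sum(r * p for r, p in zip(row, point)) + t
            for row, t in zip(self.rotation, self.translation)
        )
        if zc <= 0:
            return None
        # Perspective division followed by the intrinsic mapping.
        return (
            self.focal * xc / zc + self.center[0],
            self.focal * yc / zc + self.center[1],
        )


IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))

if __name__ == "__main__":
    # Two hypothetical cameras observing the same world point.
    ego = PinholeCamera(IDENTITY, (0.0, 0.0, 0.0), 500.0, (320.0, 240.0))
    exo = PinholeCamera(IDENTITY, (0.0, 0.0, 4.0), 500.0, (320.0, 240.0))
    hand = (1.0, 0.0, 2.0)
    print("ego pixel:", ego.project(hand))
    print("exo pixel:", exo.project(hand))
```

The same world point lands at different pixels in each view, and the known poses are what make that mapping computable rather than something a model must guess.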
Ego-Exo4D builds directly on the Ego4D effort and reuses much of its consortium structure and tooling, while differing in three main ways: it pairs every egocentric recording with synchronized exocentric video rather than capturing first-person footage alone; it restricts the activities to skilled, evaluable domains rather than open-ended daily life; and it expands the sensor and annotation stack, most visibly with the expert-commentary language track and the proficiency benchmark. Both datasets were led by Kristen Grauman of the University of Texas at Austin and FAIR, with senior contributors including Lorenzo Torresani, Kris Kitani, Jitendra Malik, and Triantafyllos Afouras [1][3][4].
The two-year collection effort involved 15 research institutions in addition to Meta. Named partners include Carnegie Mellon University, Georgia Tech, the International Institute of Information Technology, Hyderabad, Indiana University, the National University of Singapore, Simon Fraser University, the University of the Andes (Universidad de los Andes), the University of Minnesota, the University of North Carolina at Chapel Hill, the University of Pennsylvania, and the University of Tokyo, among others. Recording sites spanned several countries and multiple U.S. states [2][3].
The dataset, annotations, pre-extracted features, documentation, and benchmark code are openly released through the official site and the Ego4D download tooling, subject to a data-use license agreement. Egocentric capture relies on Project Aria, Meta's research glasses platform that provides synchronized multi-sensor recording and machine-perception services such as gaze and 6-DoF localization [2][5][6].