# Pose estimation

> Source: https://aiwiki.ai/wiki/pose_estimation
> Updated: 2026-06-23
> Categories: Computer Vision, Deep Learning, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Pose estimation** is the [computer vision](/wiki/computer_vision) task of detecting and localizing the keypoints (also called landmarks or joints) of a human body, hand, face, animal, or rigid object in images and video, then connecting them into a skeleton or recovering an object's orientation. Given an input image, a pose estimation system outputs a set of 2D or 3D coordinates for predefined keypoints such as elbows, wrists, knees, or facial features. The dominant human-pose benchmark, the [COCO](/wiki/coco_dataset) keypoint dataset, defines 17 body keypoints, and the field's standard accuracy metric is Object Keypoint Similarity (OKS), which scores predictions from 0 to 1 against scale-normalized ground truth.[17] The resulting skeleton captures the spatial configuration of a body and drives applications ranging from sports analytics and [augmented reality](/wiki/augmented_reality) to markerless motion capture and humanoid-robot teleoperation.

Pose estimation has become one of the most actively researched problems in computer vision since the introduction of [deep learning](/wiki/deep_learning)-based methods in 2014. The first system to formulate it as a deep neural network regression problem, DeepPose, appeared at CVPR 2014, and the first real-time multi-person system, OpenPose, won the inaugural COCO 2016 Keypoints Challenge.[1][3] The most accurate modern models exceed 80 AP on COCO, while lightweight real-time models run at hundreds of frames per second on a GPU and remain practical on mobile devices and edge hardware. This article covers the fundamentals of 2D and 3D pose estimation, key architectural paradigms, landmark models, the OKS-based evaluation metrics, major datasets, 6DoF object pose, and applications.

## What does a pose estimation model output?

A pose estimation model outputs, for each detected instance, a list of keypoint coordinates plus a per-keypoint confidence or visibility score. For 2D human pose the output is the pixel coordinates (x, y) of each joint; for 3D pose it is (x, y, z) in metric or camera-relative space; for 6DoF object pose it is a 3D rotation plus a 3D translation that together place a known object model in the camera frame.[18] The choice of keypoint set is defined by the dataset: [COCO](/wiki/coco_dataset) uses 17 body keypoints, MPII uses 16, MediaPipe's BlazePose uses 33 body landmarks, and a hand skeleton uses 21.[9][17]

## 2D Pose Estimation

2D pose estimation aims to predict the pixel coordinates (x, y) of each keypoint from a single RGB image. The output is a 2D skeleton that overlays the original image. Most modern approaches use a [convolutional neural network](/wiki/convolutional_neural_network) (CNN) or [vision transformer](/wiki/vision_transformer) (ViT) backbone to extract features from the input image, followed by a decoder head that produces either heatmaps or direct coordinate predictions.

### Heatmap-Based Methods

The dominant paradigm in 2D pose estimation represents each keypoint as a 2D Gaussian heatmap. The network outputs one heatmap per keypoint, where the intensity at each spatial location reflects the confidence that the keypoint is located there. The final keypoint position is obtained by finding the peak (argmax) of each heatmap. This approach was popularized by Tompson et al. (2014) and became the standard after the success of the Stacked Hourglass network (Newell et al., 2016).[2]
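The heatmap representation can be illustrated with a minimal sketch. The grid size, the keypoint location, and the `sigma` value are assumptions chosen for illustration; real networks predict the heatmap rather than rendering it, and many decoders add sub-pixel refinement on top of the plain argmax shown here.

```python
import math

def gaussian_heatmap(width, height, cx, cy, sigma=1.5):
    """Render a 2D Gaussian centred on (cx, cy) as a list of rows."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def decode_heatmap(heatmap):
    """Return (x, y, score) of the heatmap peak (argmax)."""
    best = (0, 0, float("-inf"))
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best[2]:
                best = (x, y, value)
    return best

# One heatmap per keypoint; here a single hypothetical keypoint at (12, 5).
heatmap = gaussian_heatmap(width=16, height=8, cx=12, cy=5)
x, y, score = decode_heatmap(heatmap)
print(x, y, round(score, 3))  # prints: 12 5 1.0
```

The same Gaussian rendering is typically used to build training targets from annotated keypoints, so encoding and decoding are mirror images of each other.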

### Regression-Based Methods

An alternative approach directly regresses the (x, y) coordinates of each keypoint from image features. DeepPose (Toshev and Szegedy, 2014) was the first work to formulate pose estimation as a direct regression problem using deep neural networks.[1] While regression methods are simpler, heatmap-based methods generally achieve higher accuracy because heatmaps preserve spatial information and are easier to optimize. However, recent coordinate classification methods such as SimCC have narrowed this gap.[8]

## 3D Pose Estimation

3D pose estimation extends the task to predict (x, y, z) coordinates, recovering the depth dimension that is lost in 2D projections. There are two primary strategies:

**Lifting-based methods** first run a 2D pose estimator to detect keypoints, then use a separate network to "lift" the 2D skeleton into 3D space. This two-stage approach leverages the strength of mature 2D detectors and the availability of large 2D training datasets. Martinez et al. (2017) demonstrated that a simple fully connected network could achieve competitive 3D pose accuracy when given accurate 2D keypoints as input.[11]

**End-to-end methods** directly predict 3D joint positions from raw images without an intermediate 2D step. These methods typically require training data with ground-truth 3D annotations, which are more expensive to acquire and often collected using multi-camera motion capture systems.

3D pose estimation has additional challenges, including depth ambiguity (multiple 3D configurations can produce the same 2D projection), self-occlusion, and the limited availability of in-the-wild 3D annotated data.
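Depth ambiguity follows directly from the pinhole camera model. The sketch below assumes hypothetical intrinsics (focal length and principal point) and shows two different 3D points along the same viewing ray landing on the same pixel.

```python
def project(point, focal=1000.0, cx=320.0, cy=240.0):
    """Pinhole projection of a camera-frame 3D point (X, Y, Z) to pixels."""
    X, Y, Z = point
    return (focal * X / Z + cx, focal * Y / Z + cy)

# Two candidate 3D positions for the same joint: the second is the first
# scaled along the viewing ray, so it is farther away but projects identically.
near = (0.2, -0.1, 2.0)
far = tuple(2.5 * c for c in near)

pixel_near = project(near)
pixel_far = project(far)
print(pixel_near, pixel_far)
```

A single image therefore cannot distinguish the two hypotheses; lifting networks resolve the ambiguity with learned priors over plausible skeletons.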

## Multi-Person Pose Estimation

When multiple people appear in an image, the system must detect all individuals and assign keypoints to the correct person. Two paradigms address this challenge.

### Top-Down Approach

The top-down approach first uses a person detector (such as [Faster R-CNN](/wiki/object_detection) or [YOLO](/wiki/yolo)) to locate bounding boxes around each individual, then applies a single-person pose estimator independently within each cropped region. This paradigm generally achieves higher accuracy because the pose model can focus on one person at a time with normalized scale. The drawback is that runtime scales linearly with the number of people in the image, since every detected person requires a separate forward pass through the pose network. Top-down methods also depend heavily on the quality of the person detector; missed detections or false positives directly affect pose estimation results.
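The top-down loop can be sketched as follows. `stub_pose_model` is a hypothetical stand-in for a real single-person network, boxes are assumed to be `(x, y, width, height)` in image pixels, and the crop is assumed to be resized to the model input without aspect-ratio padding.

```python
def crop_to_image(keypoints, box, input_size):
    """Map keypoints predicted in the resized crop back to image pixels."""
    x0, y0, box_w, box_h = box
    in_w, in_h = input_size
    return [(x0 + kx * box_w / in_w, y0 + ky * box_h / in_h)
            for kx, ky in keypoints]

def stub_pose_model(box, input_size):
    """Hypothetical pose network: always predicts one keypoint at the crop centre."""
    in_w, in_h = input_size
    return [(in_w / 2.0, in_h / 2.0)]

def top_down(boxes, input_size=(192, 256)):
    poses = []
    for box in boxes:  # one pose-network call per detected person
        crop_keypoints = stub_pose_model(box, input_size)
        poses.append(crop_to_image(crop_keypoints, box, input_size))
    return poses

boxes = [(100, 50, 96, 128), (300, 80, 48, 64)]  # hypothetical detector output
poses = top_down(boxes)
print(poses)
```

The loop over `boxes` is what makes the cost grow linearly with the number of people, and a box the detector misses never reaches the pose model at all.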

### Bottom-Up Approach

The bottom-up approach detects all keypoints in the image simultaneously, regardless of which person they belong to, and then groups them into individual skeletons using association algorithms. [OpenPose](/wiki/openpose) pioneered this paradigm with Part Affinity Fields (PAFs), which encode the direction and location of limb connections.[3] Because the network processes the entire image in a single forward pass, the OpenPose authors note their parsing step "maintains high accuracy while achieving realtime performance, irrespective of the number of people in the image."[3] This makes bottom-up methods more efficient in crowded scenes. However, the grouping step is more challenging, and bottom-up methods have traditionally been less accurate than top-down methods, especially for small or occluded persons.
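A simplified view of how a Part Affinity Field scores a candidate limb: sample the field along the segment between two detected keypoints and average its alignment with the segment direction. The toy field, the keypoint names, and the number of samples are assumptions for illustration, not the OpenPose implementation.

```python
import math

def paf_score(paf, start, end, num_samples=10):
    """Average dot product between the field and the limb direction along the segment."""
    dx, dy = end[0] - start[0], end[1] - start[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length
    total = 0.0
    for i in range(num_samples):
        t = i / (num_samples - 1)
        vx, vy = paf(start[0] + t * dx, start[1] + t * dy)
        total += vx * ux + vy * uy
    return total / num_samples

def toy_paf(x, y):
    """Hypothetical field: unit vector (1, 0) near a horizontal limb on y = 2."""
    return (1.0, 0.0) if 2 <= x <= 10 and abs(y - 2) <= 1 else (0.0, 0.0)

elbow = (2, 2)
true_wrist, other_wrist = (10, 2), (10, 8)
score_true = paf_score(toy_paf, elbow, true_wrist)
score_other = paf_score(toy_paf, elbow, other_wrist)
print(score_true, score_other)
```

The wrist that actually lies along the field receives the higher score, which is the signal the grouping step uses to decide which keypoints belong to the same person.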

## Key Models

The following sections describe several influential models that have shaped the development of pose estimation.

### DeepPose (2014)

DeepPose, proposed by Alexander Toshev and Christian Szegedy at Google, was the first work to apply deep neural networks to human pose estimation. Published at CVPR 2014, it formulated pose estimation as a DNN-based regression problem, using a cascade of regressors built on [AlexNet](/wiki/alexnet) to predict joint coordinates directly.[1] While its accuracy has since been surpassed, DeepPose demonstrated that end-to-end learning with deep networks could replace hand-crafted features for pose estimation.

### Stacked Hourglass Network (2016)

Alejandro Newell, Kaiyu Yang, and Jia Deng introduced the Stacked Hourglass Network at ECCV 2016.[2] The architecture features a series of "hourglass" modules that repeatedly downsample and upsample feature maps, capturing information at multiple scales. Intermediate supervision at each hourglass stage enables the network to progressively refine its predictions. The design achieved state-of-the-art results on the MPII and FLIC benchmarks and established heatmap regression as the dominant paradigm for 2D pose estimation.[2]

### OpenPose (2017)

OpenPose, developed by Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh at Carnegie Mellon University, was the first real-time multi-person 2D pose estimation system using a bottom-up approach. The method introduces Part Affinity Fields (PAFs), 2D vector fields that encode the location and orientation of limbs, to associate detected keypoints with specific individuals.[3]

The architecture uses the first 10 layers of a pretrained [VGG](/wiki/vgg)-19 network as a feature extractor. The extracted features pass through a multi-stage CNN with two branches: one branch iteratively refines PAFs for limb association, while the other refines confidence maps (heatmaps) for keypoint detection. A greedy bipartite matching algorithm then assembles the detected keypoints and limb associations into complete skeletons.[3]
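The greedy assembly step can be sketched for a single limb type. The score matrix below is hypothetical; in practice each entry would come from integrating the PAF between a pair of candidate keypoints, and the procedure is repeated for every limb in the skeleton.

```python
def greedy_match(scores):
    """Greedily pair candidates of two keypoint types by descending score.

    scores[i][j] is the association score between candidate i of the first
    keypoint type and candidate j of the second; each candidate is used once.
    """
    candidates = sorted(
        ((s, i, j)
         for i, row in enumerate(scores)
         for j, s in enumerate(row) if s > 0),
        reverse=True,
    )
    used_a, used_b, matches = set(), set(), []
    for s, i, j in candidates:
        if i not in used_a and j not in used_b:
            matches.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return sorted(matches)

# Two elbows and two wrists from two people (hypothetical scores).
scores = [[0.9, 0.2],
          [0.3, 0.8]]
matches = greedy_match(scores)
print(matches)  # prints: [(0, 0), (1, 1)]
```

Greedy selection is not guaranteed to find the optimal assignment, but it is fast and works well when correct pairs score clearly higher than incorrect ones.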

OpenPose won the inaugural COCO 2016 Keypoints Challenge and significantly outperformed prior methods on the MPII Multi-Person benchmark. The open-source OpenPose library, released by CMU's Perceptual Computing Lab, supports detection of body, foot, hand, and facial keypoints, jointly producing up to 135 keypoints on a single image, which made it one of the most widely adopted pose estimation tools in research and industry.[4]

### HRNet (2019)

Deep High-Resolution Representation Learning for Human Pose Estimation, known as HRNet, was proposed by Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang at CVPR 2019.[5] Unlike conventional architectures that encode the input into low-resolution representations and then recover high-resolution features through upsampling (as in the Stacked Hourglass or encoder-decoder designs), HRNet "maintains high-resolution representations through the whole process."[5]

The architecture starts with a single high-resolution subnetwork and gradually adds parallel lower-resolution subnetworks. These multi-resolution branches exchange information through repeated multi-scale fusion, where features from different resolutions are aggregated. This design preserves fine-grained spatial details that are critical for precise keypoint localization.

HRNet-W48 achieves approximately 75.5 AP on the COCO test-dev set (single model, standard input size).[5] Variants of HRNet (HRNet-W32, HRNet-W48) have been widely adopted as backbones for numerous vision tasks beyond pose estimation, including [semantic segmentation](/wiki/image_segmentation) and [object detection](/wiki/object_detection).

### ViTPose (2022)

ViTPose, published at [NeurIPS](/wiki/neurips) 2022 by Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao, demonstrated that plain, non-hierarchical [vision transformers](/wiki/vision_transformer) can serve as strong backbones for pose estimation.[6] The model uses a standard ViT backbone for feature extraction paired with a lightweight decoder.

ViTPose's key strength is scalability. The model can be configured from approximately 100 million to over 1 billion parameters. The largest variant, ViTPose-G (ViTAE-G backbone, about 1 billion parameters), achieves 80.9 AP on the COCO test-dev set as a single model, setting a new state-of-the-art record at the time of publication; an ensemble reached 81.1 AP, surpassing the previous best, UDP++, which had ensembled 17 models for 80.8 AP.[6] The follow-up work ViTPose++ (TPAMI 2023) extended the framework to generic body pose estimation across humans and animals.[7]

### RTMPose (2023)

RTMPose, developed by Tao Jiang, Peng Lu, Li Zhang, and colleagues at OpenMMLab, is a high-performance real-time pose estimation framework built on top of MMPose.[8] The work systematically explores factors affecting the speed-accuracy tradeoff, including model architecture, training strategy, and deployment optimization.

RTMPose adopts a top-down paradigm with a CSPNeXt backbone (originally designed for object detection) and a SimCC-based head that treats keypoint localization as a coordinate classification problem rather than heatmap regression. Instead of predicting 2D Gaussian heatmaps, SimCC discretizes the coordinate space into bins and classifies horizontal and vertical positions separately, using Gaussian label smoothing for training.[8]
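The coordinate-classification idea can be sketched in a few lines. The `split_ratio` (bins per pixel) and `sigma` values are assumptions chosen for illustration, and the encoder below only builds the Gaussian-smoothed training targets; a real model predicts the two 1D distributions from image features.

```python
import math

def simcc_encode(coord, size, split_ratio=2.0, sigma=6.0):
    """Gaussian-smoothed 1D classification target over discretized coordinate bins."""
    num_bins = int(size * split_ratio)
    center = coord * split_ratio
    return [math.exp(-((b - center) ** 2) / (2.0 * sigma ** 2))
            for b in range(num_bins)]

def simcc_decode(scores, split_ratio=2.0):
    """Pick the most likely bin and convert it back to pixel units."""
    best_bin = max(range(len(scores)), key=lambda b: scores[b])
    return best_bin / split_ratio

# Horizontal and vertical positions are classified independently.
x_target = simcc_encode(57.5, size=192)   # 384 bins along x
y_target = simcc_encode(100.0, size=256)  # 512 bins along y
x, y = simcc_decode(x_target), simcc_decode(y_target)
print(x, y)  # prints: 57.5 100.0
```

Using more than one bin per pixel gives sub-pixel resolution, and two 1D vectors are far cheaper to produce than a full-resolution 2D heatmap.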

RTMPose-m achieves 75.8% AP on COCO while running at over 90 FPS on an Intel i7-11700 CPU and over 430 FPS on an NVIDIA GTX 1660 Ti GPU. The smallest variant, RTMPose-s, achieves 72.2% AP on COCO with over 70 FPS on a Snapdragon 865 mobile chip, making it practical for edge deployment.[8]

### MediaPipe Pose

[MediaPipe](/wiki/mediapipe) Pose, developed by Google and built on the BlazePose model, is an on-device ML solution for real-time body pose tracking.[9] It uses a two-step detector-tracker pipeline: a lightweight detector first locates the person region of interest, and a tracker then predicts 33 body landmarks with 3D coordinates (x, y, z) and a visibility score. Because the tracker operates on a tight crop around the previously detected pose, the detector does not need to run on every frame. This keeps the pipeline efficient: the network produces its 33 keypoints at over 30 frames per second on a Pixel 2 phone.[9]

MediaPipe extends beyond body pose through a suite of related solutions:

- **MediaPipe Hands** detects 21 3D hand keypoints per hand using a palm detection model followed by a hand landmark model.
- **MediaPipe Face Mesh** estimates 468 3D facial landmarks in real time, reconstructing the approximate 3D facial surface from a single camera without a depth sensor.
- **MediaPipe Holistic** combines all three models into a unified pipeline that simultaneously predicts 543 landmarks: 33 body pose landmarks, 468 face landmarks, and 21 hand landmarks per hand.

MediaPipe's cross-platform support (Android, iOS, web, Python) and lightweight inference have made it one of the most accessible pose estimation solutions for application developers.

## Hand Pose Estimation

Hand pose estimation is a specialized subtask focused on detecting the 3D positions of finger joints and fingertips. A typical hand skeleton consists of 21 keypoints: one at the wrist and four per finger (metacarpophalangeal, proximal interphalangeal, distal interphalangeal, and tip joints). Hand pose estimation is significantly more challenging than body pose due to frequent self-occlusion between fingers, the small size of hands relative to the image, and the high degree of articulation with over 20 degrees of freedom.

Datasets for hand pose include the FreiHAND dataset, the Rendered Hand Pose Dataset (RHD), InterHand2.6M, and AssemblyHands (which contains 3.0 million annotated images for egocentric 3D hand pose). Models such as MediaPipe Hands, HaMeR (which uses a large-scale Vision [Transformer](/wiki/transformer)), and various regression-based approaches have advanced hand keypoint detection for applications in gesture recognition, [AR/VR](/wiki/augmented_reality) interaction, robotic manipulation, and sign language understanding.

## Face Landmark Detection

Face landmark detection identifies keypoints on facial features such as the eyebrows, eyes, nose, mouth contour, and jawline. The standard annotation scheme uses 68 landmarks as defined in the iBUG 300-W dataset. Dlib, a widely used C++ library, provides a pretrained 68-point facial landmark detector based on an ensemble of regression trees (Kazemi and Sullivan, 2014) that runs in approximately one millisecond.[13]

Deep learning approaches have extended face landmarking to 3D. The Face Alignment Network (FAN) by Bulat and Tzimiropoulos (2017) predicts 68 3D keypoints from a single image using a stacked hourglass architecture. Google's MediaPipe Face Mesh goes further, predicting 468 3D landmarks that approximate the full facial surface geometry. Face landmarks are used in applications such as face recognition, facial expression analysis, face tracking for [AR](/wiki/augmented_reality) filters, driver monitoring systems, and deepfake detection.

## 6DoF Object Pose Estimation

Beyond human and animal keypoints, a major branch of the field estimates the 6 degrees of freedom (6DoF) pose of rigid objects: their 3D rotation (3 DoF) plus 3D translation (3 DoF) in the camera coordinate frame, which together place a known object model in the scene.[18] The term "6DoF" is often abbreviated "6D." This is distinct from human keypoint estimation: instead of joints, the model recovers an object's full position and orientation, which is what a robot needs to grasp it.
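What a 6DoF pose does to an object model can be sketched with a rotation, a translation, and a pinhole projection. The rotation about the z-axis, the translation, and the camera intrinsics are hypothetical values chosen to keep the arithmetic easy to follow.

```python
import math

def rotation_z(angle_rad):
    """3x3 rotation matrix for a rotation about the camera z-axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def transform(R, t, p):
    """Apply the 6DoF pose (R, t) to a model-frame point: R @ p + t."""
    return tuple(sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3))

def project(p, focal=600.0, cx=320.0, cy=240.0):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    return (focal * p[0] / p[2] + cx, focal * p[1] / p[2] + cy)

model_point = (0.1, 0.0, 0.0)   # a point on the object, in metres
R = rotation_z(math.pi / 2)     # 3 rotational degrees of freedom (one used here)
t = (0.0, 0.0, 1.0)             # 3 translational degrees of freedom

camera_point = transform(R, t, model_point)
pixel = project(camera_point)
print(camera_point, pixel)
```

Pose estimation solves the inverse problem: given where model points appear in the image, recover the `R` and `t` that explain those observations.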

Classical pipelines establish 2D-to-3D correspondences between image features and a known 3D model, then solve for the pose with a Perspective-n-Point (PnP) algorithm, often inside a RANSAC loop for robustness to outliers. Deep learning methods such as PoseCNN, PVNet, and DenseFusion learn these correspondences or regress pose directly from RGB or RGB-D input. 6DoF pose estimation is described in the literature as "one of the key technologies for robotic grasping," supporting industrial bin-picking, [augmented reality](/wiki/augmented_reality), and [autonomous driving](/wiki/autonomous_vehicle), because it lets a robot adapt to objects with complex geometry and arbitrary orientation, including partially occluded items in cluttered environments.[18]

## Datasets

Several benchmark datasets have driven progress in pose estimation research.

| Dataset | Year | Type | Annotations | Images/Sequences | Keypoints |
|---|---|---|---|---|---|
| MPII Human Pose | 2014 | 2D body | Single and multi-person | ~25,000 images, ~40,000 annotated people | 16 body joints |
| COCO Keypoints | 2014 | 2D body | Multi-person, keypoint visibility flags | 200,000+ images, 250,000+ person instances | 17 body keypoints |
| Human3.6M | 2014 | 3D body | Marker-based motion capture | 3.6 million frames, 11 actors, 17 scenarios | 32 joints (commonly subsampled to 17) |
| 300-W | 2013 | 2D/3D face | Facial landmarks in the wild | 600 images (indoor + outdoor) | 68 facial landmarks |
| FreiHAND | 2019 | 3D hand | Hand pose and shape from single RGB | 130,000+ samples | 21 hand keypoints |
| InterHand2.6M | 2020 | 3D hand | Two-hand interaction poses | 2.6 million frames | 21 keypoints per hand |
| AP-10K | 2022 | 2D animal | Animal pose across 23 families, 54 species | 10,015 images | 17 animal keypoints |
| WFLW | 2018 | 2D face | Facial landmarks with attribute annotations | 10,000 faces | 98 facial landmarks |

### COCO Keypoints

The [COCO](/wiki/coco_dataset) (Common Objects in Context) dataset is the most widely used benchmark for multi-person 2D pose estimation.[17] It defines 17 keypoints organized into three body regions: head (nose, left eye, right eye, left ear, right ear), upper body (left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist), and lower body (left hip, right hip, left knee, right knee, left ankle, right ankle). Each keypoint is annotated with its (x, y) position and a visibility flag indicating whether it is visible, occluded, or outside the image. The dataset is split into approximately 57,000 training images, 5,000 validation images, and 20,000 test-dev images.

### MPII Human Pose

The MPII Human Pose dataset, released by the Max Planck Institute for Informatics in 2014, contains approximately 25,000 images with over 40,000 annotated individuals.[15] It provides annotations for 16 body joints and includes activity labels. MPII was the primary benchmark for single-person pose estimation before COCO became dominant.[15]

### Human3.6M

Human3.6M is one of the largest datasets for 3D human pose estimation and the most commonly used benchmark for the task. It contains 3.6 million video frames captured in a controlled indoor setting with a marker-based motion capture system.[14] Eleven professional actors (6 male, 5 female) perform 17 activities including walking, sitting, eating, discussion, and taking photos. The dataset provides synchronized multi-view video with accurate 3D joint positions, making it the standard benchmark for monocular 3D pose estimation methods.[14]

## How is pose estimation accuracy measured?

Pose estimation models are evaluated using several standard metrics, with Object Keypoint Similarity now serving as the dominant criterion on COCO.

### Percentage of Correct Keypoints (PCK)

PCK measures the fraction of predicted keypoints that fall within a specified distance threshold of the ground truth. The threshold is typically normalized by the size of the person. Common variants include:

- **PCK@0.2**: A keypoint is correct if the prediction is within 20% of the torso diameter.
- **PCKh@0.5**: A keypoint is correct if the prediction is within 50% of the head bone length. This variant is the standard metric on the MPII benchmark.
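Both variants reduce to the same computation with a different reference length. In the sketch below the keypoints and the head length are made-up values, and `reference_length` would be the torso diameter for PCK or the head bone length for PCKh.

```python
import math

def pck(pred, gt, reference_length, alpha):
    """Fraction of keypoints whose error is within alpha * reference_length."""
    threshold = alpha * reference_length
    correct = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= threshold)
    return correct / len(gt)

gt = [(100, 100), (150, 100), (150, 200)]
pred = [(103, 104), (150, 130), (151, 200)]   # errors: 5, 30, and 1 pixels
head_length = 40.0                            # hypothetical head bone length

pckh = pck(pred, gt, reference_length=head_length, alpha=0.5)
print(round(pckh, 3))  # prints: 0.667
```

With a 20-pixel threshold, two of the three keypoints count as correct; PCK ignores how far off the third one is.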

### Object Keypoint Similarity (OKS)

OKS is the standard metric for COCO keypoint evaluation. For each labeled keypoint, the Euclidean distance between the prediction and the ground truth is passed through an unnormalized Gaussian with standard deviation s*k, where s is the object scale (the square root of the segmented area) and k is a per-keypoint constant that accounts for annotation variance; the OKS of a pose is the average of these per-keypoint similarities over the labeled keypoints.[17] OKS values range from 0 to 1, where 1 indicates a perfect match. The per-keypoint constants were tuned by measuring annotator standard deviation on roughly 5,000 redundantly annotated validation images, which is why OKS treats easy-to-annotate joints (such as shoulders) more strictly than ambiguous ones (such as hips). OKS is more informative than PCK because it accounts for both the scale of the person and the inherent difficulty of localizing each specific keypoint.
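A minimal sketch of the OKS computation, averaged over labeled keypoints. The three sigma values are taken from the constants used in the COCO evaluation code for the nose, shoulder, and hip, with k = 2 * sigma following that code's convention; the coordinates and the area are made up, and the full evaluator handles further cases not shown here.

```python
import math

def oks(pred, gt, visibility, area, sigmas):
    """Object Keypoint Similarity averaged over labeled keypoints (v > 0)."""
    total, labeled = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visibility, sigmas):
        if v == 0:          # unlabeled keypoints do not contribute
            continue
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k = 2.0 * sigma     # per-keypoint constant
        total += math.exp(-d2 / (2.0 * area * k * k))   # area = s ** 2
        labeled += 1
    return total / labeled if labeled else 0.0

sigmas = [0.026, 0.079, 0.107]            # nose, shoulder, hip
gt = [(50, 20), (40, 60), (45, 140)]
visibility = [2, 2, 2]
area = 10000.0                            # object scale s = 100 pixels

shoulder_off = [(50, 20), (50, 60), (45, 140)]   # shoulder 10 px off
hip_off = [(50, 20), (40, 60), (55, 140)]        # hip 10 px off
print(oks(gt, gt, visibility, area, sigmas))
print(oks(shoulder_off, gt, visibility, area, sigmas))
print(oks(hip_off, gt, visibility, area, sigmas))
```

The same 10-pixel error costs more on the shoulder than on the hip, which is the per-keypoint strictness described above.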

### Average Precision (AP)

AP on the COCO benchmark uses OKS as the matching criterion. A predicted pose is considered a true positive if its OKS with the corresponding ground-truth instance exceeds a given threshold. AP is computed by averaging precision at multiple recall levels, similar to the approach used in [object detection](/wiki/object_detection). The primary metrics reported include:

- **AP** (averaged over OKS thresholds from 0.50 to 0.95 in steps of 0.05)
- **AP@0.50** (at the lenient OKS threshold of 0.50)
- **AP@0.75** (at the strict OKS threshold of 0.75)
- **AR** (Average [Recall](/wiki/recall), the maximum recall given a fixed number of detections)

### Mean Per Joint Position Error (MPJPE)

MPJPE is the standard metric for 3D pose estimation, particularly on Human3.6M.[14] It measures the average Euclidean distance (in millimeters) between predicted and ground-truth 3D joint positions after aligning the root joint (typically the pelvis). A variant called P-MPJPE (Procrustes-aligned MPJPE) first aligns the prediction to the ground truth with a similarity transform (rotation, translation, and uniform scaling) before computing the error, so it measures the accuracy of the pose shape independently of global position, orientation, and scale.
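Root-aligned MPJPE can be sketched as below. The three-joint skeleton and its coordinates are made up, the root is assumed to be joint 0, and the Procrustes variant is deliberately left out because it additionally needs a least-squares rotation and scale fit.

```python
import math

def mpjpe(pred, gt, root_index=0):
    """Mean per-joint position error after translating both poses to the root joint."""
    def center(joints):
        root = joints[root_index]
        return [tuple(c - r for c, r in zip(joint, root)) for joint in joints]
    pred_c, gt_c = center(pred), center(gt)
    return sum(math.dist(p, g) for p, g in zip(pred_c, gt_c)) / len(gt)

# Coordinates in millimetres: pelvis, spine, head (hypothetical skeleton).
gt = [(0.0, 0.0, 0.0), (0.0, 500.0, 0.0), (0.0, 900.0, 0.0)]
# The prediction is shifted globally by (100, 0, 50); only the head is 30 mm off.
pred = [(100.0, 0.0, 50.0), (100.0, 500.0, 50.0), (100.0, 900.0, 80.0)]

error_mm = mpjpe(pred, gt)
print(error_mm)  # prints: 10.0
```

Root alignment removes the global offset, so only the 30 mm head error remains, averaged over three joints.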

## Comparison of Pose Estimation Models

| Model | Year | Approach | Backbone | COCO AP (test-dev) | Key Innovation |
|---|---|---|---|---|---|
| DeepPose | 2014 | Top-down, regression | AlexNet | N/A (pre-COCO benchmark) | First DNN-based pose estimation |
| Stacked Hourglass | 2016 | Top-down, heatmap | Hourglass modules | N/A (MPII benchmark) | Multi-scale feature processing with intermediate supervision |
| OpenPose | 2017 | Bottom-up, heatmap + PAFs | VGG-19 (first 10 layers) | Won COCO 2016 Keypoints Challenge | Part Affinity Fields for multi-person association |
| SimpleBaseline | 2018 | Top-down, heatmap | [ResNet](/wiki/resnet) | 73.7 (ResNet-152) | Simple deconvolution head proves effective |
| HRNet-W48 | 2019 | Top-down, heatmap | HRNet | ~75.5 | Maintains high-resolution representations throughout |
| HigherHRNet | 2020 | Bottom-up, heatmap | HRNet | 70.5 | Multi-resolution heatmap aggregation for bottom-up |
| ViTPose-H | 2022 | Top-down, heatmap | ViT-Huge | 79.1 | Plain vision transformer backbone |
| ViTPose-G | 2022 | Top-down, heatmap | ViTAE-G (1B params) | 80.9 | Scaling to 1 billion parameters |
| RTMPose-m | 2023 | Top-down, SimCC | CSPNeXt | 75.8 | Real-time with coordinate classification; 90+ FPS on CPU |
| RTMPose-l | 2023 | Top-down, SimCC | CSPNeXt | 76.3 | Larger variant with higher accuracy |

## What is pose estimation used for?

Pose estimation underpins a broad set of applications across sports, healthcare, entertainment, robotics, and surveillance.

### Sports Analytics

Pose estimation enables detailed biomechanical analysis of athletic performance without the need for expensive marker-based motion capture suits. Coaches and analysts can track joint angles, stride length, hip elevation during jumps, and other kinematic metrics from standard video. AI-powered systems can detect deviations in throwing or running mechanics that may indicate injury risk. Professional sports organizations use pose estimation for player tracking, technique comparison, and performance optimization.

### Healthcare and Rehabilitation

In clinical settings, pose estimation supports objective gait analysis, posture assessment, and range-of-motion measurement. Physical therapists can use vision-based systems to monitor patients performing rehabilitation exercises remotely, ensuring correct form and tracking recovery progress over time. This is particularly valuable for telehealth applications where in-person supervision is not feasible. Pose-based analysis has been applied to conditions such as Parkinson's disease, stroke recovery, scoliosis screening, and fall risk assessment in elderly populations.

### Augmented and Virtual Reality

Pose estimation is a core technology for [AR](/wiki/augmented_reality) and [VR](/wiki/virtual_reality) experiences. Body tracking allows virtual avatars to mirror the user's movements in real time. Hand pose estimation enables natural hand-based interaction with virtual objects, replacing the need for handheld controllers. Face landmark detection powers AR filters and face effects on platforms such as Snapchat and Instagram. Meta's Quest headsets, Apple Vision Pro, and other XR devices rely on pose estimation for body, hand, and eye tracking.

### Robotics and Teleoperation

Pose estimation lets a robot perceive both objects and human operators. 6DoF object pose feeds robotic grasping and bin-picking, while human pose estimation drives teleoperation and imitation learning for [humanoid robots](/wiki/humanoid_robot). The H2O (Human2Humanoid) system, introduced in 2024, achieved the first learning-based real-time whole-body humanoid teleoperation using only an RGB camera and a pose estimator to capture the operator's motion, which a sim-to-real reinforcement learning policy then mimics on a full-sized robot.[19] Such pipelines turn ordinary video of human motion into robot-executable actions, a key ingredient for scaling up demonstration data for embodied AI.

### Sign Language Recognition

Pose estimation provides a privacy-preserving, person-independent representation for sign language recognition systems. By extracting skeleton keypoints from the signer's body, hands, and face, the system converts visual information into a compact, structured format that can be processed by sequence models such as [LSTMs](/wiki/rnn) or [transformers](/wiki/attention). Research has demonstrated that pose-based features can achieve competitive accuracy for both isolated sign recognition and continuous sign language translation. MediaPipe Holistic, which simultaneously detects body, hand, and face landmarks, has become a popular front-end for sign language recognition pipelines.

### Fitness and Wellness Applications

Consumer fitness applications use pose estimation to count repetitions, measure joint angles during exercises, and provide real-time feedback on exercise form. Products such as Apple Fitness+ and various AI coaching apps leverage on-device pose models to guide users through workouts without additional hardware. These systems can detect common form errors (for example, knees caving inward during squats) and alert users to reduce injury risk.
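Measuring a joint angle from three keypoints is a small geometric calculation. The hip, knee, and ankle coordinates below are made-up 2D pixel positions; a real application would read them from a pose model and usually smooth them over time.

```python
import math

def joint_angle(a, b, c):
    """Angle at b, in degrees, between the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    return math.degrees(math.atan2(abs(cross), dot))

hip, knee = (0.0, 0.0), (0.0, 1.0)
straight_leg = joint_angle(hip, knee, (0.0, 2.0))   # ankle directly below the knee
bent_leg = joint_angle(hip, knee, (1.0, 1.0))       # ankle level with the knee
print(straight_leg, bent_leg)  # prints: 180.0 90.0
```

A repetition counter can then be built by tracking when the knee angle crosses a lower and an upper threshold in sequence.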

### Animation and Motion Capture

Traditional motion capture for film and video game production requires actors to wear suits with reflective markers and perform in specialized studios with many calibrated cameras. Deep learning-based pose estimation has enabled markerless motion capture, where 3D poses are estimated directly from standard video footage. Tools like Move.ai, Plask, and Rokoko Video use neural networks to convert ordinary video into 3D animation data. This democratizes motion capture by eliminating the need for expensive hardware, though marker-based systems still offer higher precision for productions with strict accuracy requirements.

### Action Recognition and Surveillance

Skeleton-based [action recognition](/wiki/action_recognition) uses pose estimation as a preprocessing step, extracting keypoint sequences that are then classified into activity categories. Because skeleton representations are compact and abstract away appearance details (clothing, background, lighting), they generalize well across different environments. Applications include monitoring for falls in elderly care facilities, detecting aggressive behavior in public spaces, and understanding human activities in autonomous driving scenarios.

## Animal Pose Estimation

Animal pose estimation extends keypoint detection to non-human species, supporting research in neuroscience, ethology, ecology, and veterinary medicine. The field faces unique challenges compared to human pose: animals exhibit far greater morphological diversity, annotated datasets are smaller, and occlusion patterns differ significantly.

### DeepLabCut

DeepLabCut, developed by Mathis et al. (2018) and published in Nature Neuroscience, is the most widely adopted framework for animal pose estimation.[10] Originally built on top of DeeperCut (a human pose model), it allows researchers to train custom keypoint detectors from a small number of labeled frames.[10] DeepLabCut supports multi-animal pose estimation and tracking, enabling studies of social behavior and group dynamics. The SuperAnimal models introduced in DeepLabCut 3.0 provide pretrained foundation models that work across over 45 species without additional manual labeling and are 10 to 100 times more data-efficient than prior transfer learning approaches when fine-tuned.

### AP-10K Dataset

The AP-10K dataset (2022) is a benchmark for animal pose estimation containing 10,015 images spanning 23 animal families and 54 species.[16] It defines 17 keypoints per animal, following a general quadruped skeleton.[16] AP-10K has been integrated into DeepLabCut and other toolkits for benchmarking cross-species pose models.

ViTPose++ extended the ViTPose framework to animal pose estimation, demonstrating that transformer-based architectures trained on large-scale human data can transfer effectively to animal keypoint detection with appropriate fine-tuning.[7]

## Challenges and Future Directions

Despite significant progress, several open challenges remain in pose estimation:

- **Occlusion handling**: When keypoints are heavily occluded by other people, objects, or self-occlusion, current models still struggle to produce accurate predictions. Incorporating temporal information from video and reasoning about body structure can help.
- **Domain generalization**: Models trained on standard benchmarks may not transfer well to novel domains such as underwater footage, infrared imaging, or extreme camera angles. Domain adaptation and self-supervised learning are active areas of research.
- **Whole-body pose estimation**: Predicting body, hand, and face keypoints simultaneously in a unified model remains challenging due to the large scale difference between body joints and fine-grained hand/face landmarks. The COCO-WholeBody dataset defines 133 keypoints for this task, and models such as RTMW are designed to predict them jointly.
- **Real-time 3D pose**: While 2D pose estimation runs in real time on commodity hardware, accurate monocular 3D pose estimation at comparable speeds remains difficult, especially for multi-person scenarios.
- **Privacy concerns**: Pose estimation in surveillance and public spaces raises ethical questions about tracking and identification. Even skeleton-only data is not fully anonymous, because individuals can sometimes be re-identified from it, so responsible deployment requires careful consideration of privacy implications.
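
As a rough illustration of the temporal-information idea mentioned under occlusion handling, the sketch below fills in low-confidence detections of a single keypoint by linear interpolation between the nearest confident frames. It is a minimal baseline under stated assumptions, not the method of any particular model: the function name `fill_occluded`, the `(x, y, confidence)` track format, and the 0.3 confidence threshold are all hypothetical choices made for the example.

```python
def fill_occluded(track, min_conf=0.3):
    """Interpolate one keypoint's position across low-confidence frames.

    track: list of (x, y, confidence) tuples, one per video frame.
    Returns a list of (x, y) tuples of the same length. Gaps at the start
    or end of the track are held at the nearest confident position.
    """
    good = [i for i, (_, _, c) in enumerate(track) if c >= min_conf]
    if not good:
        # Nothing reliable to interpolate from; return detections unchanged.
        return [(x, y) for x, y, _ in track]

    filled = []
    for i, (x, y, conf) in enumerate(track):
        if conf >= min_conf:
            filled.append((x, y))
            continue
        prev_i = max((g for g in good if g < i), default=None)
        next_i = min((g for g in good if g > i), default=None)
        if prev_i is None:
            filled.append(track[next_i][:2])
        elif next_i is None:
            filled.append(track[prev_i][:2])
        else:
            # Linear blend between the two surrounding confident frames.
            t = (i - prev_i) / (next_i - prev_i)
            x0, y0, _ = track[prev_i]
            x1, y1, _ = track[next_i]
            filled.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return filled


# Example: the wrist is occluded in the middle frame (confidence 0.1).
wrist_track = [(0.0, 0.0, 0.9), (99.0, 99.0, 0.1), (10.0, 20.0, 0.9)]
smoothed = fill_occluded(wrist_track)
```

Linear interpolation only works for short gaps and roughly smooth motion; longer occlusions generally need a learned temporal model or explicit body-structure constraints.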

## References

1. Toshev, A. and Szegedy, C. "DeepPose: Human Pose Estimation via Deep Neural Networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
2. Newell, A., Yang, K., and Deng, J. "Stacked Hourglass Networks for Human Pose Estimation." European Conference on Computer Vision (ECCV), 2016.
3. Cao, Z., Simon, T., Wei, S.-E., and Sheikh, Y. "Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. arXiv:1611.08050.
4. Cao, Z., Hidalgo, G., Simon, T., Wei, S.-E., and Sheikh, Y. "OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields." IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 172-186, 2021. arXiv:1812.08008.
5. Sun, K., Xiao, B., Liu, D., and Wang, J. "Deep High-Resolution Representation Learning for Human Pose Estimation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5693-5703.
6. Xu, Y., Zhang, J., Zhang, Q., and Tao, D. "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation." Advances in Neural Information Processing Systems (NeurIPS), 2022. arXiv:2204.12484.
7. Xu, Y., Zhang, J., Zhang, Q., and Tao, D. "ViTPose++: Vision Transformer for Generic Body Pose Estimation." IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023.
8. Jiang, T., Lu, P., Zhang, L., Ma, N., Han, R., Lyu, C., Li, Y., and Chen, K. "RTMPose: Real-Time Multi-Person Pose Estimation Based on MMPose." arXiv preprint arXiv:2303.07399, 2023.
9. Bazarevsky, V., Grishchenko, I., Raveendran, K., Zhu, T., Zhang, F., and Grundmann, M. "BlazePose: On-device Real-time Body Pose Tracking." CVPR Workshop on Computer Vision for Augmented and Virtual Reality, 2020. arXiv:2006.10204.
10. Mathis, A., Mamidanna, P., Cury, K.M., Abe, T., Murthy, V.N., Mathis, M.W., and Bethge, M. "DeepLabCut: Markerless Pose Estimation of User-Defined Body Parts with Deep Learning." Nature Neuroscience, vol. 21, pp. 1281-1289, 2018.
11. Martinez, J., Hossain, R., Romero, J., and Little, J.J. "A Simple Yet Effective Baseline for 3D Human Pose Estimation." Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
12. Xiao, B., Wu, H., and Wei, Y. "Simple Baselines for Human Pose Estimation and Tracking." European Conference on Computer Vision (ECCV), 2018.
13. Kazemi, V. and Sullivan, J. "One Millisecond Face Alignment with an Ensemble of Regression Trees." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
14. Ionescu, C., Papava, D., Olaru, V., and Sminchisescu, C. "Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments." IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 7, pp. 1325-1339, 2014.
15. Andriluka, M., Pishchulin, L., Gehler, P., and Schiele, B. "2D Human Pose Estimation: New Benchmark and State of the Art Analysis." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
16. Yu, H., Xu, Y., Zhang, J., Zhao, W., Guan, Z., and Tao, D. "AP-10K: A Benchmark for Animal Pose Estimation in the Wild." Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2021. arXiv:2108.12617.
17. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C.L. "Microsoft COCO: Common Objects in Context." European Conference on Computer Vision (ECCV), 2014. Keypoint evaluation (OKS): https://cocodataset.org/#keypoints-eval
18. Guan, J. et al. "A Survey of 6DoF Object Pose Estimation Methods for Different Application Scenarios." Sensors, vol. 24, no. 4, 1076, 2024. https://www.mdpi.com/1424-8220/24/4/1076
19. He, T., Luo, Z., Xiao, W., Zhang, C., Kitani, K., Liu, C., and Shi, G. "Learning Human-to-Humanoid Real-Time Whole-Body Teleoperation (H2O)." arXiv preprint arXiv:2403.04436, 2024. https://human2humanoid.com/

