Pose estimation is a computer vision task that involves detecting and localizing anatomical keypoints (also called landmarks or joints) of a human body, hand, or face in images and video. Given an input image, a pose estimation system outputs a set of 2D or 3D coordinates representing the positions of predefined keypoints such as elbows, wrists, knees, or facial features. The resulting skeleton representation captures the spatial configuration of a person's body and can be used to analyze movement, recognize actions, and drive downstream applications ranging from sports analytics to augmented reality.
Pose estimation has become one of the most actively researched problems in computer vision since the introduction of deep learning-based methods in 2014. Modern models achieve high accuracy in real time, enabling deployment on mobile devices and edge hardware. This article covers the fundamentals of 2D and 3D pose estimation, key architectural paradigms, landmark models, evaluation metrics, major datasets, and applications.
2D pose estimation aims to predict the pixel coordinates (x, y) of each keypoint from a single RGB image. The output is a 2D skeleton that overlays the original image. Most modern approaches use a convolutional neural network (CNN) or vision transformer (ViT) backbone to extract features from the input image, followed by a decoder head that produces either heatmaps or direct coordinate predictions.
The dominant paradigm in 2D pose estimation represents each keypoint as a 2D Gaussian heatmap. The network outputs one heatmap per keypoint, where the intensity at each spatial location reflects the confidence that the keypoint is located there. The final keypoint position is obtained by finding the peak (argmax) of each heatmap. This approach was popularized by Tompson et al. (2014) and became the standard after the success of the Stacked Hourglass network (Newell et al., 2016).
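The encode/decode cycle can be sketched in a few lines. The heatmap size (64×48) and σ = 2 below are illustrative choices, not tied to any particular model:

```python
import numpy as np

def render_heatmap(x, y, height, width, sigma=2.0):
    """Render a 2D Gaussian heatmap centred on a keypoint at (x, y)."""
    xs = np.arange(width)
    ys = np.arange(height)[:, None]
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))

def decode_heatmap(heatmap):
    """Recover the keypoint position as the argmax (peak) of the heatmap."""
    idx = np.argmax(heatmap)
    y, x = np.unravel_index(idx, heatmap.shape)
    return int(x), int(y)

hm = render_heatmap(30, 18, height=64, width=48)
print(decode_heatmap(hm))  # (30, 18)
```

In practice the network predicts one such map per keypoint, and training minimizes a pixel-wise loss (typically MSE) against these rendered targets.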
An alternative approach directly regresses the (x, y) coordinates of each keypoint from image features. DeepPose (Toshev and Szegedy, 2014) was the first work to formulate pose estimation as a direct regression problem using deep neural networks. While regression methods are simpler, heatmap-based methods generally achieve higher accuracy because heatmaps preserve spatial information and are easier to optimize. However, recent coordinate classification methods such as SimCC have narrowed this gap.
3D pose estimation extends the task to predict (x, y, z) coordinates, recovering the depth dimension that is lost in 2D projections. There are two primary strategies:
Lifting-based methods first run a 2D pose estimator to detect keypoints, then use a separate network to "lift" the 2D skeleton into 3D space. This two-stage approach leverages the strength of mature 2D detectors and the availability of large 2D training datasets. Martinez et al. (2017) demonstrated that a simple fully connected network could achieve competitive 3D pose accuracy when given accurate 2D keypoints as input.
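A minimal numpy sketch of such a lifter shows the shape of the mapping: 17 joints × 2 coordinates in, 17 × 3 out. The weights here are random and untrained, and the layer sizes follow the spirit of Martinez et al. rather than their exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_lifter(n_joints=17, hidden=1024):
    """Random (untrained) weights for a 2-layer fully connected lifter."""
    return {
        "w1": rng.standard_normal((n_joints * 2, hidden)) * 0.01,
        "b1": np.zeros(hidden),
        "w2": rng.standard_normal((hidden, n_joints * 3)) * 0.01,
        "b2": np.zeros(n_joints * 3),
    }

def lift(params, pose_2d):
    """Map a flattened 2D skeleton (17 x 2) to a 3D skeleton (17 x 3)."""
    x = pose_2d.reshape(-1)                             # (34,)
    h = np.maximum(0, x @ params["w1"] + params["b1"])  # ReLU hidden layer
    return (h @ params["w2"] + params["b2"]).reshape(-1, 3)

params = init_lifter()
pose_3d = lift(params, rng.standard_normal((17, 2)))
print(pose_3d.shape)  # (17, 3)
```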
End-to-end methods directly predict 3D joint positions from raw images without an intermediate 2D step. These methods typically require training data with ground-truth 3D annotations, which are more expensive to acquire and often collected using multi-camera motion capture systems.
3D pose estimation has additional challenges, including depth ambiguity (multiple 3D configurations can produce the same 2D projection), self-occlusion, and the limited availability of in-the-wild 3D annotated data.
When multiple people appear in an image, the system must detect all individuals and assign keypoints to the correct person. Two paradigms address this challenge.
The top-down approach first uses a person detector (such as Faster R-CNN or YOLO) to locate bounding boxes around each individual, then applies a single-person pose estimator independently within each cropped region. This paradigm generally achieves higher accuracy because the pose model can focus on one person at a time with normalized scale. The drawback is that runtime scales linearly with the number of people in the image, since every detected person requires a separate forward pass through the pose network. Top-down methods also depend heavily on the quality of the person detector; missed detections or false positives directly affect pose estimation results.
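The control flow of the top-down paradigm can be sketched as follows; `detect_people` and `estimate_pose` are hypothetical stand-ins for a real person detector and single-person pose model:

```python
import numpy as np

def top_down_pose(image, detect_people, estimate_pose):
    """Top-down pipeline: one pose forward pass per detected person box."""
    poses = []
    for x0, y0, x1, y1 in detect_people(image):
        crop = image[y0:y1, x0:x1]          # normalise scale per person
        kpts = estimate_pose(crop)          # keypoints in crop coordinates
        kpts = kpts + np.array([x0, y0])    # map back to image coordinates
        poses.append(kpts)
    return poses

# Dummy stand-ins for a real detector and single-person pose model.
image = np.zeros((100, 100, 3))
detect = lambda img: [(10, 20, 50, 80), (60, 10, 90, 90)]   # two people
estimate = lambda crop: np.zeros((17, 2))                   # 17 keypoints
print(len(top_down_pose(image, detect, estimate)))  # 2
```

The loop over detections is exactly why runtime scales linearly with the number of people.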
The bottom-up approach detects all keypoints in the image simultaneously, regardless of which person they belong to, and then groups them into individual skeletons using association algorithms. OpenPose pioneered this paradigm with Part Affinity Fields (PAFs), which encode the direction and location of limb connections. Because the network processes the entire image in a single forward pass, runtime is largely independent of the number of people, making bottom-up methods more efficient in crowded scenes. However, the grouping step is more challenging, and bottom-up methods have traditionally been less accurate than top-down methods, especially for small or occluded persons.
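The core of PAF-based association is a line integral: sample the field along a candidate limb and measure how well its vectors align with the limb direction. A simplified sketch, omitting OpenPose's alignment-count criterion and the bipartite matching step:

```python
import numpy as np

def paf_score(paf, p1, p2, n_samples=10):
    """Score a candidate limb between keypoints p1 and p2 by integrating
    the Part Affinity Field (a 2-channel vector field) along the segment."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    v = p2 - p1
    norm = np.linalg.norm(v)
    if norm == 0:
        return 0.0
    v = v / norm                                      # unit limb direction
    ts = np.linspace(0, 1, n_samples)
    pts = p1[None] + ts[:, None] * (p2 - p1)[None]    # sample along segment
    xs, ys = pts[:, 0].astype(int), pts[:, 1].astype(int)
    field = paf[ys, xs]                               # (n_samples, 2) vectors
    return float(np.mean(field @ v))                  # average alignment

# A toy PAF pointing along +x everywhere: a horizontal limb scores ~1.
paf = np.zeros((32, 32, 2))
paf[..., 0] = 1.0
print(paf_score(paf, (2, 16), (28, 16)))  # 1.0
```

High scores indicate that a keypoint pair is connected by a limb of one person; low or negative scores reject spurious pairings across different people.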
The following sections describe several influential models that have shaped the development of pose estimation.
DeepPose, proposed by Alexander Toshev and Christian Szegedy at Google, was the first work to apply deep neural networks to human pose estimation. Published at CVPR 2014, it formulated pose estimation as a DNN-based regression problem, using a cascade of regressors built on AlexNet to predict joint coordinates directly. While its accuracy has since been surpassed, DeepPose demonstrated that end-to-end learning with deep networks could replace hand-crafted features for pose estimation.
Alejandro Newell, Kaiyu Yang, and Jia Deng introduced the Stacked Hourglass Network at ECCV 2016. The architecture features a series of "hourglass" modules that repeatedly downsample and upsample feature maps, capturing information at multiple scales. Intermediate supervision at each hourglass stage enables the network to progressively refine its predictions. The design achieved state-of-the-art results on the MPII and FLIC benchmarks and established heatmap regression as the dominant paradigm for 2D pose estimation.
OpenPose, developed by Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh at Carnegie Mellon University, was the first real-time multi-person 2D pose estimation system using a bottom-up approach. The method introduces Part Affinity Fields (PAFs), 2D vector fields that encode the location and orientation of limbs, to associate detected keypoints with specific individuals.
The architecture uses the first 10 layers of a pretrained VGG-19 network as a feature extractor. The extracted features pass through a multi-stage CNN with two branches: one branch iteratively refines PAFs for limb association, while the other refines confidence maps (heatmaps) for keypoint detection. A greedy bipartite matching algorithm then assembles the detected keypoints and limb associations into complete skeletons.
OpenPose won the inaugural COCO 2016 Keypoints Challenge and significantly outperformed prior methods on the MPII Multi-Person benchmark. The open-source OpenPose library supports detection of body, foot, hand, and facial keypoints, making it one of the most widely adopted pose estimation tools in research and industry.
Deep High-Resolution Representation Learning for Human Pose Estimation, known as HRNet, was proposed by Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang at CVPR 2019. Unlike conventional architectures that encode the input into low-resolution representations and then recover high-resolution features through upsampling (as in the Stacked Hourglass or encoder-decoder designs), HRNet maintains a high-resolution representation throughout the entire network.
The architecture starts with a single high-resolution subnetwork and gradually adds parallel lower-resolution subnetworks. These multi-resolution branches exchange information through repeated multi-scale fusion, where features from different resolutions are aggregated. This design preserves fine-grained spatial details that are critical for precise keypoint localization.
HRNet-W48 achieves approximately 75.5 AP on the COCO test-dev set (single model, standard input size). Variants of HRNet (HRNet-W32, HRNet-W48) have been widely adopted as backbones for numerous vision tasks beyond pose estimation, including semantic segmentation and object detection.
ViTPose, published at NeurIPS 2022 by Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao, demonstrated that plain, non-hierarchical vision transformers can serve as strong backbones for pose estimation. The model uses a standard ViT backbone for feature extraction paired with a lightweight decoder.
ViTPose's key strength is scalability. The model can be configured from approximately 100 million to over 1 billion parameters. The largest variant, ViTPose-G, achieves 80.9 AP on the COCO test-dev set as a single model, setting a new state-of-the-art record at the time of publication. The follow-up work ViTPose++ (TPAMI 2023) extended the framework to generic body pose estimation across humans and animals.
RTMPose, developed by Tao Jiang, Peng Lu, Li Zhang, and colleagues at OpenMMLab, is a high-performance real-time pose estimation framework built on top of MMPose. The work systematically explores factors affecting the speed-accuracy tradeoff, including model architecture, training strategy, and deployment optimization.
RTMPose adopts a top-down paradigm with a CSPNeXt backbone (originally designed for object detection) and a SimCC-based head that treats keypoint localization as a coordinate classification problem rather than heatmap regression. Instead of predicting 2D Gaussian heatmaps, SimCC discretizes the coordinate space into bins and classifies horizontal and vertical positions separately, using Gaussian label smoothing for training.
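A toy sketch of the SimCC encode/decode cycle; the bin granularity and smoothing σ here are illustrative, not RTMPose's actual hyperparameters:

```python
import numpy as np

def simcc_label(coord, length, bins_per_pixel=2, sigma=6.0):
    """Gaussian-smoothed 1D classification target for one coordinate axis."""
    n_bins = length * bins_per_pixel
    centres = np.arange(n_bins)
    target = np.exp(-((centres - coord * bins_per_pixel) ** 2) / (2 * sigma ** 2))
    return target / target.sum()

def simcc_decode(logits_x, logits_y, bins_per_pixel=2):
    """Decode (x, y) as the argmax of each 1D distribution."""
    return (float(np.argmax(logits_x)) / bins_per_pixel,
            float(np.argmax(logits_y)) / bins_per_pixel)

tx = simcc_label(30.0, length=192)
ty = simcc_label(18.0, length=256)
print(simcc_decode(tx, ty))  # (30.0, 18.0)
```

Splitting each pixel into several bins gives sub-pixel localization while keeping the output a pair of cheap 1D classifications instead of a full 2D heatmap.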
RTMPose-m achieves 75.8% AP on COCO while running at over 90 FPS on an Intel i7-11700 CPU and over 430 FPS on an NVIDIA GTX 1660 Ti GPU. The smallest variant, RTMPose-s, achieves 72.2% AP on COCO with over 70 FPS on a Snapdragon 865 mobile chip, making it practical for edge deployment.
MediaPipe Pose, developed by Google, is an on-device ML solution for real-time body pose tracking. It uses a two-step detector-tracker pipeline: a lightweight detector first locates the person region of interest, and a tracker then predicts 33 body landmarks with 3D coordinates (x, y, z) and a visibility score. Because the tracker operates on a tight crop around the previously detected pose, it avoids running the detector on every frame, resulting in efficient real-time performance on mobile devices.
MediaPipe extends beyond body pose through a suite of related solutions:

- **MediaPipe Hands** detects 21 3D landmarks per hand using a palm detector followed by a hand landmark model.
- **MediaPipe Face Mesh** predicts 468 3D landmarks approximating the full facial surface.
- **MediaPipe Holistic** combines body, hand, and face landmark detection in a single pipeline.
MediaPipe's cross-platform support (Android, iOS, web, Python) and lightweight inference have made it one of the most accessible pose estimation solutions for application developers.
Hand pose estimation is a specialized subtask focused on detecting the 3D positions of finger joints and fingertips. A typical hand skeleton consists of 21 keypoints: one at the wrist and four per finger, namely the metacarpophalangeal, proximal interphalangeal, and distal interphalangeal joints plus the fingertip (the thumb instead contributes its carpometacarpal, metacarpophalangeal, and interphalangeal joints plus the tip). Hand pose estimation is significantly more challenging than body pose due to frequent self-occlusion between fingers, the small size of hands relative to the image, and the high degree of articulation with over 20 degrees of freedom.
Datasets for hand pose include the FreiHAND dataset, the Rendered Hand Pose Dataset (RHD), InterHand2.6M, and AssemblyHands (which contains 3.0 million annotated images for egocentric 3D hand pose). Models such as MediaPipe Hands, HaMeR (which uses a large-scale Vision Transformer), and various regression-based approaches have advanced hand keypoint detection for applications in gesture recognition, AR/VR interaction, robotic manipulation, and sign language understanding.
Face landmark detection identifies keypoints on facial features such as the eyebrows, eyes, nose, mouth contour, and jawline. The standard annotation scheme uses 68 landmarks as defined in the iBUG 300-W dataset. Dlib, a widely used C++ library, provides a pretrained 68-point facial landmark detector based on an ensemble of regression trees (Kazemi and Sullivan, 2014) that runs in approximately one millisecond.
Deep learning approaches have extended face landmarking to 3D. The Face Alignment Network (FAN) by Bulat and Tzimiropoulos (2017) predicts 68 3D keypoints from a single image using a stacked hourglass architecture. Google's MediaPipe Face Mesh goes further, predicting 468 3D landmarks that approximate the full facial surface geometry. Face landmarks are used in applications such as face recognition, facial expression analysis, face tracking for AR filters, driver monitoring systems, and deepfake detection.
Several benchmark datasets have driven progress in pose estimation research.
| Dataset | Year | Type | Annotations | Images/Sequences | Keypoints |
|---|---|---|---|---|---|
| MPII Human Pose | 2014 | 2D body | Single and multi-person | ~25,000 images, ~40,000 annotated people | 16 body joints |
| COCO Keypoints | 2014 | 2D body | Multi-person, keypoint visibility flags | 200,000+ images, 250,000+ person instances | 17 body keypoints |
| Human3.6M | 2014 | 3D body | Marker-based motion capture | 3.6 million frames, 11 actors, 17 scenarios | 32 joints (commonly subsampled to 17) |
| 300-W | 2013 | 2D/3D face | Facial landmarks in the wild | 600 images (indoor + outdoor) | 68 facial landmarks |
| FreiHAND | 2019 | 3D hand | Hand pose and shape from single RGB | 130,000+ samples | 21 hand keypoints |
| InterHand2.6M | 2020 | 3D hand | Two-hand interaction poses | 2.6 million frames | 21 keypoints per hand |
| AP-10K | 2021 | 2D animal | Animal pose across 23 families, 54 species | 10,015 images | 17 animal keypoints |
| WFLW | 2018 | 2D face | Facial landmarks with attribute annotations | 10,000 faces | 98 facial landmarks |
The COCO (Common Objects in Context) dataset is the most widely used benchmark for multi-person 2D pose estimation. It defines 17 keypoints organized into three body regions: head (nose, left eye, right eye, left ear, right ear), upper body (left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist), and lower body (left hip, right hip, left knee, right knee, left ankle, right ankle). Each keypoint is annotated with its (x, y) position and a visibility flag (0 = not labeled, 1 = labeled but not visible, 2 = labeled and visible). The dataset is split into approximately 57,000 training images, 5,000 validation images, and 20,000 test-dev images.
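A small helper showing how COCO's flat keypoint annotation format maps onto this scheme:

```python
import numpy as np

# The 17 COCO keypoint names, in annotation order.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def parse_keypoints(flat):
    """COCO stores keypoints as a flat [x1, y1, v1, x2, y2, v2, ...] list;
    reshape into coordinates plus visibility flags
    (0 = not labeled, 1 = labeled but not visible, 2 = visible)."""
    kpts = np.asarray(flat, float).reshape(-1, 3)
    return kpts[:, :2], kpts[:, 2].astype(int)

flat = [0.0, 0.0, 0.0] * 17
flat[0:3] = [120.0, 80.0, 2.0]          # visible nose at (120, 80)
xy, vis = parse_keypoints(flat)
print(xy.shape, vis[0])  # (17, 2) 2
```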
The MPII Human Pose dataset, released by the Max Planck Institute for Informatics in 2014, contains approximately 25,000 images with over 40,000 annotated individuals. It provides annotations for 16 body joints and includes activity labels. MPII was the primary benchmark for single-person pose estimation before COCO became dominant.
Human3.6M is the largest and most commonly used dataset for 3D pose estimation. It contains 3.6 million video frames captured in a controlled indoor setting with a marker-based motion capture system. Eleven professional actors (6 male, 5 female) perform 17 activities including walking, sitting, eating, discussion, and taking photos. The dataset provides synchronized multi-view video with accurate 3D joint positions, making it the standard benchmark for monocular 3D pose estimation methods.
Pose estimation models are evaluated using several standard metrics.
PCK measures the fraction of predicted keypoints that fall within a specified distance threshold of the ground truth. The threshold is typically normalized by the size of the person. Common variants include:

- **PCKh@0.5**: the threshold is 50% of the head segment length; this is the standard metric on the MPII benchmark.
- **PCK@0.2**: the threshold is 20% of the torso diameter, used on earlier benchmarks such as FLIC.
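A minimal PCK implementation, parameterized by the normalizing reference length:

```python
import numpy as np

def pck(pred, gt, reference_length, alpha=0.5):
    """Fraction of keypoints within alpha * reference_length of ground truth.
    For PCKh the reference length is the head segment; for PCK@0.2 it is
    commonly the torso diameter."""
    dists = np.linalg.norm(pred - gt, axis=-1)
    return float(np.mean(dists <= alpha * reference_length))

gt = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
pred = gt + np.array([[1, 0], [0, 1], [4, 0], [0, 9]])   # per-joint errors
print(pck(pred, gt, reference_length=10.0, alpha=0.5))   # 0.75
```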
OKS is the standard metric for COCO keypoint evaluation. It computes the similarity between predicted and ground-truth keypoints by measuring the Euclidean distance between each pair, normalized by the scale of the person instance (the annotated object area) and a per-keypoint constant that accounts for annotation variance. OKS values range from 0 to 1, where 1 indicates a perfect match. OKS is more comprehensive than PCK because it accounts for both the scale of the person and the inherent difficulty of localizing each specific keypoint (for example, shoulder keypoints are typically easier to annotate precisely than hip keypoints).
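A sketch of the OKS computation. The per-keypoint constants are the sigmas from the public COCO evaluation code (pycocotools); the rest is a simplified version of its formula:

```python
import numpy as np

# Per-keypoint falloff constants from the COCO evaluation code.
COCO_SIGMAS = np.array([
    .026, .025, .025, .035, .035, .079, .079, .072, .072,
    .062, .062, .107, .107, .087, .087, .089, .089])

def oks(pred, gt, visibility, area):
    """Object Keypoint Similarity between a predicted and ground-truth pose,
    averaged over labelled keypoints only."""
    d2 = np.sum((pred - gt) ** 2, axis=-1)
    kappa = 2 * COCO_SIGMAS                 # COCO uses k_i = 2 * sigma_i
    e = d2 / (2 * area * kappa ** 2)
    labelled = visibility > 0
    return float(np.mean(np.exp(-e)[labelled]))

gt = np.zeros((17, 2))
vis = np.ones(17)
print(oks(gt.copy(), gt, vis, area=1000.0))  # 1.0 for a perfect prediction
```

Note how the large sigma for hips (0.107) versus the nose (0.026) makes the metric forgiving of hip localization error, reflecting annotation variance.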
AP on the COCO benchmark uses OKS as the matching criterion. A predicted pose is considered a true positive if its OKS with the corresponding ground-truth instance exceeds a given threshold. AP is computed by averaging precision at multiple recall levels, similar to the approach used in object detection. The primary metrics reported include:

- **AP**: the mean of AP computed at ten OKS thresholds from 0.50 to 0.95 in steps of 0.05 (the primary challenge metric).
- **AP50 and AP75**: AP at the single OKS thresholds of 0.50 (loose) and 0.75 (strict).
- **APM and APL**: AP restricted to medium and large person instances, respectively.
- **AR**: average recall over the same ten OKS thresholds.
MPJPE is the standard metric for 3D pose estimation, particularly on Human3.6M. It measures the average Euclidean distance (in millimeters) between predicted and ground-truth 3D joint positions after aligning the root joint (typically the pelvis). A variant called P-MPJPE (Procrustes-aligned MPJPE) further applies rigid alignment (rotation, translation, and scaling) before computing the error.
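Both metrics are short numpy computations; the Procrustes step below is the standard SVD-based similarity alignment (rotation, translation, scale):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error, in the same units as the input (mm)."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def p_mpjpe(pred, gt):
    """MPJPE after similarity (Procrustes) alignment of pred onto gt."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation via SVD of the cross-covariance matrix.
    u, s, vt = np.linalg.svd(p.T @ g)
    r = u @ vt
    # Fix a possible reflection so r is a proper rotation.
    if np.linalg.det(r) < 0:
        u[:, -1] *= -1
        s[-1] *= -1
        r = u @ vt
    scale = s.sum() / (p ** 2).sum()
    aligned = scale * p @ r + mu_g
    return mpjpe(aligned, gt)

gt = np.random.default_rng(1).standard_normal((17, 3)) * 100
rotated = gt @ np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1.0]])  # 90 deg rotation
print(round(p_mpjpe(rotated, gt), 6))  # 0.0 after alignment
```

A globally rotated but otherwise correct skeleton scores poorly under MPJPE yet near zero under P-MPJPE, which is why both are reported.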
| Model | Year | Approach | Backbone | COCO AP (test-dev) | Key Innovation |
|---|---|---|---|---|---|
| DeepPose | 2014 | Top-down, regression | AlexNet | N/A (pre-COCO benchmark) | First DNN-based pose estimation |
| Stacked Hourglass | 2016 | Top-down, heatmap | Hourglass modules | N/A (MPII benchmark) | Multi-scale feature processing with intermediate supervision |
| OpenPose | 2017 | Bottom-up, heatmap + PAFs | VGG-19 (first 10 layers) | Won COCO 2016 Keypoints Challenge | Part Affinity Fields for multi-person association |
| SimpleBaseline | 2018 | Top-down, heatmap | ResNet | 73.7 (ResNet-152) | Simple deconvolution head proves effective |
| HRNet-W48 | 2019 | Top-down, heatmap | HRNet | ~75.5 | Maintains high-resolution representations throughout |
| HigherHRNet | 2020 | Bottom-up, heatmap | HRNet | 70.5 | Multi-resolution heatmap aggregation for bottom-up |
| ViTPose-H | 2022 | Top-down, heatmap | ViT-Huge | 79.1 | Plain vision transformer backbone |
| ViTPose-G | 2022 | Top-down, heatmap | ViTAE-G (1B params) | 80.9 | Scaling to 1 billion parameters |
| RTMPose-m | 2023 | Top-down, SimCC | CSPNeXt | 75.8 | Real-time with coordinate classification; 90+ FPS on CPU |
| RTMPose-l | 2023 | Top-down, SimCC | CSPNeXt | 76.3 | Larger variant with higher accuracy |
Pose estimation enables detailed biomechanical analysis of athletic performance without the need for expensive marker-based motion capture suits. Coaches and analysts can track joint angles, stride length, hip elevation during jumps, and other kinematic metrics from standard video. AI-powered systems can detect deviations in throwing or running mechanics that may indicate injury risk. Professional sports organizations use pose estimation for player tracking, technique comparison, and performance optimization.
In clinical settings, pose estimation supports objective gait analysis, posture assessment, and range-of-motion measurement. Physical therapists can use vision-based systems to monitor patients performing rehabilitation exercises remotely, ensuring correct form and tracking recovery progress over time. This is particularly valuable for telehealth applications where in-person supervision is not feasible. Pose-based analysis has been applied to conditions such as Parkinson's disease, stroke recovery, scoliosis screening, and fall risk assessment in elderly populations.
Pose estimation is a core technology for AR and VR experiences. Body tracking allows virtual avatars to mirror the user's movements in real time. Hand pose estimation enables natural hand-based interaction with virtual objects, replacing the need for handheld controllers. Face landmark detection powers AR filters and face effects on platforms such as Snapchat and Instagram. Meta's Quest headsets, Apple Vision Pro, and other XR devices rely on pose estimation for body, hand, and eye tracking.
Pose estimation provides a privacy-preserving, person-independent representation for sign language recognition systems. By extracting skeleton keypoints from the signer's body, hands, and face, the system converts visual information into a compact, structured format that can be processed by sequence models such as LSTMs or transformers. Research has demonstrated that pose-based features can achieve competitive accuracy for both isolated sign recognition and continuous sign language translation. MediaPipe Holistic, which simultaneously detects body, hand, and face landmarks, has become a popular front-end for sign language recognition pipelines.
Consumer fitness applications use pose estimation to count repetitions, measure joint angles during exercises, and provide real-time feedback on exercise form. Products such as Apple Fitness+ and various AI coaching apps leverage on-device pose models to guide users through workouts without additional hardware. These systems can detect common form errors (for example, knees caving inward during squats) and alert users to reduce injury risk.
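Joint-angle feedback of this kind reduces to a vector computation on three keypoints, sketched here for a knee angle (the keypoint coordinates are hypothetical):

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (in degrees) formed by keypoints a-b-c,
    e.g. hip-knee-ankle for the knee angle during a squat."""
    v1, v2 = np.asarray(a) - b, np.asarray(c) - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

hip, knee, ankle = (0.0, 0.0), (0.0, 1.0), (1.0, 1.0)
print(joint_angle(hip, knee, ankle))  # 90.0
```

Repetition counting then amounts to thresholding such an angle over time and counting the transitions.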
Traditional motion capture for film and video game production requires actors to wear suits with reflective markers and perform in specialized studios with many calibrated cameras. Deep learning-based pose estimation has enabled markerless motion capture, where 3D poses are estimated directly from standard video footage. Tools like Move.ai, Plask, and Rokoko Video use neural networks to convert ordinary video into 3D animation data. This democratizes motion capture by eliminating the need for expensive hardware, though marker-based systems still offer higher precision for productions with strict accuracy requirements.
Skeleton-based action recognition uses pose estimation as a preprocessing step, extracting keypoint sequences that are then classified into activity categories. Because skeleton representations are compact and abstract away appearance details (clothing, background, lighting), they generalize well across different environments. Applications include monitoring for falls in elderly care facilities, detecting aggressive behavior in public spaces, and understanding human activities in autonomous driving scenarios.
Animal pose estimation extends keypoint detection to non-human species, supporting research in neuroscience, ethology, ecology, and veterinary medicine. The field faces unique challenges compared to human pose: animals exhibit far greater morphological diversity, annotated datasets are smaller, and occlusion patterns differ significantly.
DeepLabCut, developed by Mathis et al. (2018) and published in Nature Neuroscience, is the most widely adopted framework for animal pose estimation. Originally built on top of DeeperCut (a human pose model), it allows researchers to train custom keypoint detectors from a small number of labeled frames. DeepLabCut supports multi-animal pose estimation and tracking, enabling studies of social behavior and group dynamics. The SuperAnimal models introduced in DeepLabCut 3.0 provide pretrained foundation models that work across over 45 species without additional manual labeling and are 10 to 100 times more data-efficient than prior transfer learning approaches when fine-tuned.
The AP-10K dataset (2021) is a benchmark for animal pose estimation containing 10,015 images spanning 23 animal families and 54 species. It defines 17 keypoints per animal, following a general quadruped skeleton. AP-10K has been integrated into DeepLabCut and other toolkits for benchmarking cross-species pose models.
ViTPose++ extended the ViTPose framework to animal pose estimation, demonstrating that transformer-based architectures trained on large-scale human data can transfer effectively to animal keypoint detection with appropriate fine-tuning.
Despite significant progress, several open challenges remain in pose estimation: