OpenPose
Last reviewed
Apr 30, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,814 words
OpenPose is an open-source library for real-time multi-person 2D pose estimation that detects body, foot, hand, and facial keypoints in images and video. Developed at the Carnegie Mellon University Perceptual Computing Lab, it was the first published work to demonstrate simultaneous detection of body, foot, hand, and facial keypoints in a single image, and one of the earliest systems to deliver real-time multi-person 2D pose estimation on consumer GPUs. The system was introduced in the CVPR 2017 paper "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields" by Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh, and extended in the IEEE TPAMI journal version (published online in 2019) with Gines Hidalgo as an additional co-author.
The central technical contribution is the Part Affinity Field (PAF), a 2D vector field representation that encodes both the location and orientation of limbs across the image. PAFs let the system parse multiple people from a single forward pass through a convolutional neural network without first running a person detector, which is the standard step in top-down approaches. Because the runtime of the body detection stage is essentially independent of the number of people in the scene, OpenPose holds a roughly steady frame rate on crowded video, whereas top-down systems slow down with each additional person.
The project is hosted at github.com/CMU-Perceptual-Computing-Lab/openpose, written mostly in C++ and CUDA, with Python bindings, a Caffe back end, ROS integration, and binaries for Windows, Ubuntu, and macOS. It has been cited tens of thousands of times and remains a standard reference for bottom-up pose estimation, even after later systems such as HRNet, PifPaf, BlazePose, MoveNet, and ViTPose surpassed it on individual axes of speed or accuracy.
The OpenPose lineage traces directly back to the Convolutional Pose Machines (CPM) paper by Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh, published at CVPR 2016. CPM showed that a sequential convolutional neural network could implicitly model long-range dependencies between joints by stacking stages that operate on belief maps from previous stages, producing increasingly refined heatmaps for each body part without an explicit graphical model. CPM achieved state-of-the-art accuracy on the MPII, LSP, and FLIC pose benchmarks, but it was a single-person method.
The step from single-person CPM to multi-person OpenPose came in late 2016. Cao, Simon, Wei, and Sheikh posted the first multi-person paper to arXiv as preprint 1611.08050 in November 2016, then presented it as an oral at CVPR 2017 in Honolulu under the title "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields." That same year, Simon, Hanbyul Joo, Iain Matthews, and Sheikh introduced the hand keypoint detector via multiview bootstrapping at CVPR 2017, using the CMU Panoptic Studio to generate 3D-triangulated training labels for hand poses across hundreds of synchronized cameras. The combination of the body and hand pipelines, plus an integrated face keypoint model and a foot dataset annotated by the CMU group, became the OpenPose system released publicly in mid-2017.
The extended journal version appeared on arXiv in December 2018 as 1812.08008 and was published in IEEE Transactions on Pattern Analysis and Machine Intelligence, first online in 2019 and then in print in Vol. 43, Issue 1 (January 2021), pp. 172-186. The TPAMI version adds Gines Hidalgo as a co-author, introduces the BODY_25 model that integrates body and foot keypoints into a single network, replaces simultaneous body part and PAF refinement with a faster strategy that refines the PAFs first, and reports the first combined body and foot keypoint detector along with a newly released foot dataset.
The OpenPose authors all worked under Yaser Sheikh's Perceptual Computing Lab at the Robotics Institute, CMU. Sheikh later served as research lead at Facebook Reality Labs in Pittsburgh, where many of the same people moved on to work on photorealistic avatars and codec avatars. Tomas Simon's PhD thesis covered the multiview bootstrapping work that became the OpenPose hand pipeline. Hanbyul Joo, who built the Panoptic Studio dome at CMU, is a co-author on the hand paper. Zhe Cao led the original CVPR 2017 multi-person paper, and Gines Hidalgo joined as the engineer who wrote and maintained much of the production C++ codebase.
The Part Affinity Field is a 2D vector field defined over the image plane, one field per limb type. For a given limb that connects two body parts (for instance, the right elbow to the right wrist), the PAF is non-zero at every pixel that lies along the line segment between the two endpoints, and zero elsewhere. Where the field is non-zero, its value is the unit vector pointing from the start joint to the end joint. This representation gives the network a way to vote for an entire limb's existence and orientation, instead of voting only for joint locations and then guessing how to connect them.
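The definition above can be made concrete with a minimal sketch that rasterizes the ground-truth field for one limb. This is an illustration, not the library's training code: the function name `paf_ground_truth` and the `limb_width` tolerance (how far from the segment a pixel may lie and still count as "on" the limb) are assumptions introduced here.

```python
import math

def paf_ground_truth(height, width, start, end, limb_width=1.0):
    """Return a height x width grid of (vx, vy) vectors for one limb.

    start, end: (x, y) joint positions in pixels.
    limb_width: assumed tolerance, in pixels, around the limb segment.
    """
    dx, dy = end[0] - start[0], end[1] - start[1]
    length = math.hypot(dx, dy)
    field = [[(0.0, 0.0)] * width for _ in range(height)]
    if length == 0:
        return field  # degenerate limb: leave the field at zero
    ux, uy = dx / length, dy / length  # unit vector from start to end
    for y in range(height):
        for x in range(width):
            px, py = x - start[0], y - start[1]
            along = px * ux + py * uy        # position along the limb axis
            across = abs(px * uy - py * ux)  # perpendicular distance to it
            if 0.0 <= along <= length and across <= limb_width:
                field[y][x] = (ux, uy)
    return field

# A horizontal limb from (0, 2) to (4, 2) on a 5x5 grid.
field = paf_ground_truth(5, 5, (0, 2), (4, 2), limb_width=0.5)
```

Pixels on the segment carry the unit vector `(1.0, 0.0)`; every other pixel stays at `(0.0, 0.0)`.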
At inference time, the system runs two operations. First, it extracts candidate body part locations by finding peaks in the per-part confidence heatmaps. Then, for every pair of candidate parts that could form a valid limb, it integrates the PAF along the line segment between them and computes the dot product of the field with the limb direction. A high integral means the field strongly supports a limb between those two candidates. The result is a per-edge weight that feeds into a bipartite matching problem.
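The scoring step can be sketched as a sampled line integral. The sketch below assumes a field stored as `field[y][x] -> (vx, vy)`, candidate points that lie inside the image, and a fixed number of evenly spaced samples; the name `paf_score` and the `samples` parameter are illustrative choices, not the library's API.

```python
import math

def paf_score(field, a, b, samples=10):
    """Approximate the line integral of a PAF between candidates a and b.

    field: grid indexed as field[y][x] -> (vx, vy).
    a, b: (x, y) candidate joint locations inside the grid.
    samples: number of evenly spaced sample points (must be >= 2).
    """
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length  # direction of the hypothesized limb
    total = 0.0
    for i in range(samples):
        t = i / (samples - 1)
        x = int(round(a[0] + t * dx))
        y = int(round(a[1] + t * dy))
        vx, vy = field[y][x]
        total += vx * ux + vy * uy     # dot product with the limb direction
    return total / samples

# Toy field: a horizontal limb pointing in +x along row y = 2.
toy = [[(1.0, 0.0) if y == 2 else (0.0, 0.0) for x in range(5)] for y in range(5)]
aligned = paf_score(toy, (0, 2), (4, 2))
reversed_ = paf_score(toy, (4, 2), (0, 2))
```

A pair aligned with the field scores close to 1, the same pair in reverse order scores close to -1, and a pair that crosses empty regions scores near 0, which is what lets the score separate true limbs from accidental pairings.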
Rather than solve the full multi-person assembly as a global integer linear program (which is NP-hard), the system relaxes it to a sequence of independent bipartite matchings, one per limb type. Each bipartite matching can be solved optimally with the Hungarian algorithm; the greedy part is the decomposition itself, which commits to the matches for each limb type without revisiting them. The matched limbs are then merged into person assemblies by the joints they share. The relaxation relies on a tree-shaped skeleton, in which adjacent limbs share a single part type, and the authors report that it closely approximates the global solution while costing a small fraction of the network forward pass.
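The per-limb matching and the merge step can be sketched as follows. Note the substitution: for brevity this sketch uses a sort-based greedy one-to-one assignment as a stand-in for an optimal bipartite solver, so it is not the matching described in the paper. The function names, the `min_score` threshold, and the tuple layout are all assumptions made for illustration.

```python
def match_limb(scores, min_score=0.05):
    """Greedy one-to-one assignment for a single limb type.

    scores: dict mapping (start_idx, end_idx) -> PAF score.
    Returns accepted (start_idx, end_idx) pairs, best score first.
    """
    used_start, used_end, pairs = set(), set(), []
    for (s, e), score in sorted(scores.items(), key=lambda kv: -kv[1]):
        if score < min_score:
            break  # remaining pairs are weaker still
        if s not in used_start and e not in used_end:
            pairs.append((s, e))
            used_start.add(s)
            used_end.add(e)
    return pairs

def assemble(limbs):
    """Merge matched limbs into people through the joints they share.

    limbs: list of (part_a, idx_a, part_b, idx_b), in skeleton order so
    that part_a has usually been assigned to a person already.
    """
    people = []
    for part_a, idx_a, part_b, idx_b in limbs:
        for person in people:
            if person.get(part_a) == idx_a:
                person[part_b] = idx_b  # extend an existing person
                break
        else:
            people.append({part_a: idx_a, part_b: idx_b})  # start a new one
    return people

# Two shoulders and two elbows; the diagonal pairings score highest.
pairs = match_limb({(0, 0): 0.9, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.8})
people = assemble([
    ("shoulder", 0, "elbow", 0),
    ("shoulder", 1, "elbow", 1),
    ("elbow", 0, "wrist", 1),
    ("elbow", 1, "wrist", 0),
])
```

Because each limb type is matched on its own, the cost grows with the number of candidate pairs per limb rather than with the number of whole-skeleton hypotheses.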
The original CVPR 2017 architecture is a two-branch, multi-stage convolutional network. The first 10 layers of VGG-19 (pretrained on ImageNet) extract feature maps F from the input image. F is then fed in parallel to two branches: one branch predicts a set of part confidence maps S (one heatmap per body part), and the other predicts a set of part affinity fields L (one 2D vector field per limb). Each branch runs over multiple stages, with each stage taking as input both F and the predictions from the previous stage. At every stage, the network applies an L2 loss between the predicted S and L and the ground-truth heatmaps and PAFs. The total loss is the sum of the per-stage losses, which provides intermediate supervision and helps avoid vanishing gradients in deep networks.
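The intermediate-supervision objective can be written out in a few lines. This sketch flattens each prediction to a plain list of floats and omits details such as masking of unlabeled regions; the function names are illustrative.

```python
def l2_loss(pred, target):
    """Sum of squared differences between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def total_loss(stage_S, stage_L, gt_S, gt_L):
    """Sum the confidence-map and PAF losses over every stage.

    stage_S, stage_L: per-stage predictions (lists of flattened maps).
    gt_S, gt_L: flattened ground-truth confidence maps and PAFs.
    """
    return sum(
        l2_loss(S, gt_S) + l2_loss(L, gt_L)
        for S, L in zip(stage_S, stage_L)
    )

# Two stages; the second stage's confidence map is closer to the target.
loss = total_loss(
    stage_S=[[1.0, 0.0], [0.5, 0.0]],
    stage_L=[[0.0], [0.0]],
    gt_S=[0.0, 0.0],
    gt_L=[1.0],
)
```

Every stage contributes its own term, so gradients reach the early stages directly instead of only through the later ones.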
The original COCO model produces 38 PAF channels (19 limb types times 2 components per vector field) and 19 confidence map channels (18 body parts plus a background channel). The MPII variant uses different channel counts to match the MPII keypoint set.
The TPAMI architecture changes two things. First, instead of refining both branches in parallel, the network refines the PAFs over several stages and only afterwards predicts the confidence maps, conditioned on the refined PAFs. The authors observed empirically that PAFs benefit much more from iterative refinement than confidence maps, and that running fewer confidence-map stages improves both accuracy and runtime. Second, the BODY_25 model integrates body and foot keypoints into the same network so that one forward pass produces 25 keypoint heatmaps instead of 18, eliminating a separate model for foot estimation.
OpenPose supports multiple keypoint configurations. The body output format is selected at runtime, and the hand and face models are independent and can be enabled together for a 135-keypoint whole-body output.
| Format | Keypoints | Notes |
|---|---|---|
| BODY_25 | 25 | Default and recommended; adds 6 foot keypoints (big toe, small toe, heel for each foot) and a mid-hip point to the 18 COCO-format joints |
| COCO | 18 | Original MS COCO keypoint set extended with the neck joint |
| MPI | 15 | MPII Human Pose joint layout; least accurate but fastest on CPU |
| MPI_4_layers | 15 | Reduced-depth MPI variant for lower-end hardware |
| Hand | 21 per hand | Wrist plus 4 keypoints per finger; trained via multiview bootstrapping |
| Face | 70 | Eyes, eyebrows, nose, mouth, jawline; integrated face model |
The BODY_25 model is recommended by the maintainers because it is faster than the COCO model on GPU and is the only configuration with foot keypoints, which matter for downstream applications like gait analysis and full-body motion capture. The COCO 18-keypoint format is the original output used in the CVPR 2017 paper.
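For code that indexes into BODY_25 output, a name table is convenient. The ordering below is believed to match the project's output-format documentation, but it is reproduced here as an assumption and should be checked against the version you run.

```python
# Assumed BODY_25 keypoint order (index -> name); verify before relying on it.
BODY_25 = [
    "Nose", "Neck",
    "RShoulder", "RElbow", "RWrist",
    "LShoulder", "LElbow", "LWrist",
    "MidHip",
    "RHip", "RKnee", "RAnkle",
    "LHip", "LKnee", "LAnkle",
    "REye", "LEye", "REar", "LEar",
    "LBigToe", "LSmallToe", "LHeel",
    "RBigToe", "RSmallToe", "RHeel",
]

# Indices of the foot keypoints, useful for gait-style analyses.
FOOT_INDICES = [i for i, name in enumerate(BODY_25)
                if name.endswith(("Toe", "Heel"))]
```

Under this ordering the six foot keypoints occupy the last six slots of the array.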
OpenPose models were trained and evaluated against three primary datasets, plus a custom foot dataset released by the CMU group.
| Dataset | Keypoints | Scale | Use in OpenPose |
|---|---|---|---|
| MS COCO Keypoints | 17 body | ~150,000 person instances over 200,000+ images | Primary training and evaluation set; basis for the COCO 2016 keypoint challenge |
| MPII Human Pose | 16 body | ~25,000 images, 40,000 person instances | Used to evaluate the multi-person variant; OpenPose set state of the art on MPII Multi-Person |
| CMU Panoptic Studio | n/a (capture rig) | 480 VGA + 31 HD synchronized cameras; 200,000+ frames | Source of multiview-bootstrapped hand training data and 3D supervision |
| CMU Foot Dataset | 6 foot keypoints | ~14,000 person instances annotated on COCO | Released alongside the TPAMI paper for BODY_25 training |
The Panoptic Studio is a geodesic dome at CMU described in the ICCV 2015 paper by Hanbyul Joo and colleagues. It contains 480 VGA cameras (640x480 at 25 fps), 31 HD cameras (1920x1080 at 30 fps), and 10 Kinect II sensors. The dome was specifically designed to capture social interactions in unconstrained motion, with cameras dense enough that occluded body parts are nearly always visible from at least one viewpoint. The hand keypoint detector training procedure exploits this multiview redundancy: an initial weak detector produces noisy 2D labels in each camera, and triangulation rejects views where the 2D detection disagrees with the 3D-reconstructed point. The agreed-upon 3D points are reprojected into every camera and used as labels in the next training iteration. After several iterations the detector improves enough to label hand keypoints reliably in single RGB images at runtime.
OpenPose's result on the MS COCO keypoints test-dev set, as reported in the CVPR 2017 paper, compares with contemporaneous top-down methods as follows:
| Method | AP | Approach |
|---|---|---|
| OpenPose (CMU-Pose) | 61.8 | Bottom-up |
| Mask R-CNN | 62.7 | Top-down |
| G-RMI | 60.5 | Top-down |
On the MPII Multi-Person test set, the CVPR 2017 paper reported 75.6 mAP, well above the prior state of the art at the time. The method also placed first in the inaugural COCO 2016 keypoint challenge.
Runtime numbers in the CVPR 2017 paper were measured on a laptop with a single Nvidia GeForce GTX 1080 GPU, with 1080p input resampled to network resolution. The body model runs at roughly 8.8 frames per second on multi-person images, and the runtime is essentially flat with respect to the number of people in the frame, because the network forward pass dominates and the bipartite matching step is negligible. Top-down methods, by contrast, run a single-person pose estimator on each detected person, so their runtime grows roughly linearly with the number of people. For a scene with 19 people, the OpenPose paper reports that bottom-up parsing was about 6 times faster than top-down on the same hardware.
Later hardware and architecture revisions changed these numbers. On a Titan Xp, the BODY_25 model runs at about 22 fps for a single person, dropping to somewhere between 1 and 5 fps when the hand and face models are enabled at full resolution, depending on input size. The CPU-only build is much slower (typically below 1 fps for full-body inference), which is why MediaPipe BlazePose and MoveNet became preferred for mobile and on-device use cases.
The production OpenPose codebase is C++ with CUDA kernels, with Caffe used as the back-end deep learning framework for the network forward pass. The repository ships with a command-line demo that reads from images, video files, webcams, IP cameras, or Flir machine vision cameras, and writes keypoint output to JSON, XML, or YML files alongside rendered visualizations as PNG, JPG, or AVI. A Python wrapper exposes the same functionality for scripting, and a separately maintained ROS package provides integration into robotics pipelines.
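Downstream scripts usually consume the JSON output rather than the rendered frames. The sketch below assumes the commonly documented layout, a top-level `people` array whose entries hold a flat `pose_keypoints_2d` list of `x, y, confidence` triples, as written by the demo's `--write_json` option. The sample string is hand-made and truncated to three keypoints for brevity, and `load_people` is a hypothetical helper.

```python
import json

# Hand-written sample in the assumed layout; a real BODY_25 entry would
# carry 25 triples. The all-zero triple stands for an undetected keypoint.
SAMPLE = """
{"version": 1.3,
 "people": [
   {"pose_keypoints_2d": [100.0, 50.0, 0.9, 110.0, 80.0, 0.8, 0.0, 0.0, 0.0]}
 ]}
"""

def load_people(text, field="pose_keypoints_2d"):
    """Return one list of (x, y, confidence) triples per detected person."""
    doc = json.loads(text)
    people = []
    for person in doc.get("people", []):
        flat = person.get(field, [])
        people.append([tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)])
    return people

people = load_people(SAMPLE)
# Keep only keypoints detected with non-zero confidence.
visible = [[kp for kp in person if kp[2] > 0.0] for person in people]
```

Passing a different `field` name lets the same helper read the hand or face arrays, assuming they follow the same flat-triple convention.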
The repository supports CUDA on Nvidia GPUs, OpenCL on AMD GPUs, and a CPU-only fallback. Build instructions cover Ubuntu, Windows, and macOS, plus Nvidia Jetson TX2 for embedded use. The default models are downloaded as part of the build process from CMU servers. Because the network is implemented in Caffe, GPU memory requirements for the full BODY_25 plus hand plus face configuration are substantial; the documentation lists 4 GB of GPU memory as a minimum and recommends 8 GB or more for stable inference at high resolution.
A notable feature is single-person 3D triangulation. If the user runs OpenPose on synchronized video from multiple calibrated cameras, the library can triangulate the 2D keypoints across views to produce a 3D pose. This is separate from the much more capable multi-person 3D pipelines built on top of the Panoptic Studio, which use OpenPose 2D keypoints from each of the dome's cameras.
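To illustrate the idea of triangulating one keypoint from two views, the sketch below uses the midpoint-of-two-rays method. This is a simplified stand-in chosen for readability, not the library's implementation. It assumes each camera's calibration has already been used to turn a 2D keypoint into a 3D ray (camera center plus direction); the function name is hypothetical.

```python
def triangulate_midpoint(c1, d1, c2, d2, eps=1e-12):
    """Return the 3D point closest to two rays, or None if they are parallel.

    c1, c2: camera centers as (x, y, z).
    d1, d2: ray directions through the observed 2D keypoints.
    """
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    w = tuple(a - b for a, b in zip(c1, c2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < eps:
        return None  # parallel rays give no unique closest point
    s = (b * e - c * d) / denom  # parameter along ray 1
    t = (a * e - b * d) / denom  # parameter along ray 2
    p1 = tuple(ci + s * di for ci, di in zip(c1, d1))
    p2 = tuple(ci + t * di for ci, di in zip(c2, d2))
    return tuple((u + v) / 2 for u, v in zip(p1, p2))

# Two cameras 5 units apart, both looking at the point (0, 0, 5).
point = triangulate_midpoint((0, 0, 0), (0, 0, 1), (5, 0, 0), (-1, 0, 1))
```

With noisy detections the two rays rarely intersect, which is why the midpoint of their closest approach is used; more than two views would call for a least-squares formulation instead.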
OpenPose is released under a non-commercial research license. Academic and personal use is free. Any commercial use requires a separate paid license through CMU's FlintBox technology transfer office. This licensing model has been a recurring point of friction for industry users: companies that want to ship products containing OpenPose either pay the CMU license fee, run a parallel internal reimplementation, or switch to permissively licensed alternatives such as MediaPipe (Apache 2.0), MoveNet (Apache 2.0), AlphaPose (academic, with separate commercial terms), or YOLO-Pose (GPL-3.0 inherited from Ultralytics YOLOv5). The license terms are one of the main reasons OpenPose itself is rarely embedded in shipping consumer products even though the algorithms it pioneered are widely used.
OpenPose was the leading bottom-up pose estimator from 2017 to roughly 2019. Since then, several systems have surpassed it on different axes: top-down methods like HRNet and ViTPose lead in pure accuracy, mobile methods like BlazePose and MoveNet are dramatically faster on phones, and end-to-end YOLO-style architectures collapse detection and pose into a single pass.
| Method | Year | Type | Key trait | Where it beats OpenPose |
|---|---|---|---|---|
| OpenPose | 2017 | Bottom-up, PAF | First real-time multi-person system | (baseline) |
| AlphaPose / RMPE | 2017 | Top-down, SSTN | First open-source top-down system at 70+ AP on COCO | Higher accuracy on isolated persons |
| HRNet | 2019 | Top-down | Maintains high-resolution features through the whole network | 75.5 AP vs 61.8 on COCO |
| PifPaf | 2019 | Bottom-up, PIF + PAF | Composite fields with Laplace loss | Better in low-resolution and occluded scenes |
| BlazePose | 2020 | Top-down, MobileNet | 33-keypoint single-person model for mobile | 25-75x faster than OpenPose at similar AR/fitness accuracy |
| MoveNet | 2021 | CenterNet-based | Lightning and Thunder variants for edge devices | 30+ FPS on phones; runs in browser via TF.js |
| YOLO-Pose | 2022 | End-to-end YOLO | Joint object and keypoint detection in one pass | 90.3% AP50 on COCO test-dev with no test-time augmentation |
| ViTPose | 2022 | Top-down, ViT | Plain vision transformer backbone | 80.9 AP on COCO test-dev (single model) |
| ViTPose++ | 2023 | Top-down, ViT | Generic body pose with knowledge token transfer | State of the art across multiple pose tasks |
| DWPose | 2023 | Top-down, distilled | Two-stage distillation for whole-body keypoints | 66.5 AP on COCO-WholeBody |
The lasting influence of OpenPose comes less from its raw accuracy numbers (which are now well behind the leaders) and more from the Part Affinity Field idea and the bottom-up parsing strategy, both of which were copied or adapted by many follow-on systems including PifPaf and HigherHRNet. The pre-trained OpenPose body model is also the most common skeleton extractor used as input to ControlNet pose-conditioned image generators, which keeps it in active use long after it stopped being the most accurate pose estimator.
OpenPose has been deployed across a wide range of computer vision and human-motion analysis settings. Some of the better-documented application areas are listed below.
| Domain | Use of OpenPose |
|---|---|
| Sports analytics | Player tracking and biomechanical analysis in basketball, baseball, football, tennis, and cycling; broadcasters and franchises use OpenPose-style keypoints to derive metrics like joint angles and stride lengths |
| Healthcare and rehabilitation | Gait analysis, balance assessment, range-of-motion tracking for physical therapy, and screening for movement disorders; cited in clinical studies as a low-cost markerless alternative to lab motion capture |
| Human-computer interaction | Gesture recognition, virtual try-on, body-driven avatars, dance and fitness apps |
| Motion capture for animation | Markerless mocap for indie animators and VFX; combined with multiview rigs to drive 3D character animation |
| Surveillance and security | Crowd density estimation, fall detection, fight detection, and unusual behavior recognition in public spaces |
| Sign language recognition | Hand and finger keypoints used as input to sequence models trained on continuous sign language; 21-keypoint hand model is well suited to fingerspelling |
| Augmented reality | Body-tracking filters and effects on platforms like Snapchat, TikTok, and Instagram; mostly via OpenPose-derived techniques rather than the OpenPose binary itself due to licensing |
| Driver monitoring | Detecting drowsiness, distraction, and abnormal posture from in-cabin cameras |
| Industrial ergonomics | Posture monitoring and ergonomic risk assessment for assembly-line and warehouse workers |
| AI image generation | Pose conditioning input to ControlNet-style diffusion models, where an OpenPose skeleton constrains the pose of a generated character |
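Several of the applications above, such as sports analytics, rehabilitation, and ergonomics, reduce to measuring joint angles from 2D keypoints. A minimal sketch, assuming three `(x, y)` keypoints such as shoulder, elbow, and wrist; the function name is illustrative, and the result is an image-plane angle rather than a true 3D joint angle.

```python
import math

def joint_angle(a, b, c):
    """Angle in degrees at joint b between segments b->a and b->c.

    Returns None when either segment has zero length.
    """
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        return None
    cos = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

# Elbow at the origin, forearm and upper arm at a right angle.
elbow = joint_angle((0, 1), (0, 0), (1, 0))
```

In practice the keypoint confidences would be checked first, since a low-confidence or missing keypoint makes the angle meaningless.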
OpenPose has several well-documented weaknesses that motivated the wave of follow-up systems.
It produces only 2D keypoints from a single camera. Recovering 3D pose requires either multiple calibrated cameras (Panoptic-style triangulation) or a separate 2D-to-3D lifting model trained on datasets like Human3.6M. The library does ship a single-person triangulation utility for multi-camera setups, but multi-person 3D from a single view is not supported.
Occlusion remains hard. Severe self-occlusion, person-on-person occlusion in crowded scenes, and tightly-packed limbs cause the bottom-up parser to merge or drop keypoints. PifPaf was specifically designed to address this case, and it outperformed OpenPose by a wide margin on crowded benchmarks.
The non-commercial license is a barrier to industry adoption. Companies that want to ship pose estimation in a consumer product typically choose MediaPipe BlazePose or MoveNet, both of which are Apache-licensed.
Accuracy is no longer competitive at the top of the leaderboard. HRNet and ViTPose substantially exceed OpenPose on the COCO keypoint AP metric, and DWPose leads on the whole-body benchmark.
Speed is no longer competitive on mobile or edge devices. BlazePose and MoveNet run in real time on phones, while OpenPose's full BODY_25 plus hand plus face pipeline still requires a discrete GPU. The Lightweight OpenPose project (Daniil Osokin, 2018) reduces the network to make it tractable on CPU, but it sacrifices accuracy.
GPU memory requirements are high. The full whole-body configuration with hand and face models needs at least 4 GB of GPU memory and ideally 8 GB or more for stable high-resolution inference.
The OpenPose papers have been cited tens of thousands of times across both the academic and applied literature. The Part Affinity Field idea was directly extended by PifPaf, which substituted Laplace-distributed composite fields, and it sits alongside other bottom-up grouping strategies such as associative embedding (Newell et al., 2017), the grouping method later used by HigherHRNet. The Convolutional Pose Machines stage-wise refinement pattern that OpenPose inherited from Wei et al. 2016 became standard practice in heatmap-based pose estimation.
Beyond the academic citations, OpenPose's pre-trained body skeletons are now embedded in commercial workflows that the original authors did not anticipate. The most visible example is ControlNet for Stable Diffusion (Zhang et al., 2023), where OpenPose-format skeletons are the most popular pose-conditioning input for generating images of a character in a specified pose. This means that even users who have never run OpenPose directly often use its keypoint format as the de facto standard for skeleton specifications in 2D image generation.
The broader Perceptual Computing Lab effort that produced OpenPose also produced the Panoptic Studio, the multiview hand keypoint dataset, the Total Capture project for full-body markerless motion capture, and the Monocular Total Capture system. Many of these contributions feed into the current generation of avatar and codec avatar work at Meta Reality Labs.