OpenPose

Updated Jun 28, 2026

OpenPose is an open-source library for real-time multi-person 2D pose estimation that detects body, foot, hand, and facial keypoints in images and video. Developed at the Carnegie Mellon University (CMU) Perceptual Computing Lab, it was, in the words of its maintainers, "the first real-time multi-person system to jointly detect human body, hand, facial, and foot keypoints (in total 135 keypoints) on single images." ^[1]^[6] The system was introduced in the CVPR 2017 paper "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields" by Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh, and extended in the 2019 IEEE TPAMI journal version with Gines Hidalgo as additional co-author. ^[1]^[2]

The central technical contribution is the Part Affinity Field (PAF), described in the paper as "a set of 2D vector fields that encode the location and orientation of limbs over the image domain." ^[1] PAFs let the system parse multiple people from a single forward pass through a convolutional neural network without first running a person detector, which is the standard first step in top-down approaches. As the authors state, "this bottom-up system achieves high accuracy and realtime performance, regardless of the number of people in the image." ^[1] Because the runtime of the body detection stage is essentially independent of the number of people in the scene, OpenPose holds a nearly constant frame rate on crowded video, whereas top-down systems slow down with each additional person. ^[1]

The project is hosted at github.com/CMU-Perceptual-Computing-Lab/openpose, written mostly in C++ and CUDA, with Python bindings, a Caffe back end, ROS integration, and support for Windows, Ubuntu, and macOS. ^[6] It has been cited tens of thousands of times and remains a standard reference for bottom-up pose estimation, even after later systems such as HRNet, PifPaf, BlazePose, MoveNet, and ViTPose surpassed it on individual axes of speed or accuracy. ^[7]^[8]^[10]^[12]^[14]

What is OpenPose?

OpenPose is a free, open-source toolkit from CMU that takes an RGB image or video and returns the 2D skeleton coordinates of every person it can find, including detailed hand and face landmarks. ^[6] It was the first published system to deliver simultaneous, real-time detection of body, foot, hand, and facial keypoints for multiple people in a single image, with up to 135 keypoints per person when the body/foot, two-hand, and face models are run together. ^[1]^[6] Its defining design choice is a bottom-up parsing strategy built on Part Affinity Fields, which makes its runtime nearly invariant to crowd size, in contrast to top-down methods whose cost grows with each additional person. ^[1]

When was OpenPose released and who built it?

The OpenPose lineage traces directly back to the Convolutional Pose Machines (CPM) paper by Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh, published at CVPR 2016. ^[3] CPM showed that a sequential convolutional neural network could implicitly model long-range dependencies between joints by stacking stages that operate on belief maps from previous stages, producing increasingly refined heatmaps for each body part without an explicit graphical model. CPM achieved state-of-the-art accuracy on the MPII, LSP, and FLIC pose benchmarks, but it was a single-person method. ^[3]

The step from single-person CPM to multi-person OpenPose came in late 2016. Cao, Simon, Wei, and Sheikh posted the first multi-person paper to arXiv as preprint 1611.08050 in November 2016, then presented it as an oral at CVPR 2017 in Honolulu under the title "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields." ^[1] That same year, Simon, Hanbyul Joo, Iain Matthews, and Sheikh introduced the hand keypoint detector via multiview bootstrapping at CVPR 2017, using the CMU Panoptic Studio to generate 3D-triangulated training labels for hand poses across hundreds of synchronized cameras. ^[4] The combination of the body and hand pipelines, plus an integrated face keypoint model and a foot dataset annotated by the CMU group, became the OpenPose system released publicly in mid-2017. ^[6]

The extended journal version appeared on arXiv in December 2018 as 1812.08008 and was published online in IEEE Transactions on Pattern Analysis and Machine Intelligence in 2019 (the print issue is TPAMI Vol. 43, Issue 1, pp. 172-186, dated January 2021). ^[2] The TPAMI version adds Gines Hidalgo as a fifth author (listed second in the byline), introduces the BODY_25 model that integrates body and foot keypoints into a single network, replaces simultaneous body part and PAF refinement with a PAF-only refinement strategy that runs faster, and reports "the first combined body and foot keypoint detector, based on an internal annotated foot dataset." ^[2]

Who are the OpenPose authors?

The OpenPose authors all worked under Yaser Sheikh's Perceptual Computing Lab at the Robotics Institute, CMU. ^[6] Sheikh later served as research lead at Facebook Reality Labs in Pittsburgh, where many of the same people moved on to work on photorealistic avatars and codec avatars. Tomas Simon's PhD thesis covered the multiview bootstrapping work that became the OpenPose hand pipeline. ^[4] Hanbyul Joo, who built the Panoptic Studio dome at CMU, is a co-author on the hand paper. ^[4]^[5] Zhe Cao led the original CVPR 2017 multi-person paper, and Gines Hidalgo joined as the engineer who wrote and maintained much of the production C++ codebase. ^[1]^[2]

What are Part Affinity Fields?

The Part Affinity Field is a 2D vector field defined over the image plane, one field per limb type, described in the paper as "a set of 2D vector fields that encode the location and orientation of limbs over the image domain." ^[1] For a given limb that connects two body parts (for instance, the right elbow to the right wrist), the PAF is non-zero at every pixel that lies on the limb, meaning within a small distance of the line segment between the two endpoints, and zero elsewhere. Where the field is non-zero, its value is the unit vector pointing from the start joint to the end joint. This representation gives the network a way to vote for an entire limb's existence and orientation, instead of voting only for joint locations and then guessing how to connect them. ^[1]
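The definition above can be made concrete with a small rasterizer. This is a minimal sketch in pure Python, not OpenPose's training code: the function name `limb_paf`, the grid dimensions, and the `limb_width` distance threshold are illustrative assumptions.

```python
import math

def limb_paf(start, end, height, width_px, limb_width=1.0):
    """Return a height x width_px grid of (vx, vy) vectors for one limb.

    Pixels within `limb_width` of the segment start->end get the unit
    vector pointing from start to end; all other pixels get (0.0, 0.0).
    `limb_width` is a hypothetical threshold chosen for illustration.
    """
    (x1, y1), (x2, y2) = start, end
    length = math.hypot(x2 - x1, y2 - y1)
    if length == 0:
        return [[(0.0, 0.0)] * width_px for _ in range(height)]
    ux, uy = (x2 - x1) / length, (y2 - y1) / length
    field = []
    for y in range(height):
        row = []
        for x in range(width_px):
            dx, dy = x - x1, y - y1
            along = dx * ux + dy * uy        # position along the limb axis
            across = abs(dx * uy - dy * ux)  # perpendicular distance to the axis
            on_limb = 0.0 <= along <= length and across <= limb_width
            row.append((ux, uy) if on_limb else (0.0, 0.0))
        field.append(row)
    return field

# A horizontal "forearm" from (1, 2) to (6, 2) on a 5 x 8 grid.
paf = limb_paf(start=(1, 2), end=(6, 2), height=5, width_px=8)
```

Indexing is `paf[y][x]`: pixels on the forearm hold the unit vector pointing right, and pixels far from it hold the zero vector.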

At inference time, the system runs two operations. First, it extracts candidate body part locations by finding peaks in the per-part confidence heatmaps. Then, for every pair of candidate parts that could form a valid limb, it samples the PAF along the line segment between them, takes the dot product of each sampled vector with the unit vector of the candidate limb direction, and integrates the result along the segment. A high integral means the field strongly supports a limb between those two candidates. The result is a per-edge weight that feeds into a bipartite matching problem. ^[1]
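The scoring step can be approximated by sampling the field at evenly spaced points. The sketch below assumes the PAF is available as a callable `(x, y) -> (vx, vy)`; the names `paf_score` and `band_field` and the sample count are hypothetical, chosen only for illustration.

```python
import math

def paf_score(paf, a, b, samples=10):
    """Approximate the line integral of (PAF . limb direction) from a to b.

    `paf` is a callable (x, y) -> (vx, vy); `samples` must be at least 2.
    Returns the mean dot product over the sampled points.
    """
    (xa, ya), (xb, yb) = a, b
    length = math.hypot(xb - xa, yb - ya)
    if length == 0:
        return 0.0
    ux, uy = (xb - xa) / length, (yb - ya) / length
    total = 0.0
    for i in range(samples):
        t = i / (samples - 1)  # interpolation parameter from 0 to 1
        vx, vy = paf(xa + t * (xb - xa), ya + t * (yb - ya))
        total += vx * ux + vy * uy
    return total / samples

def band_field(x, y):
    """Toy field: unit vectors pointing right inside a horizontal band."""
    return (1.0, 0.0) if abs(y - 2) <= 1 else (0.0, 0.0)

aligned = paf_score(band_field, (1, 2), (6, 2))     # candidate follows the field
crossing = paf_score(band_field, (1, 2), (1, 8))    # candidate is perpendicular
reversed_ = paf_score(band_field, (6, 2), (1, 2))   # candidate points the wrong way
```

A candidate pair that follows the field scores close to 1, a perpendicular pair scores 0, and a reversed pair scores negative, which is what lets the matching step reject wrong associations.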

Rather than solve the full multi-person assembly as a single global matching problem (which is NP-hard), the system relaxes it to a sequence of independent bipartite matchings, one per limb type. Each matching can be solved with the Hungarian algorithm or with a greedy pass over the scored candidate pairs, and the matched limbs are then merged into person assemblies by the joints they share. The authors report that this relaxation approximates the global solution well for a tree-shaped skeleton, and in practice it costs a small fraction of the network forward pass. ^[1]
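A per-limb matching can be sketched as follows. This example uses a greedy pass over sorted scores as a simpler stand-in for an optimal Hungarian solver; the function name `match_limb` and the `min_score` cutoff are assumptions made for illustration.

```python
def match_limb(scores, min_score=0.05):
    """Greedy one-to-one matching for a single limb type.

    scores[i][j] is the PAF score between start-joint candidate i and
    end-joint candidate j. Returns (i, j, score) tuples in which every
    candidate is used at most once. `min_score` is a hypothetical cutoff.
    """
    pairs = sorted(
        ((s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row)),
        reverse=True,
    )
    used_i, used_j, matches = set(), set(), []
    for s, i, j in pairs:
        if s < min_score:
            break  # remaining pairs are all weaker than the cutoff
        if i in used_i or j in used_j:
            continue  # one of the two candidates already has a limb
        used_i.add(i)
        used_j.add(j)
        matches.append((i, j, s))
    return matches

# Two elbows and two wrists: rows are elbow candidates, columns are wrists.
elbow_to_wrist = [
    [0.9, 0.1],
    [0.2, 0.8],
]
matches = match_limb(elbow_to_wrist)
```

Here elbow 0 pairs with wrist 0 and elbow 1 with wrist 1. Running one such matching per limb type and joining limbs that share a joint yields the per-person skeletons.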

How does OpenPose work?

The original CVPR 2017 architecture is a two-branch, multi-stage convolutional network. ^[1] The first 10 layers of VGG-19 (pretrained on ImageNet) extract feature maps F from the input image. F is then fed in parallel to two branches: one branch predicts a set of part confidence maps S (one heatmap per body part), and the other predicts a set of part affinity fields L (one 2D vector field per limb). Each branch runs over multiple stages, with each stage taking as input both F and the predictions from the previous stage. At every stage, the network applies an L2 loss between the predicted S and L and the ground-truth heatmaps and PAFs. The total loss is the sum of the per-stage losses, which provides intermediate supervision and helps avoid vanishing gradients in deep networks. ^[1]
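The intermediate supervision described above amounts to summing an L2 loss over every stage and both branches. The following is a simplified sketch on flat lists of floats: the names `l2_loss` and `total_loss` are hypothetical, and any per-pixel masking or weighting used in actual training is omitted.

```python
def l2_loss(pred, target):
    """Sum of squared differences between two equal-length flat lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def total_loss(stage_preds, gt_S, gt_L):
    """Sum the confidence-map and PAF losses over all stages.

    stage_preds is a list of (S_t, L_t) pairs, one per stage, where S_t is
    the predicted confidence maps and L_t the predicted PAFs, both flattened.
    """
    return sum(l2_loss(S, gt_S) + l2_loss(L, gt_L) for S, L in stage_preds)

# Two stages; the second stage has refined the confidence map prediction.
stages = [
    ([0.5, 0.0], [1.0]),
    ([1.0, 0.0], [1.0]),
]
loss = total_loss(stages, gt_S=[1.0, 0.0], gt_L=[0.0])
```

Because every stage contributes its own term, early stages receive a direct training signal instead of relying only on gradients propagated back from the final stage.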

The original COCO model produces 38 PAF channels (19 limb types times 2 components per vector field) and 19 confidence map channels (18 body parts plus a background channel). The MPII variant uses different channel counts to match the MPII keypoint set. ^[1]
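The channel arithmetic can be written out as a small helper. This is only an illustration of the counts stated above; the function name `output_channels` is hypothetical.

```python
def output_channels(num_parts, num_limbs):
    """Channel counts for the two network outputs.

    Confidence maps: one channel per body part plus one background channel.
    PAFs: two channels (x and y components) per limb type.
    """
    return {"confidence_maps": num_parts + 1, "pafs": num_limbs * 2}

coco = output_channels(num_parts=18, num_limbs=19)
```

For the COCO model this gives 19 confidence-map channels and 38 PAF channels.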

The TPAMI 2019 architecture changes two things. ^[2] First, instead of refining both branches in parallel, the network refines only the PAF branch across multiple stages, and only when the PAFs have stabilized does it compute the confidence maps in a single stage at the end. The authors observed empirically that PAFs benefit much more from iterative refinement than confidence maps, and that running fewer confidence-map stages improves both accuracy and runtime. Second, the BODY_25 model integrates body and foot keypoints into the same network so that one forward pass produces 25 keypoint heatmaps instead of 18, eliminating a separate model for foot estimation. ^[2]

What keypoint formats does OpenPose output?

OpenPose supports multiple keypoint configurations. The body output format is selected at runtime, and the hand and face models are independent and can be enabled together for a 135-keypoint whole-body output. ^[6]

Format	Keypoints	Notes
BODY_25	25	Default and recommended; includes 6 foot keypoints (big toe, small toe, heel for each foot) plus the standard COCO body joints
COCO	18	Original MS COCO keypoint set extended with the neck joint
MPI	15	MPII Human Pose joint layout; least accurate but fastest on CPU
MPI_4_layers	15	Reduced-depth MPI variant for lower-end hardware
Hand	21 per hand	Wrist plus 4 keypoints per finger; trained via multiview bootstrapping
Face	70	Eyes, eyebrows, nose, mouth, jawline; integrated face model

The BODY_25 model is recommended by the maintainers because it is faster than the COCO model on GPU and is the only configuration with foot keypoints, which matter for downstream applications like gait analysis and full-body motion capture. ^[6] The COCO 18-keypoint format is the original output used in the CVPR 2017 paper. ^[1]

What datasets was OpenPose trained on?

OpenPose models were trained and evaluated against three primary datasets, plus a custom foot dataset released by the CMU group. ^[1]^[2]

Dataset	Keypoints	Scale	Use in OpenPose
MS COCO Keypoints	17 body	~150,000 person instances over 200,000+ images	Primary training and evaluation set; basis for the COCO 2016 keypoint challenge
MPII Human Pose	16 body	~25,000 images, 40,000 person instances	Used to evaluate the multi-person variant; OpenPose set state of the art on MPII Multi-Person
CMU Panoptic Studio	21 per hand	480 VGA + 31 HD synchronized cameras; up to 200,000+ frames	Source of multiview-bootstrapped hand training data and 3D supervision
CMU Foot Dataset	6 foot keypoints	~14,000 person instances annotated on COCO	Released alongside the TPAMI paper for BODY_25 training

The Panoptic Studio is a geodesic dome at CMU described in the ICCV 2015 paper by Hanbyul Joo and colleagues. ^[5] It contains 480 VGA cameras (640x480 at 25 fps), 31 HD cameras (1920x1080 at 30 fps), and 10 Kinect II sensors. The dome was specifically designed to capture social interactions in unconstrained motion, with cameras dense enough that occluded body parts are nearly always visible from at least one viewpoint. ^[5] The hand keypoint detector training procedure exploits this multiview redundancy: an initial weak detector produces noisy 2D labels in each camera, and triangulation rejects views where the 2D detection disagrees with the 3D-reconstructed point. The agreed-upon 3D points are reprojected into every camera and used as labels in the next training iteration. After several iterations the detector improves enough to label hand keypoints reliably in single RGB images at runtime. ^[4]
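The view-rejection step of multiview bootstrapping can be sketched with a reprojection check. This is a heavily simplified illustration: it assumes the 3D point has already been triangulated, uses an idealized axis-aligned pinhole camera, and the names `project` and `consistent_views` and the `max_error` threshold are hypothetical.

```python
import math

def project(point3d, cam_center, focal=1.0):
    """Pinhole projection for an axis-aligned camera looking down +Z."""
    X, Y, Z = (p - c for p, c in zip(point3d, cam_center))
    return (focal * X / Z, focal * Y / Z)

def consistent_views(point3d, cameras, detections, max_error=0.05):
    """Indices of cameras whose 2D detection agrees with the 3D point.

    A view is kept when the distance between its detection and the
    reprojected 3D point is at most `max_error` (an assumed threshold).
    """
    keep = []
    for idx, (cam, det) in enumerate(zip(cameras, detections)):
        u, v = project(point3d, cam)
        if math.hypot(u - det[0], v - det[1]) <= max_error:
            keep.append(idx)
    return keep

fingertip = (0.0, 0.0, 4.0)                       # triangulated 3D point
cameras = [(0, 0, 0), (1, 0, 0), (-1, 0, 0)]      # camera centers
detections = [(0.0, 0.0), (-0.25, 0.01), (0.6, 0.0)]  # last view is an outlier
inliers = consistent_views(fingertip, cameras, detections)
```

In the full procedure, the reprojections of the accepted 3D point into all cameras, including the views that were rejected, become training labels for the next iteration.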

How fast is OpenPose?

The CVPR 2017 paper reported the following numbers on the MS COCO keypoints test-dev set. ^[1]

Method	AP	Approach
OpenPose (CMU-Pose)	61.8	Bottom-up
Mask R-CNN	62.7	Top-down
G-RMI	60.5	Top-down

On the MPII Multi-Person test set, the same paper reported 75.6 mAP, well above the prior state of the art at the time. The method also placed first in the inaugural COCO 2016 keypoint challenge. ^[1]

Runtime numbers in the CVPR 2017 paper were measured on a laptop with a single Nvidia GeForce GTX 1080 GPU, with 1080p input resized to network resolution. The body model runs at roughly 8.8 frames per second on multi-person video, and the runtime is essentially flat with respect to the number of people in the frame because the network forward pass dominates and the bipartite matching step is negligible. ^[1] Top-down methods like Mask R-CNN, by contrast, process each detected person separately, so their runtime grows roughly linearly with the number of people. For a scene with 19 people, the OpenPose paper reports that bottom-up parsing was about 6 times faster than top-down on the same hardware. ^[1]
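The scaling argument can be illustrated with a toy cost model. All timing constants below are hypothetical placeholders chosen to show the shape of the two curves; they are not measurements from the paper.

```python
def bottom_up_ms(num_people, network_ms=100.0, parse_ms_per_person=0.1):
    """One network pass plus a small per-person parsing cost."""
    return network_ms + parse_ms_per_person * num_people

def top_down_ms(num_people, detector_ms=50.0, pose_ms_per_person=30.0):
    """One detector pass plus a full pose-estimation pass per person."""
    return detector_ms + pose_ms_per_person * num_people

for n in (1, 5, 19):
    print(n, round(bottom_up_ms(n), 1), round(top_down_ms(n), 1))
```

Under these assumed constants the bottom-up cost barely moves between 1 and 19 people, while the top-down cost grows several-fold, which is the qualitative behavior the paper describes.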

Later hardware and architecture revisions changed these numbers. On a desktop GPU such as the Titan Xp, the BODY_25 model runs at about 22 fps for a single person, and throughput drops to somewhere between 1 and 5 fps when the hand and face models are enabled at full resolution, depending on input size. ^[6] The CPU-only build is much slower (typically below 1 fps for full-body inference), which is one reason MediaPipe BlazePose and MoveNet became preferred for mobile and on-device use cases. ^[10]^[14]

How is OpenPose implemented?

The production OpenPose codebase is C++ with CUDA kernels, with Caffe used as the back-end deep learning framework for the network forward pass. ^[6] The repository ships with a command-line demo that reads from images, video files, webcams, IP cameras, or FLIR machine vision cameras, and writes keypoint output to JSON, XML, or YML files alongside rendered visualizations as PNG, JPG, or AVI. A Python wrapper exposes the same functionality for scripting, and a separately maintained ROS package provides integration into robotics pipelines. ^[6]
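Downstream scripts usually consume the JSON output rather than call the library directly. The sketch below assumes the commonly documented layout, a top-level `people` list whose entries hold a flat `pose_keypoints_2d` array of x, y, confidence triplets; these field names should be checked against the output documentation for the installed version. The function name `read_people` and the sample values are made up for illustration.

```python
import json

def read_people(json_text):
    """Return, for each person, a list of (x, y, confidence) body keypoints."""
    doc = json.loads(json_text)
    people = []
    for person in doc.get("people", []):
        flat = person.get("pose_keypoints_2d", [])
        # Regroup the flat [x0, y0, c0, x1, y1, c1, ...] array into triplets.
        triplets = [tuple(flat[i:i + 3]) for i in range(0, len(flat) - 2, 3)]
        people.append(triplets)
    return people

# Made-up sample with one person and two keypoints.
sample = (
    '{"version": 1.3, "people": ['
    '{"pose_keypoints_2d": [100.0, 50.0, 0.9, 110.0, 80.0, 0.8]}]}'
)
keypoints = read_people(sample)
```

Keypoints that were not detected are typically reported with zero confidence, so callers should filter on the third element before using the coordinates.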

The repository supports CUDA on Nvidia GPUs, OpenCL on AMD GPUs, and a CPU-only fallback. ^[6] Build instructions cover Ubuntu, Windows, and macOS, plus Nvidia Jetson TX2 for embedded use. The default models are downloaded as part of the build process from CMU servers. Because the network is implemented in Caffe, GPU memory requirements for the full BODY_25 plus hand plus face configuration are substantial; the documentation lists 4 GB of GPU memory as a minimum and recommends 8 GB or more for stable inference at high resolution. ^[6]

A notable feature is single-person 3D triangulation. If the user runs OpenPose on synchronized video from multiple calibrated cameras, the library can triangulate the 2D keypoints across views to produce a 3D pose. ^[6] This is separate from the much more capable multi-person 3D pipelines built on top of the Panoptic Studio, which use OpenPose 2D keypoints from each of the dome's cameras. ^[5]

Is OpenPose free to use?

OpenPose is released under a non-commercial research license. Academic and personal use is free, and the software "may be redistributed under these conditions." ^[6] Any commercial use requires a separate paid license, arranged through CMU's technology transfer office via Flintbox. ^[6] This licensing model has been a recurring point of friction for industry users: companies that want to ship products containing OpenPose either pay the CMU license fee, maintain an internal reimplementation, or switch to alternatives. Some alternatives are permissively licensed, such as MediaPipe (Apache 2.0) and MoveNet (Apache 2.0); others carry restrictions of their own, such as AlphaPose (academic, with separate commercial terms) and YOLO-Pose (GPL-3.0 inherited from Ultralytics YOLOv5). ^[9]^[10]^[11]^[14] The license terms are one of the main reasons OpenPose itself is rarely embedded in shipping consumer products even though the algorithms it pioneered are widely used.

How does OpenPose compare to HRNet, BlazePose, and ViTPose?

OpenPose was the leading bottom-up pose estimator from 2017 to roughly 2019. Since then, several systems have surpassed it on different axes: top-down methods like HRNet and ViTPose lead in pure accuracy, mobile methods like BlazePose and MoveNet are dramatically faster on phones, and end-to-end YOLO-style architectures collapse detection and pose into a single pass. ^[7]^[10]^[11]^[12]^[14]

Method	Year	Type	Key trait	Where it beats OpenPose
OpenPose	2017	Bottom-up, PAF	First real-time multi-person system	(baseline)
AlphaPose / RMPE	2017	Top-down, SSTN	First open-source top-down system at 70+ AP on COCO	Higher accuracy on isolated persons
HRNet	2019	Top-down	Maintains high-resolution features through the whole network	75.5 AP vs 61.8 on COCO
PifPaf	2019	Bottom-up, PIF + PAF	Composite fields with Laplace loss	Better in low-resolution and occluded scenes
BlazePose	2020	Top-down, MobileNet	33-keypoint single-person model for mobile	25-75x faster than OpenPose at similar AR/fitness accuracy
MoveNet	2021	CenterNet-based	Lightning and Thunder variants for edge devices	30+ FPS on phones; runs in browser via TF.js
YOLO-Pose	2022	End-to-end YOLO	Joint object and keypoint detection in one pass	90.3% AP50 on COCO test-dev with no test-time augmentation
ViTPose	2022	Top-down, ViT	Plain vision transformer backbone	80.9 AP on COCO test-dev (single model)
ViTPose++	2023	Top-down, ViT	Generic body pose with knowledge token transfer	State of the art across multiple pose tasks
DWPose	2023	Top-down, distilled	Two-stage distillation for whole-body keypoints	66.5 AP on COCO-WholeBody

Reference pointers for the comparison table: HRNet ^[7], PifPaf ^[8], RMPE/AlphaPose ^[9], BlazePose ^[10], YOLO-Pose ^[11], ViTPose ^[12], DWPose ^[13], MoveNet ^[14].

The lasting influence of OpenPose comes less from its raw accuracy numbers (which are now well behind the leaders) and more from the Part Affinity Field idea and the bottom-up parsing strategy, both of which were copied or adapted by many follow-on systems including PifPaf and HigherHRNet. ^[8] The pre-trained OpenPose body model is also the most common skeleton extractor used as input to ControlNet pose-conditioned image generators, which keeps it in active use long after it stopped being the most accurate pose estimator.

What is OpenPose used for?

OpenPose has been deployed across a wide range of computer vision and human-motion analysis settings. Some of the better-documented application areas are listed below.

Domain	Use of OpenPose
Sports analytics	Player tracking and biomechanical analysis in basketball, baseball, football, tennis, and cycling; broadcasters and franchises use OpenPose-style keypoints to derive metrics like joint angles and stride lengths
Healthcare and rehabilitation	Gait analysis, balance assessment, range-of-motion tracking for physical therapy, and screening for movement disorders; cited in clinical studies as a low-cost markerless alternative to lab motion capture
Human-computer interaction	Gesture recognition, virtual try-on, body-driven avatars, dance and fitness apps
Motion capture for animation	Markerless mocap for indie animators and VFX; combined with multiview rigs to drive 3D character animation
Surveillance and security	Crowd density estimation, fall detection, fight detection, and unusual behavior recognition in public spaces
Sign language recognition	Hand and finger keypoints used as input to sequence models trained on continuous sign language; 21-keypoint hand model is well suited to fingerspelling
Augmented reality	Body-tracking filters and effects on platforms like Snapchat, TikTok, and Instagram; mostly via OpenPose-derived techniques rather than the OpenPose binary itself due to licensing
Driver monitoring	Detecting drowsiness, distraction, and abnormal posture from in-cabin cameras
Industrial ergonomics	Posture monitoring and ergonomic risk assessment for assembly-line and warehouse workers
AI image generation	Pose conditioning input to ControlNet-style diffusion models, where an OpenPose skeleton constrains the pose of a generated character
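Several of the applications above, such as biomechanical analysis and range-of-motion tracking, reduce to computing angles between keypoints. A minimal sketch, assuming three 2D keypoints are already available as (x, y) pairs; the function name `joint_angle` and the sample coordinates are hypothetical.

```python
import math

def joint_angle(a, b, c):
    """Angle at b, in degrees, formed by the points a-b-c (each an (x, y) pair)."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        raise ValueError("coincident keypoints: angle is undefined")
    cos_t = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

# Elbow angle from shoulder, elbow, and wrist positions (made-up values).
bent = joint_angle((0, 1), (0, 0), (1, 0))
straight = joint_angle((-1, 0), (0, 0), (1, 0))
```

Because the keypoints are 2D projections, the result is the angle in the image plane and depends on the camera viewpoint; recovering true joint angles needs 3D keypoints.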

What are the limitations of OpenPose?

OpenPose has several well-documented weaknesses that motivated the wave of follow-up systems.

It produces only 2D keypoints from a single camera. Recovering 3D pose requires either multiple calibrated cameras (Panoptic-style triangulation) or a separate 2D-to-3D lifting model trained on datasets like Human3.6M. ^[5] The library does ship a single-person triangulation utility for multi-camera setups, but multi-person 3D from a single view is not supported. ^[6]

Occlusion remains hard. Severe self-occlusion, person-on-person occlusion in crowded scenes, and tightly packed limbs cause the bottom-up parser to merge or drop keypoints. PifPaf was specifically designed to address this case, and it outperformed OpenPose by a wide margin on crowded benchmarks. ^[8]

The non-commercial license is a barrier to industry adoption. Companies that want to ship pose estimation in a consumer product typically choose MediaPipe BlazePose or MoveNet, both of which are Apache-licensed. ^[10]^[14]

Accuracy is no longer competitive at the top of the leaderboard. HRNet and ViTPose substantially exceed OpenPose on the COCO keypoint AP metric, and DWPose leads on the whole-body benchmark. ^[7]^[12]^[13]

Speed is no longer competitive on mobile or edge devices. BlazePose and MoveNet run in real time on phones, while OpenPose's full BODY_25 plus hand plus face pipeline still requires a discrete GPU. ^[10]^[14] The Lightweight OpenPose project (Daniil Osokin, 2018) reduces the network to make it tractable on CPU, but it sacrifices accuracy.

GPU memory requirements are high. The full whole-body configuration with hand and face models needs at least 4 GB of GPU memory and ideally 8 GB or more for stable high-resolution inference. ^[6]

What is the legacy and influence of OpenPose?

The OpenPose papers have been cited tens of thousands of times across both the academic and applied literature. ^[1]^[2] The Part Affinity Field idea was directly extended by PifPaf, which substituted Laplace-distributed composite fields. ^[8] It also helped establish bottom-up parsing as a mainstream approach alongside other grouping strategies such as associative embedding (Newell et al., 2017), which HigherHRNet later adopted. The Convolutional Pose Machines stage-wise refinement pattern that OpenPose inherited from Wei et al. 2016 became standard practice in heatmap-based pose estimation. ^[3]

Beyond the academic citations, OpenPose's pre-trained body skeletons are now embedded in commercial workflows that the original authors did not anticipate. The most visible example is ControlNet for Stable Diffusion (Zhang et al., 2023), where OpenPose-format skeletons are the most popular pose-conditioning input for generating images of a character in a specified pose. This means that even users who have never run OpenPose directly often use its keypoint format as the de facto standard for skeleton specifications in 2D image generation.

The broader Perceptual Computing Lab effort that produced OpenPose also produced the Panoptic Studio, the multiview hand keypoint dataset, the Total Capture project for full-body markerless motion capture, and the Monocular Total Capture system. ^[5] Many of these contributions feed into the current generation of avatar and codec avatar work at Meta Reality Labs, and into newer whole-body human models such as Meta's Sapiens foundation models for human-centric vision tasks.

ELI5: OpenPose explained simply

Imagine a computer watching a video of a crowded room and instantly drawing a stick figure on top of every person, including little dots on their fingers and faces. That is what OpenPose does. Most older programs would first find each person, cut them out one at a time, and then draw the stick figure, which gets slow when there are lots of people. OpenPose instead looks at the whole picture at once. It guesses where all the elbows, knees, wrists, and noses are, and it also guesses the "glue" (the Part Affinity Fields) that says which elbow belongs to which wrist for the same body. Then it snaps the dots together into complete stick figures. Because it does this all in one look, adding more people barely slows it down.

References

1. Cao, Z., Simon, T., Wei, S.-E., and Sheikh, Y. (2017). "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields." CVPR 2017 (Oral). arXiv:1611.08050. https://openaccess.thecvf.com/content_cvpr_2017/papers/Cao_Realtime_Multi-Person_2D_CVPR_2017_paper.pdf
2. Cao, Z., Hidalgo, G., Simon, T., Wei, S.-E., and Sheikh, Y. (2019). "OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields." IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 43, Issue 1, pp. 172-186. arXiv:1812.08008. https://arxiv.org/abs/1812.08008
3. Wei, S.-E., Ramakrishna, V., Kanade, T., and Sheikh, Y. (2016). "Convolutional Pose Machines." CVPR 2016. arXiv:1602.00134. https://openaccess.thecvf.com/content_cvpr_2016/papers/Wei_Convolutional_Pose_Machines_CVPR_2016_paper.pdf
4. Simon, T., Joo, H., Matthews, I., and Sheikh, Y. (2017). "Hand Keypoint Detection in Single Images using Multiview Bootstrapping." CVPR 2017. arXiv:1704.07809. https://openaccess.thecvf.com/content_cvpr_2017/papers/Simon_Hand_Keypoint_Detection_CVPR_2017_paper.pdf
5. Joo, H., et al. (2015). "Panoptic Studio: A Massively Multiview System for Social Motion Capture." ICCV 2015. https://openaccess.thecvf.com/content_iccv_2015/papers/Joo_Panoptic_Studio_A_ICCV_2015_paper.pdf
6. CMU-Perceptual-Computing-Lab. "OpenPose: Real-time multi-person keypoint detection library for body, face, hands, and foot estimation." GitHub repository. https://github.com/CMU-Perceptual-Computing-Lab/openpose
7. Sun, K., Xiao, B., Liu, D., and Wang, J. (2019). "Deep High-Resolution Representation Learning for Human Pose Estimation" (HRNet). CVPR 2019. https://openaccess.thecvf.com/content_CVPR_2019/papers/Sun_Deep_High-Resolution_Representation_Learning_for_Human_Pose_Estimation_CVPR_2019_paper.pdf
8. Kreiss, S., Bertoni, L., and Alahi, A. (2019). "PifPaf: Composite Fields for Human Pose Estimation." CVPR 2019. arXiv:1903.06593. https://openaccess.thecvf.com/content_CVPR_2019/papers/Kreiss_PifPaf_Composite_Fields_for_Human_Pose_Estimation_CVPR_2019_paper.pdf
9. Fang, H.-S., Xie, S., Tai, Y.-W., and Lu, C. (2017). "RMPE: Regional Multi-Person Pose Estimation" (AlphaPose). ICCV 2017. https://openaccess.thecvf.com/content_ICCV_2017/papers/Fang_RMPE_Regional_Multi-Person_ICCV_2017_paper.pdf
10. Bazarevsky, V., Grishchenko, I., Raveendran, K., Zhu, T., Zhang, F., and Grundmann, M. (2020). "BlazePose: On-device Real-time Body Pose tracking." CV4ARVR Workshop, CVPR 2020. arXiv:2006.10204. https://arxiv.org/abs/2006.10204
11. Maji, D., Nagori, S., Mathew, M., and Poddar, D. (2022). "YOLO-Pose: Enhancing YOLO for Multi Person Pose Estimation Using Object Keypoint Similarity Loss." CVPRW 2022. https://openaccess.thecvf.com/content/CVPR2022W/ECV/papers/Maji_YOLO-Pose_Enhancing_YOLO_for_Multi_Person_Pose_Estimation_Using_Object_CVPRW_2022_paper.pdf
12. Xu, Y., Zhang, J., Zhang, Q., and Tao, D. (2022). "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation." NeurIPS 2022. arXiv:2204.12484. https://arxiv.org/abs/2204.12484
13. Yang, Z., et al. (2023). "Effective Whole-body Pose Estimation with Two-stages Distillation" (DWPose). ICCV 2023 Workshop. arXiv:2307.15880. https://arxiv.org/abs/2307.15880
14. Google Research (2021). "Next-Generation Pose Detection with MoveNet and TensorFlow.js." https://blog.tensorflow.org/2021/05/next-generation-pose-detection-with-movenet-and-tensorflowjs.html
