Sapiens (computer vision)
Last reviewed
Jun 3, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 1,363 words
Sapiens is a family of human-centric computer vision foundation models developed by Meta AI (Reality Labs). The models target four core tasks defined over images of people: 2D human pose estimation, body-part segmentation, depth estimation, and surface-normal prediction. Sapiens models are Vision Transformers pretrained by masked autoencoding on a curated dataset of roughly 300 million in-the-wild human images, and they run natively at 1K (1024-pixel) resolution. The work was presented as an oral paper at the European Conference on Computer Vision (ECCV) 2024, where it was also named an award candidate. [1][2][3]
The accompanying paper, "Sapiens: Foundation for Human Vision Models," was authored by Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Zhaoen Su, Austin James, Peter Selednik, Stuart Anderson, and Shunsuke Saito. It was posted to arXiv on 22 August 2024 and published in the ECCV 2024 proceedings. Code and model weights were released openly on GitHub and Hugging Face. [1][4][5]
Human-centric perception underpins applications such as augmented and virtual reality, telepresence avatars, motion capture, and photography. Historically each of the four tasks above had its own specialized models, datasets, and architectures, which made systems hard to maintain and limited generalization to unconstrained imagery. Sapiens takes the opposite stance: a single backbone, pretrained once on a very large collection of human images, is finetuned with lightweight task-specific heads. The central bet is that scaling both the model and a human-only pretraining corpus, while keeping inference resolution high, produces representations that transfer across all four tasks and generalize to in-the-wild photos. [1][6]
The design follows the broader foundation model recipe: self-supervised pretraining on a large unlabeled corpus, followed by supervised finetuning for downstream tasks. Sapiens specializes that recipe to humans by curating its pretraining data exclusively from human images and by operating at a resolution high enough to capture fine structure such as fingers and facial detail. [1]
| Task | Output | Evaluation benchmark |
|---|---|---|
| 2D pose estimation | Keypoint coordinates (up to 308 whole-body keypoints in training) | Humans-5K (114 common keypoints) |
| Body-part segmentation | Per-pixel class label (28-class body-part scheme) | Humans-2K |
| Depth estimation | Per-pixel metric/relative depth | THuman2.0, Hi4D |
| Surface-normal prediction | Per-pixel 3D normal vector | THuman2.0, Hi4D |
For pose estimation, the released models support several keypoint vocabularies: a 17-point COCO body skeleton, the 133-point COCO-WholeBody format (body, face, hands, and feet), and a dense 308-point set with detailed facial and hand landmarks. The dense vocabulary is what makes Sapiens useful for high-fidelity face and hand tracking, not just coarse body pose. [7]
Sapiens is pretrained with masked autoencoding (MAE), a self-supervised objective in which a large fraction of image patches are masked and the model learns to reconstruct the missing content. The standard configuration masks 75 percent of patches, and the paper reports that performance held up even at mask ratios as high as 95 percent, reflecting the redundancy of human imagery. Images are split into 16x16 pixel patches at a 1024-pixel input resolution. [1]
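The patch and masking figures above imply a specific token budget per image, which can be worked out directly. The following is a minimal sketch, assuming a square 1024x1024 input and uniform random masking as in the standard MAE recipe; the function names are illustrative and are not taken from the Sapiens code.

```python
import random


def patch_count(image_size: int = 1024, patch_size: int = 16) -> int:
    """Number of non-overlapping patches in a square image (assumed square)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def random_mask(num_patches: int, mask_ratio: float, seed: int = 0):
    """Split patch indices into (visible, masked) by uniform random sampling."""
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    random.Random(seed).shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


num_patches = patch_count()                          # 64 * 64 patches
visible_75, masked_75 = random_mask(num_patches, 0.75)
visible_95, masked_95 = random_mask(num_patches, 0.95)

print(num_patches, len(visible_75), len(visible_95))
```

Under these assumptions each image yields 4,096 patches, of which the encoder sees 1,024 at a 75 percent mask ratio and only 205 at 95 percent.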
The pretraining corpus, called Humans-300M, was built by starting from roughly one billion in-the-wild images and filtering aggressively for humans. The authors discarded images with watermarks, text, or unnatural elements, then ran a person detector and kept only images with a detection score above 0.9 and bounding-box dimensions exceeding 300 pixels. The result is about 300 million diverse human images, of which over 248 million contain multiple subjects. Pretraining used no human annotations, only the raw images. Each model was trained on 1.2 trillion tokens; the largest variant was trained on 1024 NVIDIA A100 GPUs for roughly 18 days. [1]
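The detector-based filtering step can be expressed as a simple predicate over per-image detections. This is a hedged sketch only: the `Detection` record and `keep_image` function are hypothetical, and reading "dimensions" as both width and height exceeding 300 pixels is an assumption rather than something the description above confirms.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Detection:
    """One person detection (hypothetical record, not a Sapiens type)."""
    score: float   # detector confidence in [0, 1]
    width: float   # bounding-box width in pixels
    height: float  # bounding-box height in pixels


def keep_image(
    detections: Iterable[Detection],
    min_score: float = 0.9,
    min_side: float = 300.0,
) -> bool:
    """Keep an image if any detection is confident and large enough.

    Assumption: both box sides must exceed min_side.
    """
    return any(
        d.score > min_score and d.width > min_side and d.height > min_side
        for d in detections
    )


# Toy inputs for illustration only.
large_confident = [Detection(score=0.95, width=400, height=800)]
too_small = [Detection(score=0.95, width=200, height=800)]
print(keep_image(large_confident), keep_image(too_small))
```

The watermark, text, and unnatural-content filters described above would run before this step and are not modeled here.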
For downstream tasks, the pretrained encoder is paired with a lightweight, task-specific decoder head (built from deconvolution and convolution layers) that is initialized randomly. The encoder and decoder are then finetuned end-to-end on labeled data for each task. [1]
Sapiens is released in four sizes spanning roughly 0.3 billion to 2 billion parameters. All variants are plain Vision Transformers with a 16x16 patch size operating at 1024-pixel resolution; they differ in depth (number of transformer layers), width (hidden dimension), and head count.
| Model | Parameters | Layers | Hidden size | Heads | FLOPs |
|---|---|---|---|---|---|
| Sapiens-0.3B | 0.336 B | 24 | 1024 | 16 | 1.242 T |
| Sapiens-0.6B | 0.664 B | 32 | 1280 | 16 | 2.583 T |
| Sapiens-1B | 1.169 B | 40 | 1536 | 24 | 4.647 T |
| Sapiens-2B | 2.163 B | 48 | 1920 | 32 | 8.709 T |
A consistent finding across the paper is that accuracy improves smoothly as the model is scaled up across all four tasks, which the authors present as evidence that human-centric perception benefits from the same scaling behavior seen in other foundation models. [1][5]
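The parameter counts in the table can be sanity-checked against the layer and width figures. The sketch below uses the common rule of thumb that a transformer block holds about 12 x hidden² weights (4 x hidden² for attention, 8 x hidden² for an MLP with expansion ratio 4). The expansion ratio is an assumption, and biases, layer norms, and patch and position embeddings are ignored, so the estimates are expected to fall somewhat below the reported totals.

```python
# (layers, hidden size, heads, reported parameters in billions) from the table.
CONFIGS = {
    "Sapiens-0.3B": (24, 1024, 16, 0.336),
    "Sapiens-0.6B": (32, 1280, 16, 0.664),
    "Sapiens-1B": (40, 1536, 24, 1.169),
    "Sapiens-2B": (48, 1920, 32, 2.163),
}


def approx_block_params(layers: int, hidden: int) -> int:
    """Rule-of-thumb weight count for a plain ViT encoder (assumes MLP ratio 4)."""
    return 12 * layers * hidden * hidden


estimates = {}
for name, (layers, hidden, heads, reported_b) in CONFIGS.items():
    est_b = approx_block_params(layers, hidden) / 1e9
    head_dim = hidden // heads  # per-head channel width
    estimates[name] = (est_b, head_dim)
    print(f"{name}: ~{est_b:.3f} B estimated vs {reported_b} B reported, "
          f"head dim {head_dim}")
```

The estimates land within roughly 10 percent of the reported figures, which is consistent with the variants being plain Vision Transformers that differ only in depth, width, and head count.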
The paper evaluates against task-specific state-of-the-art baselines and reports sizeable gains, driven by the largest Sapiens-2B model. Improvements quoted below are relative to the prior best methods. [1]
| Task | Benchmark | Metric | Sapiens-2B result | Improvement over prior SOTA |
|---|---|---|---|---|
| Pose estimation | Humans-5K | AP | 61.1 AP | +7.6 mAP |
| Body-part segmentation | Humans-2K | mIoU | 81.2 mIoU (89.4 mAcc) | +17.1 mIoU |
| Depth estimation | Hi4D | RMSE | 0.114 RMSE | 22.4% relative |
| Surface normals | THuman2.0 | Mean angular error | ~11.8 degrees | 53.5% relative |
On surface normals, Sapiens-2B also reports about 12.14 degrees mean angular error on the multi-person Hi4D benchmark. For depth on the single-person THuman2.0 splits, reported RMSE values are roughly 0.008 (face), 0.010 (upper body), and 0.016 (full body). [1]
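For readers unfamiliar with the metrics quoted above, the following sketch shows how RMSE, mean angular error, and a relative improvement are conventionally computed. It uses the generic textbook definitions, not the paper's evaluation code; any restriction to human pixels or depth normalization used in the actual benchmarks is not modeled, and all inputs are toy values.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def rmse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Root-mean-square error over paired depth values."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and equal in length")
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def mean_angular_error_deg(pred: Sequence[Vec3], target: Sequence[Vec3]) -> float:
    """Mean angle in degrees between paired normal vectors."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and equal in length")
    total = 0.0
    for p, t in zip(pred, target):
        dot = sum(a * b for a, b in zip(p, t))
        norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in t))
        cos = max(-1.0, min(1.0, dot / norms))  # clamp against rounding error
        total += math.degrees(math.acos(cos))
    return total / len(pred)


def relative_reduction(baseline: float, new: float) -> float:
    """Fractional error reduction, e.g. 0.25 means 25 percent lower error."""
    return (baseline - new) / baseline


# Toy values for illustration only, not benchmark data.
print(rmse([0.0, 0.0], [3.0, 4.0]))
print(mean_angular_error_deg([(1.0, 0.0, 0.0)], [(0.0, 1.0, 0.0)]))
print(relative_reduction(0.20, 0.15))
```

Both error metrics are lower-is-better, which is why the depth and normal rows of the table report relative reductions rather than absolute point gains.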
Beyond the headline numbers, the paper emphasizes generalization: because pretraining covers a very large and varied set of real-world human images, the models perform well on unconstrained photos with occlusion, crowds, and unusual poses, conditions where models trained only on smaller labeled datasets tend to degrade. [1][6]
Meta released the Sapiens code on GitHub (facebookresearch/sapiens) and published pretrained encoders plus finetuned task checkpoints for all four sizes on Hugging Face. The release includes an RTMDet-based person detector used in the pose pipeline, and a "Sapiens-lite" inference path for deployment. [3][5][7]
The model weights are distributed under the Creative Commons Attribution-NonCommercial 4.0 license (CC BY-NC 4.0), which permits use, sharing, and modification with attribution but prohibits commercial use; the accompanying code is provided under permissive open-source terms. [2][8]
Sapiens drew attention as one of the first attempts to build a unified, openly released foundation model dedicated specifically to human perception rather than general scene understanding. Coverage in the computer-vision community highlighted the combination of high native resolution, the human-only pretraining corpus, and consistent improvements across four distinct tasks from a single backbone. [6][9]
In April 2026, Meta released a successor, Sapiens2, which extends the approach to cover surface normals, pointmaps, and albedo estimation alongside pose and segmentation, again positioned as a high-resolution human-centric vision model. [10]