Sapiens (computer vision)
Last reviewed
Sources
10 citations
Review status
Source-backed
Revision
v2 · 1,604 words
Sapiens is a family of human-centric computer vision foundation models developed by Meta (Reality Labs), introduced in 2024 and presented as an oral paper at the European Conference on Computer Vision (ECCV) 2024. Sapiens models are Vision Transformers that target four fundamental human-centric vision tasks defined over images of people: 2D human pose estimation, body-part segmentation, depth estimation, and surface-normal prediction. The models are pretrained by masked autoencoding on a curated corpus of over 300 million in-the-wild human images, scale from roughly 0.3 billion up to 2 billion parameters, and run natively at 1K (1024-pixel) high resolution. [1][2][3]
The paper summarizes the approach in one sentence: "We present Sapiens, a family of models for four fundamental human-centric vision tasks -- 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction." [1] The project page describes the result more plainly as "high-resolution vision transformers pretrained on 300 million human images." [3]
Sapiens is a single, openly released Vision Transformer backbone, pretrained once on a very large human-only image corpus, that can be finetuned with lightweight task heads to perform four distinct human-centric perception tasks at high resolution. Rather than maintaining a separate specialized model for each task, Sapiens shows that one scaled, human-pretrained encoder transfers across all four and generalizes to unconstrained real-world photos. The accompanying paper, "Sapiens: Foundation for Human Vision Models," was authored by Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Zhaoen Su, Austin James, Peter Selednik, Stuart Anderson, and Shunsuke Saito. It was posted to arXiv on 22 August 2024 and published in the ECCV 2024 proceedings, with code and model weights released openly on GitHub and Hugging Face. [1][4][5]
Human-centric perception underpins applications such as augmented and virtual reality, telepresence avatars, motion capture, and photography. Historically each of the four tasks above had its own specialized models, datasets, and architectures, which made systems hard to maintain and limited generalization to unconstrained imagery. Sapiens takes the opposite stance: a single backbone, pretrained once on a very large collection of human images, is finetuned with lightweight task-specific heads. The central bet is that scaling both the model and a human-only pretraining corpus, while keeping inference resolution high, produces representations that transfer across all four tasks and generalize to in-the-wild photos. [1][6]
The design follows the broader foundation model recipe: self-supervised pretraining on a large unlabeled corpus, followed by supervised finetuning for downstream tasks. Sapiens specializes that recipe to humans by curating its pretraining data exclusively from human images and by operating at a resolution high enough to capture fine structure such as fingers and facial detail. As the abstract puts it, "given the same computational budget, self-supervised pretraining on a curated dataset of human images significantly boosts the performance for a diverse set of human-centric tasks." [1]
Sapiens performs the four fundamental human-centric vision tasks listed below. Each is handled by the shared pretrained encoder plus a lightweight task-specific head.
| Task | Output | Evaluation benchmark |
|---|---|---|
| 2D pose estimation | Keypoint coordinates (up to 308 whole-body keypoints in training) | Humans-5K (114 common keypoints) |
| Body-part segmentation | Per-pixel class label (28-class body-part scheme) | Humans-2K |
| Depth estimation | Per-pixel metric/relative depth | THuman2.0, Hi4D |
| Surface-normal prediction | Per-pixel 3D normal vector | THuman2.0, Hi4D |
For pose estimation, the released models support several keypoint vocabularies: a 17-point COCO body skeleton, the 133-point COCO-WholeBody format (body, face, hands, and feet), and a dense 308-point set with detailed facial and hand landmarks. The dense vocabulary is what makes Sapiens useful for high-fidelity face and hand tracking, not just coarse body pose. [7]
Sapiens is pretrained with masked autoencoding (MAE), a self-supervised objective in which a large fraction of image patches are masked and the model learns to reconstruct the missing content. The standard configuration masks 75 percent of patches, and the paper reports that performance held up even at mask ratios as high as 95 percent, reflecting the redundancy of human imagery. Images are split into 16x16 pixel patches at a 1024-pixel input resolution. [1]
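The patch and masking arithmetic described above can be made concrete with a small sketch. It assumes square 1024x1024 inputs and uniform random masking, as in a standard masked autoencoder; the function names are illustrative and are not taken from the Sapiens codebase.

```python
import random


def patch_grid(image_size: int = 1024, patch_size: int = 16) -> int:
    """Return the number of patches per side for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return image_size // patch_size


def random_mask(num_patches: int, mask_ratio: float, seed: int = 0):
    """Split patch indices into (visible, masked) lists, MAE-style.

    Uniform random masking is an assumption here; the encoder would
    see only the visible patches and reconstruct the masked ones.
    """
    num_masked = int(num_patches * mask_ratio)
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


side = patch_grid()              # 1024 / 16 = 64 patches per side
num_patches = side * side        # 64 * 64 = 4096 patches per image
visible, masked = random_mask(num_patches, mask_ratio=0.75)
print(side, num_patches, len(visible), len(masked))  # 64 4096 1024 3072
```

At a 75 percent mask ratio the encoder sees 1,024 of the 4,096 patches. At 95 percent the same function leaves only 205 visible patches, which gives a sense of how little context the reconstruction has to work from.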
The pretraining corpus, called Humans-300M, was built by starting from roughly one billion in-the-wild images and filtering aggressively for humans. The authors discarded images with watermarks, text, or unnatural elements, then ran a person detector and kept only images with a detection score above 0.9 and bounding-box dimensions exceeding 300 pixels. The result is about 300 million diverse human images, of which over 248 million contain multiple subjects. Pretraining used no human annotations, only the raw images. Each model was trained on 1.2 trillion tokens; the largest variant was trained on 1024 NVIDIA A100 GPUs for roughly 18 days. [1]
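The detector-based filtering step can be sketched as a simple predicate. The two thresholds come from the description above; the `Detection` type, the requirement that both width and height exceed the minimum, and the rule that one qualifying person is enough to keep an image are assumptions made for illustration, not details confirmed by the paper.

```python
from dataclasses import dataclass
from typing import Iterable

MIN_SCORE = 0.9      # detection score threshold from the text
MIN_SIDE_PX = 300    # bounding-box size threshold from the text


@dataclass
class Detection:
    """Hypothetical person-detector output for one bounding box."""
    score: float
    width: float
    height: float


def keep_image(detections: Iterable[Detection],
               min_score: float = MIN_SCORE,
               min_side: float = MIN_SIDE_PX) -> bool:
    """Keep an image if at least one person detection passes both tests.

    Assumption: both width and height must exceed min_side.
    """
    return any(
        d.score > min_score and d.width > min_side and d.height > min_side
        for d in detections
    )


print(keep_image([Detection(0.95, 400, 500)]))   # True
print(keep_image([Detection(0.95, 250, 500)]))   # False: box too narrow
print(keep_image([Detection(0.85, 400, 500)]))   # False: score too low
```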
For downstream tasks, the pretrained encoder is paired with a lightweight, task-specific decoder head (built from deconvolution and convolution layers) that is initialized randomly. The encoder and decoder are then finetuned end-to-end on labeled data for each task. [1]
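To illustrate how a head built from deconvolution and convolution layers turns the encoder's token grid back into a dense map, the sketch below tracks only spatial sizes. The layer count and the kernel, stride, and padding values are hypothetical choices for illustration; the source does not specify them.

```python
def deconv_out(size: int, kernel: int = 4, stride: int = 2,
               padding: int = 1, output_padding: int = 0) -> int:
    """Output side length of a transposed convolution (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def conv_out(size: int, kernel: int = 3, stride: int = 1,
             padding: int = 1) -> int:
    """Output side length of an ordinary convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def head_output_size(token_grid: int, num_deconv: int = 2) -> int:
    """Spatial size after a hypothetical stack of deconvs plus one conv."""
    size = token_grid
    for _ in range(num_deconv):
        size = deconv_out(size)      # each deconv doubles the resolution
    return conv_out(size)            # the final conv preserves it


token_grid = 1024 // 16              # 64x64 tokens from the encoder
print(head_output_size(token_grid))  # 256
```

With these assumed settings, two stride-2 deconvolutions raise the 64x64 token grid to 256x256; a real head could add more upsampling stages or interpolate to reach the full input resolution.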
Sapiens is released in four sizes spanning roughly 0.3 billion to 2 billion parameters. All variants are plain Vision Transformers with a 16x16 patch size operating at 1024-pixel resolution; they differ in depth (number of transformer layers), width (hidden dimension), and head count.
| Model | Parameters | Layers | Hidden size | Heads | FLOPs |
|---|---|---|---|---|---|
| Sapiens-0.3B | 0.336 B | 24 | 1024 | 16 | 1.242 T |
| Sapiens-0.6B | 0.664 B | 32 | 1280 | 16 | 2.583 T |
| Sapiens-1B | 1.169 B | 40 | 1536 | 24 | 4.647 T |
| Sapiens-2B | 2.163 B | 48 | 1920 | 32 | 8.709 T |
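As a rough sanity check on the table, the parameter counts can be approximated from depth and width alone. The sketch assumes a standard Vision Transformer block with an MLP expansion ratio of 4, which the table does not state, and it ignores biases, normalization layers, and patch and position embeddings.

```python
# (name, layers, hidden size, reported parameters in billions), from the table
VARIANTS = [
    ("Sapiens-0.3B", 24, 1024, 0.336),
    ("Sapiens-0.6B", 32, 1280, 0.664),
    ("Sapiens-1B", 40, 1536, 1.169),
    ("Sapiens-2B", 48, 1920, 2.163),
]


def approx_block_params(layers: int, hidden: int, mlp_ratio: int = 4) -> int:
    """Weight-matrix parameters in the transformer blocks only.

    Attention: 4 * hidden^2 (query, key, value, output projections).
    MLP: 2 * mlp_ratio * hidden^2 (two linear layers).
    mlp_ratio = 4 is an assumption, not a value given in the table.
    """
    per_layer = 4 * hidden * hidden + 2 * mlp_ratio * hidden * hidden
    return layers * per_layer


for name, layers, hidden, reported in VARIANTS:
    est = approx_block_params(layers, hidden) / 1e9
    print(f"{name}: estimated {est:.3f} B, reported {reported:.3f} B")
```

Under these assumptions the estimates come out at about 0.302, 0.629, 1.132, and 2.123 billion, between roughly 2 and 10 percent below the reported figures. The shortfall is plausibly explained by the components the estimate leaves out.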
A consistent finding across the paper is that accuracy improves smoothly as the model is scaled up across all four tasks. In the authors' words, the "simple model design also brings scalability -- model performance across tasks improves as we scale the number of parameters from 0.3 to 2 billion," which they present as evidence that human-centric perception benefits from the same scaling behavior seen in other foundation models. [1][5]
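The paper evaluates against task-specific state-of-the-art baselines and reports sizeable gains, with the headline numbers coming from the largest model, Sapiens-2B. In the table below, the pose and segmentation improvements are absolute gains in metric points, while the depth and surface-normal improvements are relative reductions in error against the prior best methods. [1]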
| Task | Benchmark | Metric | Sapiens-2B result | Improvement over prior SOTA |
|---|---|---|---|---|
| Pose estimation | Humans-5K | AP | 61.1 AP | +7.6 mAP |
| Body-part segmentation | Humans-2K | mIoU | 81.2 mIoU (89.4 mAcc) | +17.1 mIoU |
| Depth estimation | Hi4D | RMSE | 0.114 RMSE | 22.4% relative |
| Surface normals | THuman2.0 | Mean angular error | ~11.8 degrees | 53.5% relative |
On surface normals, Sapiens-2B also reports about 12.14 degrees mean angular error on the multi-person Hi4D benchmark. For depth on the single-person THuman2.0 splits, reported RMSE values are roughly 0.008 (face), 0.010 (upper body), and 0.016 (full body). [1]
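For readers unfamiliar with the error metrics quoted above, the following sketch shows one common way to compute them. It operates on plain lists of values and omits benchmark-specific details such as restricting evaluation to human pixels; the `relative_improvement` formula is an assumption about how the "relative" percentages are defined.

```python
import math


def angular_error_deg(pred, target) -> float:
    """Angle in degrees between two 3D normal vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos = max(-1.0, min(1.0, dot / (norm_p * norm_t)))
    return math.degrees(math.acos(cos))


def mean_angular_error_deg(preds, targets) -> float:
    """Average per-pixel angular error over paired normal vectors."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)


def rmse(preds, targets) -> float:
    """Root-mean-square error over paired depth values."""
    squared = [(p - t) ** 2 for p, t in zip(preds, targets)]
    return math.sqrt(sum(squared) / len(squared))


def relative_improvement(baseline: float, new: float) -> float:
    """Fractional reduction of an error metric (lower is better)."""
    return (baseline - new) / baseline


print(angular_error_deg((1, 0, 0), (0, 1, 0)))   # perpendicular normals
print(rmse([1.0, 2.0], [1.0, 4.0]))
print(relative_improvement(2.0, 1.0))            # 0.5, i.e. 50% relative
```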
Beyond the headline numbers, the paper emphasizes generalization: because pretraining covers a very large and varied set of real-world human images, the models perform well on unconstrained photos with occlusion, crowds, and unusual poses, conditions where models trained only on smaller labeled datasets tend to degrade. The abstract notes that "the resulting models exhibit remarkable generalization to in-the-wild data, even when labeled data is scarce or entirely synthetic." [1][6]
Meta released the Sapiens code on GitHub (facebookresearch/sapiens) and published pretrained encoders plus finetuned task checkpoints for all four sizes on Hugging Face. The release includes a person detector (RTMDet-based) used in the pose pipeline, and a "Sapiens-lite" inference path for deployment. [3][5][7]
The model weights are distributed under the Creative Commons Attribution-NonCommercial 4.0 license (CC BY-NC 4.0), which permits use, sharing, and modification with attribution but prohibits commercial use; the accompanying code is provided under permissive open-source terms. [2][8]
Sapiens drew attention as one of the first attempts to build a unified, openly released foundation model dedicated specifically to human perception rather than general scene understanding. Coverage in the computer-vision community highlighted the combination of high native resolution, the human-only pretraining corpus, and consistent improvements across four distinct tasks from a single backbone. [6][9]
In April 2026, Meta released a successor, Sapiens2, again positioned as a high-resolution human-centric vision model. It extends the approach with outputs such as pointmaps and albedo estimation alongside pose, segmentation, and surface normals. [10]