Perception Encoder
Last reviewed
Jun 3, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 1,519 words
Perception Encoder (PE) is a family of vision and vision-language encoders developed by Meta AI's Fundamental AI Research (FAIR) group and released in April 2025. It is described in the paper "Perception Encoder: The best visual embeddings are not at the output of the network" by Daniel Bolya, Po-Yao Huang, Christoph Feichtenhofer, Piotr Dollár, and colleagues.[1] The work argues that strong, general-purpose visual representations can be produced by a single network trained only with a contrastive image- and video-text objective, and that the most useful features for downstream tasks often sit in the network's intermediate layers rather than at its final output. PE was released together with Meta's Perception Language Model (PLM) and several other artifacts from the "Perception" line of research.[2]
Vision encoders such as OpenAI's CLIP are typically trained by aligning images with text captions in a shared embedding space, which makes them good at zero-shot classification and image-text retrieval. A separate line of work, exemplified by Meta's DINOv2, uses self-supervised objectives to learn features that transfer well to dense prediction tasks like segmentation and depth estimation. A common assumption was that no single pretraining recipe could be best for both kinds of task, so practitioners often combined or specialized encoders. Perception Encoder challenges this assumption by showing that one contrastively trained vision transformer can match specialized models across classification, retrieval, multimodal language modeling, and dense spatial tasks, provided its internal features are surfaced correctly.[1]
The paper's title states its main claim: the best visual embeddings are not at the output of the network. Through layer-by-layer probing of a frozen PE network, the authors found that different network depths specialize in different task families, and that the strongest general features for downstream use are concentrated in intermediate layers rather than the final contrastive output layer.[1][3] For the largest model, the most useful features for language tasks were drawn from around layer 47 of the 50-layer vision transformer rather than from the last layer.[4]
The authors attribute this to the robust pretraining techniques used (progressive resolution scheduling, heavy augmentation, and regularization), which push general-purpose features deeper into the network while the final layer becomes more specialized for the contrastive caption-matching objective. To make these hidden features usable, PE introduces two lightweight alignment procedures applied on top of the same contrastively pretrained backbone: language alignment, which produces PE-Lang for multimodal language modeling, and spatial alignment, which produces PE-Spatial for dense prediction.[1][3]
The broader argument is that a single contrastive pretraining pipeline, plus task-appropriate alignment, can replace several separately trained, task-specific encoders.[3]
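The layer-by-layer probing idea can be illustrated with a toy sketch. It assumes features have already been extracted from every layer of a frozen encoder, and it uses a nearest-centroid classifier as a stand-in probe; the paper's actual probing protocol is not reproduced here, and the function names are hypothetical.

```python
def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Score one layer's features with a simple nearest-centroid probe."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, v in enumerate(x):
            acc[d] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}

    def predict(x):
        # Assign the class whose centroid is closest in squared Euclidean distance.
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    correct = sum(predict(x) == y for x, y in zip(test_x, test_y))
    return correct / len(test_y)


def best_layer(train_feats_by_layer, train_y, test_feats_by_layer, test_y):
    """Probe every layer and return (index of best layer, per-layer scores)."""
    scores = [
        nearest_centroid_accuracy(tr, train_y, te, test_y)
        for tr, te in zip(train_feats_by_layer, test_feats_by_layer)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy data (made up): only the middle layer separates the two classes.
train_y, test_y = [0, 0, 1, 1], [0, 1]
train_feats = [
    [[0.0], [1.0], [0.0], [1.0]],  # layer 0: uninformative
    [[0.0], [0.1], [1.0], [0.9]],  # layer 1: separable
    [[0.5], [0.5], [0.5], [0.5]],  # layer 2: collapsed
]
test_feats = [[[0.0], [1.0]], [[0.0], [1.0]], [[0.5], [0.5]]]
print(best_layer(train_feats, train_y, test_feats, test_y))
```

In this toy setup the probe selects the intermediate layer, mirroring the kind of comparison the paper's title refers to, though at a vastly smaller scale.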
The base encoder, called PE-Core, is trained with an enhanced contrastive image-text objective rather than any task-specific loss. The paper describes a "robust" image pretraining recipe scaled to roughly 1 ZFLOP of compute over about 2.3 billion image-text pairs, with several refinements over a vanilla CLIP recipe, among them progressive resolution scheduling, heavy augmentation, and regularization.[1][4]
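For orientation, the vanilla CLIP-style objective that such a recipe starts from can be sketched as a symmetric cross-entropy over an image-text similarity matrix. This is a generic illustration, not PE-Core's actual training code; the temperature value and function names are assumptions chosen for the example.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-text contrastive loss for a batch of matched pairs.

    Pair i of image_embs and text_embs is the positive; every other
    combination in the batch acts as a negative. The temperature is an
    illustrative value, not one taken from the PE paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and caption j, scaled by 1/temperature.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: row i should select column i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should select row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


images = [[1.0, 0.0], [0.0, 1.0]]
matched = [[0.9, 0.1], [0.1, 0.9]]
swapped = [matched[1], matched[0]]
print(contrastive_loss(images, matched) < contrastive_loss(images, swapped))  # prints True
```

Correctly paired captions give a much lower loss than swapped ones, which is the signal that pulls matching images and captions together in the shared embedding space.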
To extend the encoder to video, the authors built a video data engine that produced roughly 22 million synthetic video captions, and they trained on video by simply averaging features from a handful of uniformly sampled frames (eight in the reported setup). They report that adding video data improved both video and image performance.[1] As part of the release Meta also published the PE Video Dataset (PVD), a collection of about one million videos with roughly 120,000 human-refined captions, totaling several thousand hours of footage.[2][4]
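The frame-averaging approach described above can be sketched as follows. The source only states that frames are uniformly sampled and their features averaged; taking the centre of each equal-length segment is one common way to do that and is an assumption here, as are the function names and the `encode_frame` callback.

```python
def uniform_frame_indices(num_frames, num_samples=8):
    """Pick num_samples indices spread evenly over a clip of num_frames frames.

    Uses the centre of each equal-length segment (an assumed scheme).
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    segment = num_frames / num_samples
    return [min(num_frames - 1, int(segment * (k + 0.5))) for k in range(num_samples)]


def mean_pool(frame_embeddings):
    """Average per-frame embeddings into a single video embedding."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(f[d] for f in frame_embeddings) / n for d in range(dim)]


def encode_video(frames, encode_frame, num_samples=8):
    """Embed a video by averaging image-encoder features of sampled frames.

    encode_frame is a hypothetical callable mapping one frame to an embedding.
    """
    indices = uniform_frame_indices(len(frames), num_samples)
    return mean_pool([encode_frame(frames[i]) for i in indices])


print(uniform_frame_indices(80))           # [5, 15, 25, 35, 45, 55, 65, 75]
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```

Because the video embedding is just a mean of image embeddings, the same image encoder can be reused for video without any temporal modules.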
PE-Lang is produced from PE-Core by language alignment: intermediate features are warmed up and then fine-tuned on a large mix of image and video samples so they can drive a language decoder. PE-Spatial is produced by spatial alignment, which takes the strong spatial features already present in PE-Core's intermediate layers and aligns them to the output using a frozen-teacher self-distillation loss, further refined with a mask-based strategy derived from SAM 2.[3][5]
PE is released in three scales (B, L, and G) and in three functional variants (Core, Lang, and Spatial), under an Apache 2.0 license, with weights on Hugging Face and code in the facebookresearch/perception_models repository.[5][6] The three scales of the contrastive backbone are summarized below.
| Scale | Vision params | ViT width | Depth | Typical config |
|---|---|---|---|---|
| PE-Core B | 0.09B | 768 | 12 | B/16, 224px |
| PE-Core L | 0.32B | 1024 | 24 | L/14, 336px |
| PE-Core G | 1.88B | 1536 | 50 | G/14, 448px |
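As a quick worked calculation, the "Typical config" column determines how many patch tokens each model processes per image. The sketch below assumes square inputs split into non-overlapping square patches and ignores any extra tokens such as a class token.

```python
def patch_tokens(image_size, patch_size):
    """Number of patch tokens for a square image and square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of patch size")
    side = image_size // patch_size
    return side * side


# (resolution, patch size) pairs taken from the table above.
configs = {
    "PE-Core B/16": (224, 16),
    "PE-Core L/14": (336, 14),
    "PE-Core G/14": (448, 14),
}
for name, (resolution, patch) in configs.items():
    print(name, patch_tokens(resolution, patch))
# PE-Core B/16 196
# PE-Core L/14 576
# PE-Core G/14 1024
```

The larger models therefore see roughly three to five times as many tokens per image as the B/16 model, in addition to having more parameters.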
The functional variants differ by how the backbone is used:
| Variant | Purpose | Notable configurations |
|---|---|---|
| PE-Core | Zero-shot classification and image/video-text retrieval | T/16, S/16, B/16, L/14, G/14 |
| PE-Lang | Vision encoder for multimodal language models | L/14 448px, G/14 448px |
| PE-Spatial | Dense prediction (detection, segmentation, depth, tracking) | G/14 448px, plus distilled B/L variants |
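To illustrate how a contrastive encoder such as PE-Core is typically used for zero-shot classification, the sketch below compares one image embedding against text embeddings of class prompts and picks the most similar. The embeddings are made-up toy vectors and the function names are hypothetical; in practice they would come from the model's image and text towers.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, class_text_embs):
    """Return (best label, all scores) by cosine similarity to class prompts.

    class_text_embs maps each label to the text embedding of a prompt
    such as "a photo of a cat" (the prompt wording is an assumption).
    """
    scores = {label: cosine(image_emb, emb) for label, emb in class_text_embs.items()}
    return max(scores, key=scores.get), scores


# Toy embeddings standing in for real encoder outputs.
prompts = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
label, scores = zero_shot_classify([0.9, 0.1], prompts)
print(label)  # cat
```

No classifier is trained for the target classes; adding a new class only requires embedding a new text prompt.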
In December 2025 Meta added an audio-visual variant, PE-AV, which extends the family to audio inputs and serves as an encoder for related projects; it is a later addition and not part of the original April 2025 paper.[7]
Meta reported state-of-the-art or best-in-class results across a broad suite of benchmarks, drawn from PE-Core, PE-Lang (via PLM), and PE-Spatial. Representative numbers for the G-scale models are below.[1][4][6]
| Task | Benchmark | Result |
|---|---|---|
| Zero-shot image classification | ImageNet-1k val | 85.4% top-1 |
| Zero-shot robustness (PE-Core G) | ImageNet average robustness | 86.6% |
| Zero-shot image classification | ImageNet-Adversarial | 92.6% |
| Zero-shot image classification | ObjectNet | 88.2% |
| Zero-shot video classification | Kinetics-400 | 76.9% |
| Object detection (PE-Spatial G) | COCO box mAP | 66.0 |
| Document QA (via PLM-8B) | DocVQA | 94.6% |
| Chart QA (via PLM-8B) | ChartQA | 85.5% |
The 86.6% average across ImageNet robustness benchmarks is highlighted as the first time an open contrastive model of this kind outperformed encoders trained on large proprietary datasets such as JFT-3B and WebLI on that aggregate measure.[1] The 66.0 box mAP on COCO detection is reported as a new absolute state of the art at the time, reached with a comparatively simple detection decoder on top of PE-Spatial rather than a heavily engineered, detection-specific backbone.[3][5] The document- and chart-understanding numbers come from PLM-8B, which uses a PE encoder, and serve as evidence that the aligned features transfer to multimodal reasoning.[4]
Perception Encoder was released alongside the Perception Language Model, an open and reproducible vision-language model that uses a PE encoder as its visual front end. PLM pairs a language-aligned PE encoder with Llama 3 decoders at 1B, 3B, and 8B parameters, connected by a two-layer MLP projector, and supports high-resolution image tiling and multi-frame video input. Notably, PLM was trained on synthetic and human-labeled data without distillation from external proprietary models.[2][4] Meta also released a large set of human-labeled video question-answering and spatio-temporal caption samples and a video benchmark, PLM-VideoBench, to accompany it.[2]
PE and PLM were two of five releases announced by Meta FAIR on April 17, 2025; the others were Meta Locate 3D (3D object localization from natural-language queries), the Dynamic Byte Latent Transformer, and the Collaborative Reasoner framework.[2]
The paper was accepted to NeurIPS 2025.[8] Coverage framed PE as evidence that a single, simply trained encoder can serve as a general visual backbone, and that careful data, training, and alignment choices matter more than task-specific pretraining objectives for producing transferable features.[9] Because the models and the PE Video Dataset were released openly, PE has been used as a drop-in vision encoder in subsequent open multimodal systems.