Perception Encoder
Perception Encoder (PE) is a family of vision and vision-language encoders from Meta AI's Fundamental AI Research (FAIR) group, released in April 2025, whose central finding is that the strongest general-purpose visual embeddings sit in a network's intermediate layers rather than at its final output.[1] A single vision transformer trained only with a contrastive image- and video-text objective reaches state-of-the-art results on classification, retrieval, multimodal language modeling, and dense spatial tasks once two lightweight alignment methods (language alignment and spatial alignment) are used to surface those internal features.[1] PE is described in the paper "Perception Encoder: The best visual embeddings are not at the output of the network" by Daniel Bolya, Po-Yao Huang, Peize Sun, Piotr Dollár, Christoph Feichtenhofer, and colleagues, and was released openly together with Meta's Perception Language Model (PLM).[1][2]
Perception Encoder is Meta FAIR's open vision and vision-language encoder family, first published on arXiv on April 17, 2025 (arXiv:2504.13181).[1] It is built around a single contrastively trained vision transformer, scaled to roughly 2 billion parameters at the largest size, that produces general-purpose visual features for both images and video. Rather than training separate, task-specialized encoders, PE shows that one backbone plus task-appropriate alignment can serve classification, retrieval, multimodal large language models, and dense prediction at once.[1][3] PE was released under an Apache 2.0 license with weights on Hugging Face and code in the facebookresearch/perception_models repository.[5][6]
Vision encoders such as OpenAI's CLIP are typically trained by aligning images with text captions in a shared embedding space, which makes them good at zero-shot classification and image-text retrieval. A separate line of work, exemplified by Meta's DINOv2, uses self-supervised objectives to learn features that transfer well to dense prediction tasks like segmentation and depth estimation. A common assumption was that no single pretraining recipe could be best for both kinds of task, so practitioners often combined or specialized encoders. Perception Encoder challenges this assumption by showing that one contrastively trained vision transformer can match specialized models across classification, retrieval, multimodal language modeling, and dense spatial tasks, provided its internal features are surfaced correctly.[1]
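The CLIP-style alignment of images and captions in a shared embedding space is usually implemented as a symmetric contrastive loss over a batch of matched pairs. The following is a minimal standard-library sketch of that general idea, not PE's training code; the function names and the `temperature=0.07` default are assumptions chosen for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against a target index."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-text contrastive loss; pair i of each list is a match."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j]: cosine similarity of image i and text j, divided by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text: row i should select text i.
    i2t = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text-to-image: column j should select image j.
    t2i = sum(
        cross_entropy_row([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (i2t + t2i) / 2
```

The loss is small when each image is most similar to its own caption and grows when captions are shuffled, which is what pulls matching pairs together in the shared space.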
The paper's title states its main claim: the best visual embeddings are not at the output of the network. As the authors put it, "Contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network."[1] Through layer-by-layer probing of a frozen PE network, they found that different network depths specialize in different task families, and that the strongest general features for downstream use are concentrated in intermediate layers rather than the final contrastive output layer.[1][3] For the largest model, the most useful features for language tasks were drawn from around layer 47 of the 50-layer vision transformer rather than from the last layer.[4]
The authors attribute this to the robust pretraining techniques used (progressive resolution scheduling, heavy augmentation, and regularization), which push general-purpose features deeper into the network while the final layer becomes more specialized for the contrastive caption-matching objective. To make these hidden features usable, PE introduces two lightweight alignment procedures applied on top of the same contrastively pretrained backbone: language alignment, which produces PE-Lang for multimodal language modeling, and spatial alignment, which produces PE-Spatial for dense prediction.[1][3]
The broader argument is that a single contrastive pretraining pipeline, plus task-appropriate alignment, can replace several separately trained, task-specific encoders.[3]
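The layer-by-layer probing described above can be illustrated with a toy procedure: extract frozen features at each depth, fit a simple probe on them, and keep the depth that scores best. The sketch below uses a nearest-centroid probe as a stand-in; the probe type, the function names, and the data layout (a dict mapping layer index to `(feature, label)` pairs) are assumptions, not the paper's evaluation protocol.

```python
def nearest_centroid_accuracy(train, test):
    """Accuracy of a nearest-centroid probe on (feature, label) pairs."""
    sums, counts = {}, {}
    for feat, label in train:
        total = sums.setdefault(label, [0.0] * len(feat))
        for k, x in enumerate(feat):
            total[k] += x
        counts[label] = counts.get(label, 0) + 1
    centroids = {
        label: [x / counts[label] for x in total] for label, total in sums.items()
    }

    def predict(feat):
        # Pick the class whose centroid has the smallest squared distance.
        return min(
            centroids,
            key=lambda label: sum(
                (a - b) ** 2 for a, b in zip(feat, centroids[label])
            ),
        )

    correct = sum(1 for feat, label in test if predict(feat) == label)
    return correct / len(test)


def best_layer(train_by_layer, test_by_layer):
    """Return (layer, accuracy) for the depth whose frozen features probe best."""
    scores = {
        layer: nearest_centroid_accuracy(train_by_layer[layer], test_by_layer[layer])
        for layer in train_by_layer
    }
    layer = max(scores, key=scores.get)
    return layer, scores[layer]
```

Running the same probe per task family is what would reveal that different depths are best for different tasks.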
The base encoder, called PE-Core, is trained with an enhanced contrastive image-text objective rather than any task-specific loss. The paper describes a "robust" image pretraining recipe scaled to roughly 1 ZFLOP of compute over about 2.3 billion image-text pairs, with several refinements over a vanilla CLIP recipe, including the progressive resolution scheduling, heavy augmentation, and regularization noted above.[1][4]
To extend the encoder to video, the authors built a video data engine that produced roughly 22 million synthetic video captions, and they trained on video by simply averaging features from a handful of uniformly sampled frames (eight in the reported setup). They report that adding video data improved both video and image performance.[1] As part of the release Meta also published the PE Video Dataset (PVD), a collection of about one million videos with roughly 120,000 human-refined captions, totaling several thousand hours of footage.[2][4]
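The frame-averaging approach described above amounts to two small steps: pick a few evenly spaced frames, then mean-pool their embeddings into one clip embedding. This is a minimal sketch; sampling at segment midpoints and the function names are assumptions, and the per-frame embeddings would come from the image encoder.

```python
def uniform_frame_indices(num_frames, num_samples=8):
    """Pick num_samples frame indices spread evenly across a clip.

    The clip is split into equal segments and the midpoint of each is used.
    """
    if num_frames <= 0:
        raise ValueError("clip has no frames")
    step = num_frames / num_samples
    return [min(num_frames - 1, int(step * (i + 0.5))) for i in range(num_samples)]


def average_frame_features(frame_features):
    """Mean-pool per-frame embeddings into a single clip-level embedding."""
    dim = len(frame_features[0])
    count = len(frame_features)
    return [sum(f[k] for f in frame_features) / count for k in range(dim)]


print(uniform_frame_indices(80))  # [5, 15, 25, 35, 45, 55, 65, 75]
print(average_frame_features([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```

Because the pooled vector has the same dimensionality as an image embedding, the same contrastive objective can be applied to video-caption pairs without architectural changes.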
PE-Lang is produced from PE-Core by language alignment: intermediate features are warmed up and then fine-tuned on a large mix of image and video samples so they can drive a language decoder. PE-Spatial is produced by spatial alignment, which takes the strong spatial features already present in PE-Core's intermediate layers and aligns them to the output using a frozen-teacher self-distillation loss, further refined with a mask-based strategy derived from SAM 2.[3][5]
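As a rough illustration of frozen-teacher self-distillation, the sketch below penalizes the cosine distance between a student's output patch tokens and a frozen teacher's intermediate-layer tokens for the same image. The exact loss used for PE-Spatial, including the SAM 2 mask-based refinement, is not reproduced here; the per-token cosine form and all names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def distillation_loss(student_tokens, teacher_tokens):
    """Mean (1 - cosine similarity) over matching patch tokens.

    student_tokens: final-layer tokens of the model being trained.
    teacher_tokens: intermediate-layer tokens of a frozen copy (no gradients).
    """
    pairs = list(zip(student_tokens, teacher_tokens))
    return sum(1.0 - cosine(s, t) for s, t in pairs) / len(pairs)
```

The loss is zero when the student's output tokens point in the same directions as the teacher's intermediate tokens, so minimizing it moves the spatial structure found mid-network to the output.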
PE is released in three scales (B, L, and G) and in three functional variants (Core, Lang, and Spatial), under an Apache 2.0 license, with weights on Hugging Face and code in the facebookresearch/perception_models repository.[5][6] The three scales of the contrastive backbone are summarized below.
| Scale | Vision params | ViT width | Depth | Typical config |
|---|---|---|---|---|
| PE-Core B | 0.09B | 768 | 12 | B/16, 224px |
| PE-Core L | 0.32B | 1024 | 24 | L/14, 336px |
| PE-Core G | 1.88B | 1536 | 50 | G/14, 448px |
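The "Typical config" column uses the usual ViT shorthand, where the number after the slash is the patch size in pixels. Under that reading, the number of patch tokens each configuration processes per image follows from simple arithmetic, sketched here (class or pooling tokens are not counted, and the helper name is illustrative).

```python
def patch_tokens(resolution, patch_size):
    """Number of non-overlapping square patches for a square input image."""
    if resolution % patch_size:
        raise ValueError("resolution must be a multiple of the patch size")
    side = resolution // patch_size
    return side * side


configs = {
    "B/16, 224px": (224, 16),
    "L/14, 336px": (336, 14),
    "G/14, 448px": (448, 14),
}
for name, (resolution, patch) in configs.items():
    print(name, patch_tokens(resolution, patch))
# B/16, 224px 196
# L/14, 336px 576
# G/14, 448px 1024
```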
The functional variants differ by how the backbone is used:
| Variant | Purpose | Notable configurations |
|---|---|---|
| PE-Core | Zero-shot classification and image/video-text retrieval | T/16, S/16, B/16, L/14, G/14 |
| PE-Lang | Vision encoder for multimodal language models | L/14 448px, G/14 448px |
| PE-Spatial | Dense prediction (detection, segmentation, depth, tracking) | G/14 448px, plus distilled B/L variants |
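Zero-shot classification, the main use listed for PE-Core, generally works by embedding one text prompt per class and choosing the class whose text embedding is closest to the image embedding. The following is a generic sketch with toy vectors standing in for encoder outputs; the function names are illustrative and do not correspond to the released API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def zero_shot_classify(image_emb, class_text_embs):
    """Return the class name whose text embedding best matches the image."""
    return max(
        class_text_embs,
        key=lambda name: cosine(image_emb, class_text_embs[name]),
    )


# Toy embeddings; real ones would come from the image and text towers.
classes = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
print(zero_shot_classify([0.9, 0.1], classes))  # cat
```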
In December 2025 Meta added an audio-visual variant, PE-AV, which extends the family to audio inputs and serves as an encoder for related projects; it is a later addition and not part of the original April 2025 paper.[7]
Meta reported state-of-the-art or best-in-class results across a broad suite of benchmarks, drawn from PE-Core, PE-Lang (via PLM), and PE-Spatial. Representative numbers for the G-scale models are below.[1][4][6]
| Task | Benchmark | Result |
|---|---|---|
| Zero-shot image classification | ImageNet-1k val | 85.4% top-1 |
| Zero-shot robustness (PE-Core G) | ImageNet average robustness | 86.6% |
| Zero-shot image classification | ImageNet-Adversarial | 92.6% |
| Zero-shot image classification | ObjectNet | 88.2% |
| Zero-shot video classification | Kinetics-400 | 76.9% |
| Object detection (PE-Spatial G) | COCO box mAP | 66.0 |
| Document QA (via PLM-8B) | DocVQA | 94.6% |
| Chart QA (via PLM-8B) | ChartQA | 85.5% |
The 86.6% average across ImageNet robustness benchmarks is highlighted as the first time an open contrastive model of this kind outperformed encoders trained on large proprietary datasets such as JFT-3B and WebLI on that aggregate measure.[1] The 66.0 box mAP on COCO detection is reported as a new absolute state of the art at the time, reached with a comparatively simple detection decoder on top of PE-Spatial rather than a heavily engineered, detection-specific backbone.[3][5] The document- and chart-understanding numbers come from PLM-8B, which uses a PE encoder, and serve as evidence that the aligned features transfer to multimodal reasoning.[4]
Perception Encoder was released alongside the Perception Language Model, an open and reproducible vision-language model that uses a PE encoder as its visual front end. PLM pairs a language-aligned PE encoder with Llama 3 decoders at 1B, 3B, and 8B parameters, connected by a two-layer MLP projector, and supports high-resolution image tiling and multi-frame video input. Notably, PLM was trained on synthetic and human-labeled data without distillation from external proprietary models.[2][4] Meta also released a large set of human-labeled video question-answering and spatio-temporal caption samples and a video benchmark, PLM-VideoBench, to accompany it.[2]
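A two-layer MLP projector of the kind mentioned above maps each vision token into the language decoder's embedding space. The sketch below shows the shape of that computation in plain Python; the GELU activation, the random initialization, and the toy dimensions are assumptions rather than PLM's actual configuration.

```python
import math
import random


def gelu(x):
    """Gaussian Error Linear Unit (exact form)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix (out_dim x in_dim) with a zero bias."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [
        [rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)
    ]
    return weights, [0.0] * out_dim


def linear(layer, vec):
    """Apply y = W x + b to one vector."""
    weights, bias = layer
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]


def project(tokens, fc1, fc2):
    """Map each vision token to the decoder width: fc2(gelu(fc1(token)))."""
    return [linear(fc2, [gelu(h) for h in linear(fc1, tok)]) for tok in tokens]


rng = random.Random(0)
fc1 = make_linear(4, 8, rng)  # vision width 4 -> hidden width 8 (toy sizes)
fc2 = make_linear(8, 6, rng)  # hidden width 8 -> decoder width 6 (toy sizes)
projected = project([[0.1, 0.2, 0.3, 0.4]] * 3, fc1, fc2)
print(len(projected), len(projected[0]))  # 3 6
```

The projected tokens can then be concatenated with text token embeddings and fed to the decoder as ordinary sequence inputs.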
PE and PLM were two of five releases announced by Meta FAIR on April 17, 2025; the others were Meta Locate 3D (3D object localization from natural-language queries), the Dynamic Byte Latent Transformer, and the Collaborative Reasoner framework.[2]
The paper was accepted to NeurIPS 2025.[8] Coverage framed PE as evidence that a single, simply trained encoder can serve as a general visual backbone, and that careful data, training, and alignment choices matter more than task-specific pretraining objectives for producing transferable features.[9] Because the models and the PE Video Dataset were released openly, PE has been used as a drop-in vision encoder in subsequent open multimodal systems.