Perception Encoder

Updated Jun 28, 2026

Perception Encoder (PE) is a family of vision and vision-language encoders from Meta AI's Fundamental AI Research (FAIR) group, released in April 2025, whose central finding is that the strongest general-purpose visual embeddings sit in a network's intermediate layers rather than at its final output.^[1] A single vision transformer trained only with a contrastive image- and video-text objective reaches state-of-the-art results on classification, retrieval, multimodal language modeling, and dense spatial tasks, once two lightweight alignment methods (language alignment and spatial alignment) are used to surface those internal features.^[1] PE is described in the paper "Perception Encoder: The best visual embeddings are not at the output of the network" by Daniel Bolya, Po-Yao Huang, Peize Sun, Piotr Dollár, Christoph Feichtenhofer, and colleagues, and was released openly together with Meta's Perception Language Model (PLM).^[1]^[2]

What is the Perception Encoder?

Perception Encoder is Meta FAIR's open vision and vision-language encoder family, first published on arXiv on April 17, 2025 (arXiv:2504.13181).^[1] It is built around a single contrastively trained vision transformer, scaled to roughly 2 billion parameters at the largest size, that produces general-purpose visual features for both images and video. Rather than training separate, task-specialized encoders, PE shows that one backbone plus task-appropriate alignment can serve classification, retrieval, multimodal large language models, and dense prediction at once.^[1]^[3] PE was released under an Apache 2.0 license with weights on Hugging Face and code in the facebookresearch/perception_models repository.^[5]^[6]

How does Perception Encoder differ from CLIP and DINOv2?

Vision encoders such as OpenAI's CLIP are typically trained by aligning images with text captions in a shared embedding space, which makes them good at zero-shot classification and image-text retrieval. A separate line of work, exemplified by Meta's DINOv2, uses self-supervised objectives to learn features that transfer well to dense prediction tasks like segmentation and depth estimation. A common assumption was that no single pretraining recipe could be best for both kinds of task, so practitioners often combined or specialized encoders. Perception Encoder challenges this assumption by showing that one contrastively trained vision transformer can match specialized models across classification, retrieval, multimodal language modeling, and dense spatial tasks, provided its internal features are surfaced correctly.^[1]
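As a rough illustration of how a shared image-text embedding space enables zero-shot classification, the sketch below compares one image embedding against text embeddings of candidate captions using cosine similarity. The three-dimensional vectors and the caption strings are hypothetical stand-ins for real encoder outputs, and the function names are illustrative rather than part of any PE or CLIP API.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return the caption whose text embedding is closest to the image embedding."""
    image = l2_normalize(image_embedding)
    scores = {}
    for caption, text_embedding in class_text_embeddings.items():
        text = l2_normalize(text_embedding)
        scores[caption] = sum(i * t for i, t in zip(image, text))
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical embeddings; a real system would obtain these from the
# image and text towers of a contrastively trained model.
text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

label, scores = zero_shot_classify(image_embedding, text_embeddings)
print(label)
```

No classifier is trained here: adding a new class only requires embedding one more caption, which is why contrastive encoders are commonly evaluated in this zero-shot setting.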

What is the key insight behind Perception Encoder?

The paper's title states its main claim: the best visual embeddings are not at the output of the network. As the authors put it, "Contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network."^[1] Through layer-by-layer probing of a frozen PE network, they found that different network depths specialize in different task families, and that the strongest general features for downstream use are concentrated in intermediate layers rather than the final contrastive output layer.^[1]^[3] For the largest model, the most useful features for language tasks were drawn from around layer 47 of the 50-layer vision transformer rather than from the last layer.^[4]

The authors attribute this to the robust pretraining techniques used (progressive resolution scheduling, heavy augmentation, and regularization), which push general-purpose features deeper into the network while the final layer becomes more specialized for the contrastive caption-matching objective. To make these hidden features usable, PE introduces two lightweight alignment procedures applied on top of the same contrastively pretrained backbone:^[1]^[3]

Language alignment, which adapts intermediate features for use as the vision encoder in a multimodal large language model.
Spatial alignment, which adapts intermediate features for dense prediction tasks such as detection, segmentation, depth estimation, and tracking.

The broader argument is that a single contrastive pretraining pipeline, plus task-appropriate alignment, can replace several separately trained, task-specific encoders.^[3]
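The idea of reading features from an intermediate layer rather than the final one can be sketched in a few lines. The toy "network" below is a list of trivial functions, and the per-layer probe scores are invented for illustration; in an actual study each score would come from training a probe on that layer's frozen features. Nothing here reflects the authors' code.

```python
def run_with_hidden_states(layers, x):
    """Apply layers in order and keep the output of every layer."""
    hidden_states = []
    for layer in layers:
        x = layer(x)
        hidden_states.append(x)
    return hidden_states

def pick_best_layer(probe_scores):
    """Return the 1-based index of the layer with the highest probe score."""
    best_index = max(range(len(probe_scores)), key=lambda i: probe_scores[i])
    return best_index + 1

# Toy five-layer "network": layer d adds d to every feature (hypothetical).
layers = [lambda v, d=d: [f + d for f in v] for d in range(1, 6)]
hidden = run_with_hidden_states(layers, [0.0, 1.0])

# Hypothetical probe accuracies, one per layer. The peak is deliberately
# placed before the last layer to mirror the pattern described above.
probe_scores = [0.41, 0.58, 0.73, 0.81, 0.77]

best = pick_best_layer(probe_scores)
features = hidden[best - 1]  # features taken from the chosen layer
print(best, features)
```

The design point is that the forward pass exposes every layer's output, so the choice of which layer to use becomes a downstream decision instead of being fixed at the network's output.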

How is Perception Encoder trained?

The base encoder, called PE-Core, is trained with an enhanced contrastive image-text objective rather than any task-specific loss. The paper describes a "robust" image pretraining recipe scaled to roughly 1 ZFLOP of compute over about 2.3 billion image-text pairs, with several refinements over a vanilla CLIP recipe:^[1]^[4]

progressive increase of input resolution during training (for example from 98 up to 336 pixels);
larger training batch sizes (on the order of tens of thousands of pairs);
the LAMB optimizer with a higher learning rate;
2D rotary positional embeddings (2D RoPE);
attention pooling on top of the vision tower;
aggressive data augmentation and mask-based regularization.
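The contrastive objective that these refinements build on can be sketched as a symmetric InfoNCE loss over a batch of matched image-text pairs. This is a minimal, dependency-free version of the generic CLIP-style loss, not PE's training code; the temperature of 0.07 is an assumed value commonly used to initialize CLIP-style models, and the two-dimensional embeddings are toy inputs.

```python
import math

def log_softmax_row(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_sum for v in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; pair i of images and texts is the positive."""
    n = len(image_embs)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]
    # Image-to-text direction: row i should assign high probability to column i.
    i2t = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: the same computation on the transposed logits.
    columns = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax_row(columns[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy unit-norm embeddings (assumed already L2-normalized).
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = contrastive_loss(aligned, aligned)
loss_mismatched = contrastive_loss(aligned, shuffled)
print(loss_matched < loss_mismatched)
```

The loss is near zero when each image is most similar to its own caption and large when the pairing is scrambled, which is the signal that pulls matching pairs together in the shared space.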

To extend the encoder to video, the authors built a video data engine that produced roughly 22 million synthetic video captions, and they trained on video by simply averaging features from a handful of uniformly sampled frames (eight in the reported setup). They report that adding video data improved both video and image performance.^[1] As part of the release Meta also published the PE Video Dataset (PVD), a collection of about one million videos with roughly 120,000 human-refined captions, totaling several thousand hours of footage.^[2]^[4]
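The frame-averaging approach described above can be sketched as follows. The choice to sample the centre of each equal-length segment is one assumed reading of "uniformly sampled", and `toy_encoder` is a hypothetical placeholder for a real image encoder.

```python
def uniform_frame_indices(num_frames, num_samples=8):
    """Pick indices spread evenly over a clip (centre of each equal segment)."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    segment = num_frames / num_samples
    return [min(num_frames - 1, int(segment * (k + 0.5))) for k in range(num_samples)]

def average_frame_features(frame_features):
    """Element-wise mean of per-frame feature vectors."""
    count = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / count for d in range(dim)]

def encode_video(frames, encode_frame, num_samples=8):
    """Encode a clip by averaging image features of uniformly sampled frames."""
    indices = uniform_frame_indices(len(frames), num_samples)
    return average_frame_features([encode_frame(frames[i]) for i in indices])

# Toy clip: each "frame" is just its index, and the hypothetical encoder
# maps a frame to a 2-dimensional feature vector.
frames = list(range(80))
toy_encoder = lambda frame: [float(frame), 1.0]

indices = uniform_frame_indices(len(frames))
video_embedding = encode_video(frames, toy_encoder)
print(indices, video_embedding)
```

Because the video embedding is just a mean of image embeddings, the same image encoder and text tower can be reused without any video-specific architecture.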

PE-Lang is produced from PE-Core by language alignment: intermediate features are warmed up and then fine-tuned on a large mix of image and video samples so they can drive a language decoder. PE-Spatial is produced by spatial alignment, which takes the strong spatial features already present in PE-Core's intermediate layers and aligns them to the output using a frozen-teacher self-distillation loss, further refined with a mask-based strategy derived from SAM 2.^[3]^[5]

What are the PE variants and sizes?

PE is released in three main scales (B, L, and G) and three functional variants (Core, Lang, and Spatial); smaller T and S configurations are also listed for PE-Core.^[5]^[6] The three main scales of the contrastive backbone are summarized below.

Scale	Vision params	ViT width	Depth	Typical config
PE-Core B	0.09B	768	12	B/16, 224px
PE-Core L	0.32B	1024	24	L/14, 336px
PE-Core G	1.88B	1536	50	G/14, 448px
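To make the "Typical config" column concrete, the short calculation below works out how many patch tokens each configuration produces per image. It assumes the usual ViT naming convention, in which the number after the slash is the patch size in pixels, and it counts only patch tokens (no class or pooling token).

```python
def patch_tokens(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    side = image_size // patch_size
    return side * side

# (input resolution in pixels, patch size in pixels), read from the table above.
configs = {
    "PE-Core B": (224, 16),
    "PE-Core L": (336, 14),
    "PE-Core G": (448, 14),
}

tokens = {name: patch_tokens(size, patch) for name, (size, patch) in configs.items()}
print(tokens)
```

Under these assumptions the sequence length grows from 196 tokens at B/16 224px to 1,024 tokens at G/14 448px, so the larger models also process several times more tokens per image.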

The functional variants differ by how the backbone is used:

Variant	Purpose	Notable configurations
PE-Core	Zero-shot classification and image/video-text retrieval	T/16, S/16, B/16, L/14, G/14
PE-Lang	Vision encoder for multimodal language models	L/14 448px, G/14 448px
PE-Spatial	Dense prediction (detection, segmentation, depth, tracking)	G/14 448px, plus distilled B/L variants

In December 2025 Meta added an audio-visual variant, PE-AV, which extends the family to audio inputs and serves as an encoder for related projects; it is a later addition and not part of the original April 2025 paper.^[7]

How well does Perception Encoder perform on benchmarks?

Meta reported state-of-the-art or best-in-class results across a broad suite of benchmarks, drawn from PE-Core, PE-Lang (via PLM), and PE-Spatial. Representative numbers for the G-scale models are below.^[1]^[4]^[6]

Task	Benchmark	Result
Zero-shot image classification	ImageNet-1k val	85.4% top-1
Zero-shot robustness (PE-Core G)	ImageNet average robustness	86.6%
Zero-shot image classification	ImageNet-Adversarial	92.6%
Zero-shot image classification	ObjectNet	88.2%
Zero-shot video classification	Kinetics-400	76.9%
Object detection (PE-Spatial G)	COCO box mAP	66.0
Document QA (via PLM-8B)	DocVQA	94.6%
Chart QA (via PLM-8B)	ChartQA	85.5%

The 86.6% average across ImageNet robustness benchmarks is highlighted as the first time an open contrastive model of this kind outperformed encoders trained on large proprietary datasets such as JFT-3B and WebLI on that aggregate measure.^[1] The 66.0 box mAP on COCO detection is reported as a new absolute state of the art at the time, reached with a comparatively simple detection decoder on top of PE-Spatial rather than a heavily engineered, detection-specific backbone.^[3]^[5] The document- and chart-understanding numbers come from PLM-8B, which uses a PE encoder, and serve as evidence that the aligned features transfer to multimodal reasoning.^[4]

How does Perception Encoder relate to the Perception Language Model?

Perception Encoder was released alongside the Perception Language Model, an open and reproducible vision-language model that uses a PE encoder as its visual front end. PLM pairs a language-aligned PE encoder with Llama 3 decoders at 1B, 3B, and 8B parameters, connected by a two-layer MLP projector, and supports high-resolution image tiling and multi-frame video input. Notably, PLM was trained on synthetic and human-labeled data without distillation from external proprietary models.^[2]^[4] Meta also released a large set of human-labeled video question-answering and spatio-temporal caption samples and a video benchmark, PLM-VideoBench, to accompany it.^[2]
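The role of the two-layer MLP projector can be sketched as a function that maps each vision token from the encoder's width to the language model's embedding width. The dimensions below are toy values, the weights are random, and the GELU activation is an assumption; the source only states that the projector has two layers.

```python
import math
import random

def gelu(x):
    """Exact GELU activation (assumed; the activation is not specified above)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def make_linear(rng, in_dim, out_dim):
    """Randomly initialized weight matrix (out_dim x in_dim) and zero bias."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim

def linear(vec, weights, bias):
    """Affine map: weights @ vec + bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def project_tokens(tokens, layer1, layer2):
    """Apply linear -> GELU -> linear to every vision token independently."""
    projected = []
    for token in tokens:
        hidden = [gelu(h) for h in linear(token, *layer1)]
        projected.append(linear(hidden, *layer2))
    return projected

rng = random.Random(0)
vision_dim, llm_dim = 8, 16  # toy sizes, far smaller than any real model
layer1 = make_linear(rng, vision_dim, llm_dim)
layer2 = make_linear(rng, llm_dim, llm_dim)

vision_tokens = [[rng.gauss(0.0, 1.0) for _ in range(vision_dim)] for _ in range(4)]
llm_inputs = project_tokens(vision_tokens, layer1, layer2)
print(len(llm_inputs), len(llm_inputs[0]))
```

The projected tokens have the same width as the decoder's text embeddings, so they can be placed in the input sequence alongside ordinary text tokens.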

PE and PLM were two of five releases announced by Meta FAIR on April 17, 2025; the others were Meta Locate 3D (3D object localization from natural-language queries), the Dynamic Byte Latent Transformer, and the Collaborative Reasoner framework.^[2]

Reception

The paper was accepted to NeurIPS 2025.^[8] Coverage framed PE as evidence that a single, simply trained encoder can serve as a general visual backbone, and that careful data, training, and alignment choices matter more than task-specific pretraining objectives for producing transferable features.^[9] Because the models and the PE Video Dataset were released openly, PE has been used as a drop-in vision encoder in subsequent open multimodal systems.

References

1. Bolya, Daniel; Huang, Po-Yao; Sun, Peize; et al. "Perception Encoder: The best visual embeddings are not at the output of the network." arXiv:2504.13181, April 17, 2025. https://arxiv.org/abs/2504.13181
2. Meta AI. "Advancing AI systems through progress in perception, localization, and reasoning." April 17, 2025. https://ai.meta.com/blog/meta-fair-updates-perception-localization-reasoning/
3. alphaXiv. "Perception Encoder: The best visual embeddings are not at the output of the network (overview)." https://www.alphaxiv.org/overview/2504.13181v2
4. Hugging Face. "Paper page: Perception Encoder: The best visual embeddings are not at the output of the network." https://huggingface.co/papers/2504.13181
5. Hugging Face. "facebook/PE-Spatial-G14-448 model card." https://huggingface.co/facebook/PE-Spatial-G14-448
6. facebookresearch. "perception_models: State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More." GitHub. https://github.com/facebookresearch/perception_models
7. MarkTechPost. "Meta AI Open-Sourced Perception Encoder Audiovisual (PE-AV)." December 22, 2025. https://www.marktechpost.com/2025/12/22/meta-ai-open-sourced-perception-encoder-audiovisual-pe-av-the-audiovisual-encoder-powering-sam-audio-and-large-scale-multimodal-retrieval/
8. NeurIPS. "Perception Encoder: The best visual embeddings are not at the output of the network (poster)." NeurIPS 2025. https://neurips.cc/virtual/2025/poster/118805
9. MarkTechPost. "Meta AI Introduces Perception Encoder: A Large-Scale Vision Encoder that Excels Across Several Vision Tasks for Images and Video." April 18, 2025. https://www.marktechpost.com/2025/04/18/meta-ai-introduces-perception-encoder-a-large-scale-vision-encoder-that-excels-across-several-vision-tasks-for-images-and-video/
