# Perception Encoder

> Source: https://aiwiki.ai/wiki/perception_encoder
> Updated: 2026-06-28
> Categories: Computer Vision, Meta AI, Multimodal AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Perception Encoder** (**PE**) is a family of vision and vision-language encoders from [Meta AI](/wiki/meta_ai)'s Fundamental AI Research (FAIR) group, released in April 2025, whose central finding is that the strongest general-purpose visual embeddings sit in a network's intermediate layers rather than at its final output.[1] A single [vision transformer](/wiki/vision_transformer) trained only with a [contrastive](/wiki/contrastive_learning) image- and video-text objective reaches state-of-the-art results on classification, retrieval, multimodal language modeling, and dense spatial tasks, once two lightweight alignment methods (language alignment and spatial alignment) are used to surface those hidden internal features.[1] PE is described in the paper "Perception Encoder: The best visual embeddings are not at the output of the network" by Daniel Bolya, Po-Yao Huang, Peize Sun, Piotr Dollár, Christoph Feichtenhofer, and colleagues, and was released openly together with Meta's [Perception Language Model](/wiki/perception_language_model) (PLM).[1][2]

## What is the Perception Encoder?

Perception Encoder is Meta FAIR's open vision and vision-language encoder family, first published on arXiv on April 17, 2025 (arXiv:2504.13181).[1] It is built around a single contrastively trained vision transformer, scaled to roughly 2 billion parameters at the largest size, that produces general-purpose visual features for both images and video. Rather than training separate, task-specialized encoders, PE shows that one backbone plus task-appropriate alignment can serve classification, retrieval, multimodal large language models, and dense prediction at once.[1][3] PE was released under an Apache 2.0 license with weights on Hugging Face and code in the `facebookresearch/perception_models` repository.[5][6]

## How does Perception Encoder differ from CLIP and DINOv2?

Vision encoders such as OpenAI's [CLIP](/wiki/clip) are typically trained by aligning images with text captions in a shared embedding space, which makes them good at zero-shot classification and image-text retrieval. A separate line of work, exemplified by Meta's [DINOv2](/wiki/dinov2), uses self-supervised objectives to learn features that transfer well to dense prediction tasks like segmentation and depth estimation. A common assumption was that no single pretraining recipe could be best for both kinds of task, so practitioners often combined or specialized encoders. Perception Encoder challenges this assumption by showing that one contrastively trained vision transformer can match specialized models across classification, retrieval, multimodal language modeling, and dense spatial tasks, provided its internal features are surfaced correctly.[1]
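As a rough illustration of why a shared image-text embedding space enables zero-shot classification, the sketch below picks the class whose text embedding is most similar to an image embedding. It is a minimal, generic example rather than PE's or CLIP's actual API: the function names and the toy two-dimensional embeddings are hypothetical, and a real system would obtain the vectors from the image and text encoders.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb: List[float],
                       class_text_embs: Dict[str, List[float]]) -> str:
    """Return the class name whose text embedding best matches the image."""
    return max(class_text_embs,
               key=lambda name: cosine_similarity(image_emb, class_text_embs[name]))


# Toy, made-up embeddings (assumption): real ones come from trained encoders.
toy_text_embs = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
predicted = zero_shot_classify([0.9, 0.1], toy_text_embs)
```

No classifier is trained here: adding a new class only requires embedding a new text prompt.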

## What is the key insight behind Perception Encoder?

The paper's title states its main claim: the best visual embeddings are not at the output of the network. As the authors put it, "Contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network."[1] Through layer-by-layer probing of a frozen PE network, they found that different network depths specialize in different task families, and that the strongest general features for downstream use are concentrated in intermediate layers rather than the final contrastive output layer.[1][3] For the largest model, the most useful features for language tasks were drawn from around layer 47 of the 50-layer vision transformer rather than from the last layer.[4]
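The idea of layer-by-layer probing can be sketched as follows: run an input through a frozen stack of layers, keep every intermediate output, and select the layer whose features score best under some downstream probe. This is a schematic illustration only. The toy layers, the helper names, and the probe scores are hypothetical stand-ins, not the paper's code or measurements.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def forward_with_intermediates(layers: Sequence[Layer], x: Vector) -> List[Vector]:
    """Run x through every layer in order and keep each layer's output."""
    outputs: List[Vector] = []
    hidden = x
    for layer in layers:
        hidden = layer(hidden)
        outputs.append(hidden)
    return outputs


def best_layer(probe_scores: Sequence[float]) -> int:
    """Return the 1-indexed layer whose frozen features scored highest."""
    best_index = max(range(len(probe_scores)), key=lambda i: probe_scores[i])
    return best_index + 1


# Toy stand-in for a frozen network (assumption): three identical layers.
toy_layers = [lambda h: [v + 1.0 for v in h]] * 3
features_per_layer = forward_with_intermediates(toy_layers, [0.0])

# Hypothetical probe accuracies, one per layer; not numbers from the paper.
toy_probe_scores = [0.20, 0.90, 0.50]
chosen = best_layer(toy_probe_scores)  # an intermediate layer, not the last one
```

In this toy setup the last layer is not the best one, which mirrors the qualitative finding described above without reproducing any reported values.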

The authors attribute this to the robust pretraining techniques used (progressive resolution scheduling, heavy augmentation, and regularization), which push general-purpose features deeper into the network while the final layer becomes more specialized for the contrastive caption-matching objective. To make these hidden features usable, PE introduces two lightweight alignment procedures applied on top of the same contrastively pretrained backbone:[1][3]

- **Language alignment**, which adapts intermediate features for use as the vision encoder in a multimodal large language model.
- **Spatial alignment**, which adapts intermediate features for dense prediction tasks such as detection, segmentation, depth estimation, and tracking.

The broader argument is that a single contrastive pretraining pipeline, plus task-appropriate alignment, can replace several separately trained, task-specific encoders.[3]

## How is Perception Encoder trained?

The base encoder, called PE-Core, is trained with an enhanced contrastive image-text objective rather than any task-specific loss. The paper describes a "robust" image pretraining recipe scaled to roughly 1 ZFLOP of compute over about 2.3 billion image-text pairs, with several refinements over a vanilla CLIP recipe:[1][4]

- progressive increase of input resolution during training (for example from 98 up to 336 pixels);
- larger training batch sizes (on the order of tens of thousands of pairs);
- the LAMB optimizer with a higher learning rate;
- 2D rotary positional embeddings (2D RoPE);
- attention pooling on top of the vision tower;
- aggressive data augmentation and mask-based regularization.
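For orientation, a vanilla CLIP-style objective, the baseline that the refinements above build on, can be sketched as a symmetric cross-entropy over temperature-scaled cosine similarities within a batch. This is a generic, minimal version written with the standard library only. It does not include PE's refinements, and the temperature value and function names are assumptions for illustration.

```python
import math
from typing import List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row: Sequence[float]) -> Vector:
    """Numerically stable log-softmax of one row of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def clip_style_loss(image_embs: Sequence[Sequence[float]],
                    text_embs: Sequence[Sequence[float]],
                    temperature: float = 0.07) -> float:
    """Symmetric contrastive loss; pair k of images matches pair k of texts."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
              for img in images]
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    image_to_text = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    text_to_image = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


# Toy batch (assumption): correctly paired embeddings versus swapped captions.
toy = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = clip_style_loss(toy, toy)
loss_swapped = clip_style_loss(toy, toy[::-1])
```

The loss is small when each image is closest to its own caption and large when captions are swapped, which is the signal that pulls matching pairs together in the shared space.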

To extend the encoder to video, the authors built a video data engine that produced roughly 22 million synthetic video captions. Each clip is encoded by averaging the features of a handful of uniformly sampled frames (eight in the reported setup). They report that adding video data improved both video and image performance.[1] As part of the release, Meta also published the PE Video Dataset (PVD), a collection of about one million videos with roughly 120,000 human-refined captions, totaling several thousand hours of footage.[2][4]
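The frame-averaging approach to video can be sketched in a few lines: choose evenly spaced frame indices, embed each frame with the image encoder, and average the resulting vectors. The sketch below assumes segment-midpoint sampling, which is one common reading of "uniformly sampled" and not necessarily the exact scheme used by the authors. The function names are hypothetical, and the per-frame features are toy values standing in for encoder outputs.

```python
from typing import List, Sequence


def uniform_frame_indices(num_frames_in_video: int, num_samples: int = 8) -> List[int]:
    """Pick indices at the midpoints of num_samples equal-length segments."""
    if num_frames_in_video <= 0 or num_samples <= 0:
        raise ValueError("frame and sample counts must be positive")
    step = num_frames_in_video / num_samples
    return [min(int(step * (k + 0.5)), num_frames_in_video - 1)
            for k in range(num_samples)]


def average_frame_features(frame_features: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame feature vectors into a single video-level vector."""
    count = len(frame_features)
    dim = len(frame_features[0])
    return [sum(frame[d] for frame in frame_features) / count for d in range(dim)]


# An 80-frame clip sampled at eight positions (toy numbers).
indices = uniform_frame_indices(80, 8)

# Toy per-frame features (assumption): a real pipeline would call the encoder.
video_feature = average_frame_features([[1.0, 2.0], [3.0, 4.0]])
```

Because the video feature is just a mean of image features, the same encoder weights serve both modalities without any video-specific layers.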

PE-Lang is produced from PE-Core by language alignment: intermediate features are warmed up and then fine-tuned on a large mix of image and video samples so they can drive a language decoder. PE-Spatial is produced by spatial alignment, which takes the strong spatial features already present in PE-Core's intermediate layers and aligns them to the output using a frozen-teacher self-distillation loss, further refined with a mask-based strategy derived from [SAM 2](/wiki/sam_2).[3][5]

## What are the PE variants and sizes?

PE is released in three main scales (B, L, and G) and in three functional variants (Core, Lang, and Spatial), under an Apache 2.0 license, with weights on Hugging Face and code in the `facebookresearch/perception_models` repository; PE-Core is additionally listed in smaller T/16 and S/16 configurations.[5][6] The three main scales of the contrastive backbone are summarized below.

| Scale | Vision params | ViT width | Depth | Typical config |
|-------|--------------|-----------|-------|----------------|
| PE-Core B | 0.09B | 768 | 12 | B/16, 224px |
| PE-Core L | 0.32B | 1024 | 24 | L/14, 336px |
| PE-Core G | 1.88B | 1536 | 50 | G/14, 448px |

The functional variants differ by how the backbone is used:

| Variant | Purpose | Notable configurations |
|---------|---------|------------------------|
| PE-Core | Zero-shot classification and image/video-text retrieval | T/16, S/16, B/16, L/14, G/14 |
| PE-Lang | Vision encoder for multimodal language models | L/14 448px, G/14 448px |
| PE-Spatial | Dense prediction (detection, segmentation, depth, tracking) | G/14 448px, plus distilled B/L variants |

In December 2025 Meta added an audio-visual variant, PE-AV, which extends the family to audio inputs and serves as an encoder for related projects; it is a later addition and not part of the original April 2025 paper.[7]

## How well does Perception Encoder perform on benchmarks?

Meta reported state-of-the-art or best-in-class results across a broad suite of benchmarks, drawn from PE-Core, PE-Lang (via PLM), and PE-Spatial. Representative numbers for the G-scale models are below.[1][4][6]

| Task | Benchmark | Result |
|------|-----------|--------|
| Zero-shot image classification | ImageNet-1k val | 85.4% top-1 |
| Zero-shot robustness (PE-Core G) | ImageNet average robustness | 86.6% |
| Zero-shot image classification | ImageNet-Adversarial | 92.6% |
| Zero-shot image classification | ObjectNet | 88.2% |
| Zero-shot video classification | Kinetics-400 | 76.9% |
| Object detection (PE-Spatial G) | COCO box mAP | 66.0 |
| Document QA (via PLM-8B) | DocVQA | 94.6% |
| Chart QA (via PLM-8B) | ChartQA | 85.5% |

The 86.6% average across ImageNet robustness benchmarks is highlighted as the first time an open contrastive model of this kind outperformed encoders trained on large proprietary datasets such as JFT-3B and WebLI on that aggregate measure.[1] The 66.0 box mAP on COCO detection is reported as a new absolute state of the art at the time, reached with a comparatively simple detection decoder on top of PE-Spatial rather than a heavily engineered, detection-specific backbone.[3][5] The document- and chart-understanding numbers come from PLM-8B, which uses a PE encoder, and serve as evidence that the aligned features transfer to multimodal reasoning.[4]

## How does Perception Encoder relate to the Perception Language Model?

Perception Encoder was released alongside the Perception Language Model, an open and reproducible vision-language model that uses a PE encoder as its visual front end. PLM pairs a language-aligned PE encoder with [Llama 3](/wiki/llama_3) decoders at 1B, 3B, and 8B parameters, connected by a two-layer MLP projector, and supports high-resolution image tiling and multi-frame video input. Notably, PLM was trained on synthetic and human-labeled data without distillation from external proprietary models.[2][4] Meta also released a large set of human-labeled video question-answering and spatio-temporal caption samples and a video benchmark, PLM-VideoBench, to accompany it.[2]
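To make the role of the two-layer MLP projector concrete, the sketch below maps each vision token into a language model's embedding space using two linear layers with a nonlinearity in between. It is a schematic written with the standard library only: the GELU activation, the dimensions, the initialization, and the class name are all assumptions for illustration, not details confirmed for PLM.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def gelu(x: float) -> float:
    """Exact GELU activation (assumed here; the source does not name one)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def make_linear(in_dim: int, out_dim: int, rng: random.Random) -> Tuple[Matrix, Vector]:
    """Create a randomly initialized weight matrix and a zero bias."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def linear(x: Sequence[float], weight: Matrix, bias: Vector) -> Vector:
    """Compute weight @ x + bias for a single vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


class TwoLayerProjector:
    """Hypothetical projector: vision tokens -> language-model embedding size."""

    def __init__(self, vision_dim: int, hidden_dim: int, llm_dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.w1, self.b1 = make_linear(vision_dim, hidden_dim, rng)
        self.w2, self.b2 = make_linear(hidden_dim, llm_dim, rng)

    def __call__(self, tokens: Sequence[Sequence[float]]) -> List[Vector]:
        projected = []
        for token in tokens:
            hidden = [gelu(v) for v in linear(token, self.w1, self.b1)]
            projected.append(linear(hidden, self.w2, self.b2))
        return projected


# Toy sizes (assumption): 3 vision tokens of width 4 mapped to width 6.
projector = TwoLayerProjector(vision_dim=4, hidden_dim=8, llm_dim=6)
llm_tokens = projector([[0.1, 0.2, 0.3, 0.4]] * 3)
```

The projector changes only the width of each token, so the number of visual tokens passed to the decoder is unchanged by this step.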

PE and PLM were two of five releases announced by Meta FAIR on April 17, 2025; the others were Meta Locate 3D (3D object localization from natural-language queries), the Dynamic Byte Latent Transformer, and the Collaborative Reasoner framework.[2]

## Reception

The paper was accepted to NeurIPS 2025.[8] Coverage framed PE as evidence that a single, simply trained encoder can serve as a general visual backbone, and that careful data, training, and alignment choices matter more than task-specific pretraining objectives for producing transferable features.[9] Because the models and the PE Video Dataset were released openly, PE has been used as a drop-in vision encoder in subsequent open multimodal systems.

## References

1. Bolya, Daniel; Huang, Po-Yao; Sun, Peize; et al. "Perception Encoder: The best visual embeddings are not at the output of the network." arXiv:2504.13181, April 17, 2025. https://arxiv.org/abs/2504.13181
2. Meta AI. "Advancing AI systems through progress in perception, localization, and reasoning." April 17, 2025. https://ai.meta.com/blog/meta-fair-updates-perception-localization-reasoning/
3. alphaXiv. "Perception Encoder: The best visual embeddings are not at the output of the network (overview)." https://www.alphaxiv.org/overview/2504.13181v2
4. Hugging Face. "Paper page: Perception Encoder: The best visual embeddings are not at the output of the network." https://huggingface.co/papers/2504.13181
5. Hugging Face. "facebook/PE-Spatial-G14-448 model card." https://huggingface.co/facebook/PE-Spatial-G14-448
6. facebookresearch. "perception_models: State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More." GitHub. https://github.com/facebookresearch/perception_models
7. MarkTechPost. "Meta AI Open-Sourced Perception Encoder Audiovisual (PE-AV)." December 22, 2025. https://www.marktechpost.com/2025/12/22/meta-ai-open-sourced-perception-encoder-audiovisual-pe-av-the-audiovisual-encoder-powering-sam-audio-and-large-scale-multimodal-retrieval/
8. NeurIPS. "Perception Encoder: The best visual embeddings are not at the output of the network (poster)." NeurIPS 2025. https://neurips.cc/virtual/2025/poster/118805
9. MarkTechPost. "Meta AI Introduces Perception Encoder: A Large-Scale Vision Encoder that Excels Across Several Vision Tasks for Images and Video." April 18, 2025. https://www.marktechpost.com/2025/04/18/meta-ai-introduces-perception-encoder-a-large-scale-vision-encoder-that-excels-across-several-vision-tasks-for-images-and-video/