Feature Pyramid Network (FPN)

Updated Jun 25, 2026

A Feature Pyramid Network (FPN) is a generic feature-extraction architecture for object detection and other dense-prediction tasks that builds a multi-scale feature pyramid with strong semantics at every level, using a bottom-up pathway, a top-down pathway, and lateral connections on top of a standard convolutional neural network backbone ^[1]. It combines low-resolution, semantically strong features with high-resolution, semantically weak features so that detectors can recognize objects of very different sizes from a single backbone forward pass. FPN was introduced by Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie in the 2017 paper "Feature Pyramid Networks for Object Detection," and it serves as the standard multi-scale feature extractor in the FPN-based versions of Faster R-CNN, Mask R-CNN, and RetinaNet ^[1]^[2]^[3].

The FPN paper was first posted on arXiv as 1612.03144 on 9 December 2016 and presented at CVPR 2017 ^[1]. Most of the authors were at Facebook AI Research (FAIR) at the time; the paper lists Cornell University and Cornell Tech as the affiliation of Belongie and as a second affiliation of Lin ^[1]. As the paper puts it, "Feature pyramids are a basic component in recognition systems for detecting objects at different scales" ^[1]. Beyond the three detectors above, the design is used in Panoptic FPN, FCOS, ATSS, and many other detection and segmentation pipelines ^[7]^[8]^[9].

What is a Feature Pyramid Network?

A Feature Pyramid Network is a neck module that turns the single-scale feature maps of a convolutional backbone into a pyramid of feature maps at multiple resolutions, each carrying strong, high-level semantics. The authors state their objective directly: "Our goal is to leverage a ConvNet's pyramidal feature hierarchy, which has semantics from low to high levels, and build a feature pyramid with high-level semantics throughout" ^[1]. In practice the network "takes a single-scale image of an arbitrary size as input, and outputs proportionally sized feature maps at multiple levels, in a fully convolutional fashion" ^[1].

The key insight is that a deep network already computes features at many scales as part of its normal forward pass, but the high-resolution early layers are semantically weak. FPN fixes this by propagating semantics from the deepest layer back down into the high-resolution maps, producing a pyramid (typically labeled P2 through P6 or P7) in which every level is useful for detection. Because the pyramid is built "with marginal extra cost," FPN delivers the resolution benefits of an image pyramid at roughly the speed of a single forward pass ^[1].

Why was FPN created? (Motivation)

Detecting objects at very different scales is one of the long-standing problems in computer vision. Before FPN, four broad strategies dominated the literature, each with a clear weakness.

The first was the classic image pyramid, in which the input image is resized to several scales and the same network is run on each scale. Image pyramids produce features whose semantic strength is consistent across scales, but they multiply inference time by the number of scales and are too slow for real-time detection ^[1].

The second was running predictions on a single feature map at the top of the backbone, as in the original Faster R-CNN ^[14]. This is fast but throws away the higher-resolution intermediate features needed to localize small objects.

The third strategy was the in-network pyramidal feature hierarchy, used by SSD (Single Shot MultiBox Detector). SSD predicts on multiple feature maps at different depths of the backbone. The problem is that shallow feature maps have high resolution but weak semantics, so small objects, which can only be detected from those shallow maps, are predicted from features that have not been sufficiently abstracted ^[1].

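The fourth was to fuse features from several layers before predicting, as in Hypercolumn-style and hyper-feature designs. These methods combine information from many depths of the network, but they typically make their predictions at a single scale rather than independently at every level of a pyramid.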

FPN is best understood as combining the resolution of an image pyramid with the semantic strength of a deep backbone, while retaining the speed of a single forward pass. The top-down pathway propagates strong semantics from the deepest layer back into the high-resolution maps, and the lateral connections re-inject precise spatial information lost during downsampling. The full feature pyramid is produced from one forward pass of the backbone ^[1].

How does an FPN work? (Architecture)

FPN attaches a top-down pathway with lateral connections to a standard convolutional backbone. The original paper used ResNet-50 and ResNet-101, but the design works with most modern convolutional neural networks ^[1]^[15]. As the paper summarizes, "The construction of our pyramid involves a bottom-up pathway, a top-down pathway, and lateral connections" ^[1].

Bottom-up pathway

The bottom-up pathway is the feed-forward computation of the backbone. For each ResNet stage, the output of the last residual block is taken as a reference feature map and labeled by stage. With a ResNet backbone the bottom-up pathway produces feature maps {C2, C3, C4, C5} corresponding to stages 2, 3, 4, and 5, with strides 4, 8, 16, and 32 pixels with respect to the input image ^[1]. C1 is omitted because its spatial dimensions are too large to be practical given memory budgets.
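
The strides above translate directly into feature-map sizes. The following is a minimal sketch of that arithmetic, not code from the paper; the function name `bottom_up_shapes` is illustrative, and the round-up behavior is an assumption that matches padded stride-2 layers in common ResNet implementations.

```python
import math

# Strides of the ResNet stage outputs used by FPN, as listed in the text.
BOTTOM_UP_STRIDES = {"C2": 4, "C3": 8, "C4": 16, "C5": 32}


def bottom_up_shapes(height, width):
    """Return the (height, width) of each bottom-up map for one input size.

    Assumption: sizes round up when the input is not divisible by the stride.
    """
    return {
        name: (math.ceil(height / stride), math.ceil(width / stride))
        for name, stride in BOTTOM_UP_STRIDES.items()
    }


shapes = bottom_up_shapes(224, 224)
for name, (h, w) in shapes.items():
    print(f"{name}: {h} x {w}")  # C2: 56 x 56 down to C5: 7 x 7
```

For a 224 x 224 input this gives 56, 28, 14, and 7 pixels per side, which makes it easy to see why a stride-2 C1 map would be far larger than anything in the pyramid.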

Top-down pathway

The top-down pathway begins with the deepest feature map, C5, which has the smallest spatial resolution but the strongest semantics. A 1x1 convolutional layer is applied to C5 to project it to a fixed channel count (256 in the original paper), giving the top of the feature pyramid, P5 ^[1]. Higher-resolution levels are then constructed by upsampling P5 by a factor of two using nearest-neighbor interpolation and merging with the corresponding lateral feature map.

The iteration is straightforward: at each step, the previous pyramid level is upsampled 2x with nearest-neighbor interpolation, the corresponding bottom-up map is passed through a 1x1 convolution to match the channel count, and the two are added element-wise. The merged map is then smoothed by a 3x3 convolution to reduce the aliasing artifacts introduced by upsampling ^[1]. The result is a feature pyramid {P2, P3, P4, P5} with strides 4, 8, 16, and 32, all with 256 output channels.
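
One merge step can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: the maps have a single channel and are stored as nested lists, the lateral map is assumed to be already projected by its 1x1 convolution, and the 3x3 smoothing convolution is omitted. The helper names are hypothetical.

```python
def upsample_nearest_2x(fmap):
    """Nearest-neighbor 2x upsampling: repeat every value along both axes."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                     # duplicate each row
    return out


def merge(top_down, lateral):
    """Element-wise sum of the upsampled coarser level and the lateral map."""
    up = upsample_nearest_2x(top_down)
    if len(up) != len(lateral) or len(up[0]) != len(lateral[0]):
        raise ValueError("lateral map must be exactly twice the coarser map")
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]


coarse = [[1, 2],
          [3, 4]]                       # stand-in for a coarser level
lateral = [[1] * 4 for _ in range(4)]   # stand-in for the projected finer map

merged = merge(coarse, lateral)
for row in merged:
    print(row)
```

Each value of the coarse map covers a 2x2 block of the merged map, which is the blockiness that the 3x3 smoothing convolution is meant to reduce.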

Lateral connections

Lateral connections are the bridge between the two pathways. Each lateral connection takes a feature map from the bottom-up pathway, applies a 1x1 convolutional filter to reduce the channel count to 256, and adds the result to the upsampled top-down feature ^[1]. The 1x1 reduction is essential, because backbone stages have many more channels than the pyramid (a ResNet-50 C5 has 2048 channels, for example).
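
To give a sense of scale, the sketch below counts the parameters of the lateral 1x1 convolutions and the 3x3 smoothing convolutions for a ResNet-50 backbone. It is a back-of-the-envelope calculation, not a figure quoted from the paper. It assumes the standard ResNet-50 stage widths (256, 512, 1024, 2048), convolutions with biases, and one smoothing convolution per level from P2 to P5.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a standard 2D convolution."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


# Assumed output widths of ResNet-50 stages C2..C5 (bottleneck variant).
RESNET50_CHANNELS = {"C2": 256, "C3": 512, "C4": 1024, "C5": 2048}
PYRAMID_CHANNELS = 256  # d = 256 in the original paper

lateral = {name: conv_params(c, PYRAMID_CHANNELS, 1)
           for name, c in RESNET50_CHANNELS.items()}
smoothing = conv_params(PYRAMID_CHANNELS, PYRAMID_CHANNELS, 3)
total = sum(lateral.values()) + smoothing * len(RESNET50_CHANNELS)

print(f"lateral 1x1 on C5: {lateral['C5']:,}")   # 524,544
print(f"one 3x3 smoothing: {smoothing:,}")       # 590,080
print(f"all FPN layers:    {total:,}")           # 3,344,384
```

Under these assumptions the whole pyramid adds roughly 3.3 million parameters, most of them in the four 3x3 smoothing convolutions.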

Output and channel choice

In the original paper, every pyramid level uses the same number of output channels. The authors state plainly: "We set d=256 in this paper and thus all extra convolutional layers have 256-channel outputs" ^[1]. Because the channel count is identical at every level, the detection heads (RPN, classification, regression, mask) can be applied to all levels with the same parameters, which both reduces the parameter count and produces a single set of head weights that works at every scale. P6, an extra coarser level, is created by stride-2 max-pooling of P5 and is used as an additional anchor scale by the region proposal network in the original Faster R-CNN with FPN configuration ^[1]. RetinaNet later used a different set of five levels, {P3, P4, P5, P6, P7}, in which P6 is a 3x3 stride-2 convolution on C5 and P7 is a 3x3 stride-2 convolution applied to ReLU(P6) ^[2].
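
The extra coarse level in the Faster R-CNN configuration is cheap to produce. The sketch below assumes the stride-2 max-pooling uses a window of size 1, in which case it reduces to keeping every second row and column; a larger pooling window would give different values. The toy map and the helper names are illustrative.

```python
def subsample_stride_2(fmap):
    """Keep every second row and column (stride-2 pooling with a 1x1 window)."""
    return [row[::2] for row in fmap[::2]]


def level_stride(level):
    """Stride of pyramid level P<level> relative to the input image."""
    return 2 ** level


p5 = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 map
p6 = subsample_stride_2(p5)

print(p6)                                    # [[0, 2], [8, 10]]
print([level_stride(k) for k in range(2, 8)])  # [4, 8, 16, 32, 64, 128]
```

No new parameters are involved in this variant of P6, whereas the RetinaNet levels P6 and P7 each add a learned 3x3 convolution.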

Anchor and scale assignment

When FPN is used as the backbone for the region proposal network in Faster R-CNN, anchors of a single scale are placed on each pyramid level, with the scale chosen to match that level's stride. The original paper used anchor areas of 32², 64², 128², 256², and 512² pixels on {P2, P3, P4, P5, P6}, each with three aspect ratios (1:2, 1:1, 2:1) ^[1]. Different pyramid levels are therefore specialized for different object scales, which is what makes the multi-scale prediction effective.
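
The anchor layout described above can be enumerated directly. This is a minimal sketch, assuming that each aspect ratio is interpreted as height divided by width and that the anchor area is held fixed while the ratio changes; the names `ANCHOR_SIDES` and `anchor_shapes` are illustrative rather than taken from any library.

```python
import math

# Side length of the square anchor on each level (area = side ** 2).
ANCHOR_SIDES = {"P2": 32, "P3": 64, "P4": 128, "P5": 256, "P6": 512}
ASPECT_RATIOS = (0.5, 1.0, 2.0)  # assumed to mean height / width


def anchor_shapes(side, ratios=ASPECT_RATIOS):
    """Return (width, height) pairs with area side**2 for each aspect ratio."""
    area = side * side
    shapes = []
    for ratio in ratios:
        width = math.sqrt(area / ratio)
        height = width * ratio
        shapes.append((width, height))
    return shapes


anchors = {level: anchor_shapes(side) for level, side in ANCHOR_SIDES.items()}
total_anchor_types = sum(len(shapes) for shapes in anchors.values())

for level, shapes in anchors.items():
    print(level, [(round(w, 1), round(h, 1)) for w, h in shapes])
print("anchor types over the pyramid:", total_anchor_types)  # 15
```

Five levels with three ratios each give 15 anchor types in total, but any single location on a given level only carries three of them.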

Region of Interest assignment

When training the second-stage Fast R-CNN head, each region proposal is assigned to a single pyramid level based on its size. A small proposal is processed at a higher-resolution level (lower index, smaller stride), while a large proposal is processed at a coarser level. The paper uses a heuristic mapping based on proposal area, which keeps the head computation cheap and makes each level responsible for proposals of an appropriate scale ^[1].
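
The heuristic in the FPN paper maps a proposal of width w and height h to level k = floor(k0 + log2(sqrt(w * h) / 224)), with k0 = 4, so that a 224 x 224 proposal lands on P4. The sketch below implements that rule; clamping the result to the range P2 to P5 is an assumption that follows common implementations, and the function and parameter names are illustrative.

```python
import math


def roi_to_level(width, height, k0=4, canonical_size=224, k_min=2, k_max=5):
    """Pick the pyramid level for a proposal from its area.

    k = floor(k0 + log2(sqrt(w * h) / canonical_size)), clamped to
    [k_min, k_max] (the clamp is an assumed implementation detail).
    """
    scale = math.sqrt(width * height)
    k = math.floor(k0 + math.log2(scale / canonical_size))
    return max(k_min, min(k_max, k))


for w, h in [(32, 32), (112, 112), (224, 224), (448, 448), (1000, 1000)]:
    print(f"{w} x {h} -> P{roi_to_level(w, h)}")
```

Halving the side length of a proposal moves it one level finer, which mirrors the factor-of-two stride change between adjacent pyramid levels.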

Why are FPNs used in object detection? (Use in detection systems)

FPN was designed as a generic feature extractor and was demonstrated on multiple detectors in the original paper and follow-up work. The table below summarizes the most influential systems that adopted it.

System	Year	Authors	Role of FPN	Notes
Faster R-CNN with FPN	2017	Lin, Dollar, Girshick, He, Hariharan, Belongie	Backbone for both RPN and the second-stage head	Main system in the original FPN paper; substantial gains for small objects ^[1]
Mask R-CNN	2017	He, Gkioxari, Dollar, Girshick	Default backbone for box, mask, and keypoint heads	Authors note FPN outperforms a single-scale C4 backbone ^[3]
RetinaNet	2017	Lin, Goyal, Girshick, He, Dollar	FPN with levels P3 to P7 carries the dense classification and regression subnetworks	One-stage detector combined with focal loss; 39.1 AP on COCO test-dev with ResNet-101-FPN ^[2]
Panoptic FPN	2019	Kirillov, Girshick, He, Dollar	Shared FPN backbone for instance and semantic segmentation branches	Extends Mask R-CNN to panoptic segmentation ^[7]
FCOS	2019	Tian, Shen, Chen, He	FPN with levels P3 to P7 for anchor-free per-pixel classification	Each level handles a specific object-size range ^[8]
ATSS	2020	Zhang, Chi, Yao, Lei, Li	FPN backbone with adaptive sample selection across pyramid levels	Bridges anchor-based and anchor-free detection ^[9]
DetectoRS	2020	Qiao, Chen, Yuille	Recursive Feature Pyramid built on FPN	55.7% box AP on COCO test-dev when combined with switchable atrous convolution ^[10]

How much does FPN improve accuracy? (Performance gains)

The original paper reports careful ablations on the COCO detection benchmark ^[11]. Used as the proposal generator (RPN), FPN raises Average Recall (AR1k) to 56.3, an increase of 8.0 points over the single-scale ResNet baseline, and boosts performance on small objects by a large margin of 12.9 points ^[1]. For the full two-stage detector, the authors report that "our FPN ... is better than this strong baseline by 2.3 points AP and 3.8 points AP@0.5" with a ResNet-50 backbone ^[1]. Small-object AP improves substantially because the high-resolution P2 and P3 levels now carry strong semantics.

With ResNet-101 and the standard training schedule of the time, the single-model FPN-based Faster R-CNN reached 36.2% AP on COCO test-dev, which the paper describes as "state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners" ^[1]. The method "can run at 5 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection" ^[1]. Inference time on a single NVIDIA M40 GPU was 0.148 seconds per image with ResNet-50, faster than the 0.32 seconds per image of the single-scale ResNet-50 baseline. The paper attributes the difference to FPN's lighter-weight head, which more than offsets the small cost of the extra pyramid layers ^[1].

When FPN was combined with focal loss in RetinaNet, the ResNet-101-FPN variant reached 39.1% AP at 5 fps on COCO test-dev, surpassing all previously published one-stage and two-stage single-model results at the time of publication (ICCV 2017) ^[2]. Mask R-CNN with ResNet-101-FPN reached 35.7 mask AP on COCO instance segmentation in the original paper, again beating the 2016 challenge winner ^[3].

What variants and improvements followed FPN?

FPN is conceptually simple, and the years since 2017 have produced a long line of variants that change either the topology of the connections or the way features at different levels are fused. The table below collects the most important ones.

Variant	Year / venue	Authors	Key change versus FPN
PANet (Path Aggregation Network)	2018 / CVPR	Liu, Qi, Qin, Shi, Jia	Adds a bottom-up path on top of FPN, plus adaptive feature pooling; shortens the information path from low-level features to the head; won COCO 2017 Instance Segmentation ^[4]
NAS-FPN	2019 / CVPR	Ghiasi, Lin, Pang, Le	Uses neural architecture search to discover a feature-pyramid topology with both top-down and bottom-up connections; pushes RetinaNet to 48.3 AP at higher cost ^[5]
BiFPN	2020 / CVPR (EfficientDet)	Tan, Pang, Le	Bidirectional cross-scale connections with learnable weighted feature fusion; removes single-input nodes and stacks BiFPN modules; foundation of EfficientDet (up to 55.1 AP at 77M parameters) ^[6]
Recursive Feature Pyramid (DetectoRS)	2021 / CVPR (arXiv 2020)	Qiao, Chen, Yuille	Adds feedback connections from FPN outputs back into the bottom-up backbone; combined with switchable atrous convolution reaches 55.7% box AP on COCO test-dev ^[10]
Libra R-CNN BFP	2019 / CVPR	Pang, Chen, Shi, Feng, Ouyang, Lin	Balanced Feature Pyramid; rescales and integrates all levels into a single balanced map before re-distributing
AC-FPN	2020 / arXiv	Cao, Chen, Guo, Shi	Attention-guided context feature pyramid module
Adaptive Feature Pooling (PANet, 2018)	2018	Liu et al.	Pools each proposal from all pyramid levels rather than a single assigned level ^[4]

PANet is particularly notable because it was adopted as the neck of YOLOv4, and a similar PANet-style aggregation appears in YOLOv5 and YOLOv7 ^[4]. BiFPN sits at the heart of the EfficientDet family ^[6]. NAS-FPN was important less for absolute performance than for showing that learned topologies could match or beat the hand-designed FPN ^[5].
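
The "learnable weighted feature fusion" listed for BiFPN in the table replaces FPN's plain sum with a normalized weighted sum. The sketch below follows the fast normalized fusion form described in the EfficientDet paper, O = sum_i(w_i * I_i) / (eps + sum_j(w_j)), with each weight passed through ReLU ^[6]. It is an illustration on single-channel nested lists, with fixed rather than learned weights, and the function name and default `eps` are assumptions for this example.

```python
def fast_normalized_fusion(inputs, weights, eps=1e-4):
    """Fuse same-shaped 2D maps as sum_i(w_i * I_i) / (eps + sum_j(w_j)).

    Weights pass through ReLU first, so each one stays non-negative.
    """
    relu_w = [max(0.0, w) for w in weights]
    norm = eps + sum(relu_w)
    rows, cols = len(inputs[0]), len(inputs[0][0])
    return [[sum(w * fmap[r][c] for w, fmap in zip(relu_w, inputs)) / norm
             for c in range(cols)]
            for r in range(rows)]


top_down = [[1.0, 2.0]]
lateral = [[3.0, 4.0]]

equal = fast_normalized_fusion([top_down, lateral], [1.0, 1.0])
gated = fast_normalized_fusion([top_down, lateral], [1.0, -5.0])

print(equal)  # close to the average of the two maps
print(gated)  # the negative weight is clipped to 0, so only top_down remains
```

With equal weights the result is approximately the mean of the inputs, whereas FPN's unweighted sum always gives every input the same, fixed contribution.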

How is FPN used beyond object detection?

The pyramid idea generalizes well beyond bounding-box detection. Panoptic FPN (Kirillov, Girshick, He, and Dollar, CVPR 2019) shares a single FPN backbone between an instance segmentation branch (the Mask R-CNN heads) and a semantic segmentation branch that fuses upsampled features from each pyramid level into a per-pixel prediction at 1/4 scale ^[7]. This minimal extension provided a strong baseline for panoptic segmentation while costing very little memory or compute over Mask R-CNN.
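
Bringing every level to 1/4 scale requires a different number of 2x upsampling stages per level. The sketch below works out that count, assuming the semantic branch reads levels P2 through P5 and that each stage doubles the resolution; the helper name is illustrative.

```python
import math

TARGET_STRIDE = 4  # the semantic branch predicts at 1/4 scale


def upsampling_stages(level_stride, target_stride=TARGET_STRIDE):
    """Number of 2x upsampling stages needed to reach the target stride."""
    return int(math.log2(level_stride // target_stride))


stages = {f"P{level}": upsampling_stages(2 ** level) for level in range(2, 6)}
print(stages)  # {'P2': 0, 'P3': 1, 'P4': 2, 'P5': 3}
```

P2 is already at the target resolution, while the coarsest level used here needs three stages before the maps can be combined.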

FPN-style backbones are also widely used in semantic segmentation more broadly, in keypoint and human pose estimation (Mask R-CNN keypoint head, several top-down pose pipelines), and in dense regression tasks like depth estimation and optical flow ^[3]. The HRNet family takes a different but related approach by maintaining multiple resolutions in parallel throughout the backbone instead of building the pyramid only at the end.

What are the implementation details of FPN?

The original FPN configuration uses several specific design choices that have largely become defaults in subsequent work ^[1].

Detail	Choice
Backbone stages used	C2, C3, C4, C5 (ResNet stages 2-5)
Pyramid output channels	256 at every level (d = 256)
Lateral projection	1x1 convolution to 256 channels
Top-down upsampling	Nearest-neighbor, factor 2
Smoothing after merge	3x3 convolution, 256 channels
Extra coarse level	P6 = stride-2 max-pool of P5 (Faster R-CNN); P6 = 3x3 stride-2 conv on C5 and P7 = ReLU then 3x3 stride-2 conv on P6 in RetinaNet
RPN anchor scales (Faster R-CNN with FPN)	32², 64², 128², 256², 512² pixels on {P2, P3, P4, P5, P6}
RPN anchor aspect ratios	1:2, 1:1, 2:1
Region-to-level assignment	Heuristic based on RoI area; small RoIs go to higher-resolution levels
Head weights across levels	Shared

FPN is built into all major detection frameworks. Detectron2 ships an FPN module that wraps any backbone and exposes the standard {P2..P5} (and optionally P6, P7) interface ^[12]. MMDetection includes implementations of FPN, PAFPN (PANet-style), BiFPN, NAS-FPN, and several other variants, all behind a common neck API ^[13]. PyTorch's torchvision.models.detection package includes a ResNet50-FPN backbone for Faster R-CNN, Mask R-CNN, and RetinaNet.

Why does FPN work?

Three properties make FPN effective in practice. First, every pyramid level inherits strong semantics through the top-down propagation, so the high-resolution levels needed to detect small objects are no longer semantically weak ^[1]. Second, every level has spatial resolution appropriate to its responsible scale range, so there is no need to either rescale the input into an image pyramid (slow) or detect from a single deep, low-resolution map (poor for small objects). Third, the entire pyramid is produced by a single backbone forward pass, with only the lateral 1x1 convolutions, the upsampling operations, and the smoothing 3x3 convolutions added on top. The computational overhead over a plain backbone is small in comparison to the detection heads themselves ^[1].

A fourth, more practical property is parameter sharing. Because every pyramid level has the same channel count, a single set of detection head weights can be applied at all levels. This is what allows RetinaNet to share its dense classification and regression subnetworks across P3 through P7, and what allows Faster R-CNN with FPN to share one RPN head across all of its pyramid levels ^[1]^[2].
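
The saving from sharing can be estimated with simple arithmetic. The sketch below counts the parameters of an RPN head made of one 3x3 convolution followed by two sibling 1x1 convolutions, using 256 channels and three anchors per location. Emitting one objectness logit per anchor is an assumption (some implementations emit two scores per anchor), and the unshared total is a hypothetical comparison, not a configuration from the paper.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a standard 2D convolution."""
    return in_channels * out_channels * kernel_size ** 2 + out_channels


CHANNELS = 256             # d = 256 at every pyramid level
ANCHORS_PER_LOCATION = 3   # three aspect ratios, one scale per level
NUM_LEVELS = 5             # P2..P6

rpn_head = (
    conv_params(CHANNELS, CHANNELS, 3)                    # shared 3x3 conv
    + conv_params(CHANNELS, ANCHORS_PER_LOCATION, 1)      # objectness logits
    + conv_params(CHANNELS, 4 * ANCHORS_PER_LOCATION, 1)  # box deltas
)
shared_total = rpn_head                 # one head reused on every level
unshared_total = rpn_head * NUM_LEVELS  # hypothetical: one head per level

print(f"shared:   {shared_total:,}")    # 593,935
print(f"unshared: {unshared_total:,}")  # 2,969,675
```

Sharing is only possible because the head sees 256 input channels at every level; if the levels had different widths, each would need its own first convolution.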

What are the strengths and limitations of FPN?

Strengths

FPN is generic in the sense that it can be attached to almost any classification backbone with a few stages of downsampling. It is multi-scale by design, with each level specialized for an appropriate range of object sizes. It improves small-object detection meaningfully, which is precisely where pre-FPN single-scale detectors were weakest ^[1]. It adds modest computational overhead, mostly the lateral and smoothing convolutions, since nearest-neighbor upsampling is cheap. From 2017 onward it has been a standard component of most competitive object detectors built on convolutional backbones.

Limitations

FPN keeps several feature maps at different resolutions, which raises memory consumption compared to a single-scale backbone. The fusion is a simple element-wise sum of an upsampled top-down feature and a 1x1-projected lateral feature; later work like BiFPN argued that learnable weighted fusion is meaningfully better ^[6]. The connection topology is hand-designed, which NAS-FPN attempted to replace with a learned topology and found measurable gains ^[5]. More recently, transformer-based detectors of the DETR family have shown that competitive detection is possible without an FPN-style top-down pyramid: the original DETR applies attention to a single-scale backbone feature map, and Deformable DETR attends across multi-scale backbone features directly. In practice these approaches now coexist, and many transformer-based detectors still use FPN-style multi-scale features.

How influential has FPN been?

The original FPN paper has been cited tens of thousands of times and is one of the most-cited works in object detection from the 2010s ^[1]. Most COCO leaderboard entries between 2017 and 2020 used FPN, a PANet-style variant of FPN, or a BiFPN. The pyramid design idea has spread well beyond detection, into segmentation, keypoint estimation, human-pose pipelines, and dense regression tasks. FPN remains the default neck for SSD-like, YOLO-like, and Faster R-CNN-like detectors, with PANet-style additions in many of the most recent variants ^[4].

ELI5: Explain FPN like I am five

Imagine you are looking for objects in a photo, both tiny ones (like a faraway bird) and huge ones (like a bus right in front of you). A computer looks at a picture by shrinking it down step by step into smaller and smaller versions: the small versions "understand" what things are but are blurry, and the big versions are sharp but do not understand much. A Feature Pyramid Network is a clever trick that pours the "understanding" from the small blurry versions back up into the big sharp versions, so now the computer has a stack (a pyramid) of pictures that are both sharp and smart at every size. That lets it spot tiny objects and big objects all at once, without having to look at the photo many separate times.

References

1. Lin, T.-Y., Dollar, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017). "Feature Pyramid Networks for Object Detection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 936-944. arXiv:1612.03144.
2. Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollar, P. (2017). "Focal Loss for Dense Object Detection." Proceedings of the IEEE International Conference on Computer Vision (ICCV). arXiv:1708.02002.
3. He, K., Gkioxari, G., Dollar, P., and Girshick, R. (2017). "Mask R-CNN." Proceedings of the IEEE International Conference on Computer Vision (ICCV). arXiv:1703.06870.
4. Liu, S., Qi, L., Qin, H., Shi, J., and Jia, J. (2018). "Path Aggregation Network for Instance Segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:1803.01534.
5. Ghiasi, G., Lin, T.-Y., Pang, R., and Le, Q. V. (2019). "NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:1904.07392.
6. Tan, M., Pang, R., and Le, Q. V. (2020). "EfficientDet: Scalable and Efficient Object Detection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:1911.09070.
7. Kirillov, A., Girshick, R., He, K., and Dollar, P. (2019). "Panoptic Feature Pyramid Networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:1901.02446.
8. Tian, Z., Shen, C., Chen, H., and He, T. (2019). "FCOS: Fully Convolutional One-Stage Object Detection." Proceedings of the IEEE International Conference on Computer Vision (ICCV). arXiv:1904.01355.
9. Zhang, S., Chi, C., Yao, Y., Lei, Z., and Li, S. Z. (2020). "Bridging the Gap Between Anchor-Based and Anchor-Free Detection via Adaptive Training Sample Selection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:1912.02424.
10. Qiao, S., Chen, L.-C., and Yuille, A. (2021). "DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:2006.02334.
11. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., and Zitnick, C. L. (2014). "Microsoft COCO: Common Objects in Context." European Conference on Computer Vision (ECCV). arXiv:1405.0312.
12. Detectron2 Documentation. Facebook AI Research. https://detectron2.readthedocs.io/.
13. MMDetection Documentation. OpenMMLab. https://mmdetection.readthedocs.io/.
14. Ren, S., He, K., Girshick, R., and Sun, J. (2015). "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." Advances in Neural Information Processing Systems (NIPS). arXiv:1506.01497.
15. He, K., Zhang, X., Ren, S., and Sun, J. (2016). "Deep Residual Learning for Image Recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:1512.03385.
