Feature Pyramid Network (FPN)
Last reviewed
May 1, 2026
Sources
15 citations
Review status
Source-backed
Revision
v1 · 3,161 words
Feature Pyramid Network (FPN) is a generic feature-extraction architecture for object detection and other dense-prediction tasks that combines low-resolution, semantically strong features with high-resolution, semantically weak features through a top-down pathway and lateral connections. The result is a multi-scale feature pyramid in which every level carries rich semantics, allowing detectors to recognize objects of very different sizes from a single backbone forward pass.
FPN was introduced by Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie in the paper "Feature Pyramid Networks for Object Detection," first posted on arXiv as 1612.03144 on 9 December 2016 and presented at CVPR 2017. Lin, Dollar, Girshick, He, and Hariharan were at Facebook AI Research (FAIR) at the time, while Belongie was at Cornell University and Cornell Tech; Lin was also affiliated with Cornell. The design has since become a default ingredient in modern detection and segmentation systems, used in Mask R-CNN, RetinaNet, Panoptic FPN, FCOS, ATSS, and many other pipelines.
Detecting objects at very different scales is one of the long-standing problems in computer vision. Before FPN, four broad strategies dominated the literature, each with a clear weakness.
The first was the classic image pyramid, in which the input image is resized to several scales and the same network is run on each scale. Image pyramids produce features whose semantic strength is consistent across scales, but they multiply inference time by the number of scales and are too slow for real-time detection.
The second was running predictions on a single feature map at the top of the backbone, as in the original Faster R-CNN. This is fast but throws away the higher-resolution intermediate features needed to localize small objects.
The third strategy was the in-network pyramidal feature hierarchy, used by SSD (Single Shot MultiBox Detector). SSD predicts on multiple feature maps at different depths of the backbone. The problem is that shallow feature maps have high resolution but weak semantics, so small objects, which can only be detected from those shallow maps, are predicted from features that have not been sufficiently abstracted.
The fourth was to fuse features from several layers before predicting, as in the Hypercolumn and Hyper-features designs. These networks combine features from many depths, but they typically make predictions at a single scale only.
FPN is best understood as combining the resolution of an image pyramid with the semantic strength of a deep backbone, while retaining the speed of a single forward pass. The top-down pathway propagates strong semantics from the deepest layer back into the high-resolution maps, and the lateral connections re-inject precise spatial information lost during downsampling.
FPN attaches a top-down pathway with lateral connections to a standard convolutional backbone. The original paper used ResNet-50 and ResNet-101, but the design works with most modern convolutional neural networks.
The bottom-up pathway is the feed-forward computation of the backbone. For each ResNet stage, the output of the last residual block is taken as a reference feature map and labeled by stage. With a ResNet backbone the bottom-up pathway produces feature maps {C2, C3, C4, C5} corresponding to stages 2, 3, 4, and 5, with strides 4, 8, 16, and 32 pixels with respect to the input image. C1 is omitted because its spatial dimensions are too large to be practical given memory budgets.
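The stride figures above translate directly into feature-map sizes. The following is a minimal sketch of that arithmetic; the helper name `feature_map_sizes` is hypothetical, and rounding up for inputs that are not divisible by the stride is an assumption about how the backbone pads its layers.

```python
import math

# Strides of the ResNet stage outputs relative to the input image,
# as listed in the text.
STRIDES = {"C2": 4, "C3": 8, "C4": 16, "C5": 32}


def feature_map_sizes(height, width, strides=STRIDES):
    """Return {level: (rows, cols)} for an input of the given size.

    Assumption: sizes round up when the input is not divisible by the stride.
    """
    return {
        level: (math.ceil(height / s), math.ceil(width / s))
        for level, s in strides.items()
    }


sizes = feature_map_sizes(224, 224)
for level, (rows, cols) in sizes.items():
    print(f"{level}: {rows} x {cols}")
```

For a 224 x 224 input this gives maps of 56, 28, 14, and 7 pixels per side for C2 through C5, which is why each step of the top-down pathway only needs a 2x upsampling.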
The top-down pathway begins with the deepest feature map, C5, which has the smallest spatial resolution but the strongest semantics. A 1x1 convolutional layer is applied to C5 to project it to a fixed channel count (256 in the original paper), giving the top of the feature pyramid, P5. Higher-resolution levels are then constructed by upsampling P5 by a factor of two using nearest-neighbor interpolation and merging with the corresponding lateral feature map.
The iteration is straightforward: at each step, the previous pyramid level is upsampled 2x with nearest-neighbor interpolation, the corresponding bottom-up map is passed through a 1x1 convolution to match the channel count, and the two are added element-wise. The merged map is then smoothed by a 3x3 convolution to reduce the aliasing artifacts introduced by upsampling. The result is a feature pyramid {P2, P3, P4, P5} with strides 4, 8, 16, and 32, all with 256 output channels.
Lateral connections are the bridge between the two pathways. Each lateral connection takes a feature map from the bottom-up pathway, applies a 1x1 convolutional filter to reduce the channel count to 256, and adds the result to the upsampled top-down feature. The 1x1 reduction is essential, because backbone stages have many more channels than the pyramid (a ResNet-50 C5 has 2048 channels, for example).
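The merge described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the reference implementation: the function names are hypothetical, biases are omitted, the weights are random rather than learned, the channel widths are shrunk so the example runs quickly, and each map is assumed to be exactly twice the size of the next coarser one.

```python
import numpy as np


def conv1x1(x, weight):
    """1x1 convolution: a per-pixel linear map over channels.

    x has shape (C_in, H, W); weight has shape (C_out, C_in).
    """
    return np.einsum("oc,chw->ohw", weight, x)


def conv3x3(x, weight):
    """3x3 convolution with stride 1 and zero padding 1 (output keeps H x W).

    x has shape (C_in, H, W); weight has shape (C_out, C_in, 3, 3).
    """
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((weight.shape[0], h, w), dtype=np.result_type(x, weight))
    for dy in range(3):
        for dx in range(3):
            window = padded[:, dy:dy + h, dx:dx + w]
            out += np.einsum("oc,chw->ohw", weight[:, :, dy, dx], window)
    return out


def upsample2x_nearest(x):
    """Nearest-neighbor upsampling by 2 along both spatial axes."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def build_pyramid(c_maps, lateral_weights, smooth_weights):
    """Build [P2, P3, P4, P5] from [C2, C3, C4, C5] (finest to coarsest)."""
    # The coarsest merged map is just the 1x1 projection of C5.
    merged = [conv1x1(c_maps[-1], lateral_weights[-1])]
    # Walk from C4 down to C2, adding the upsampled coarser map each time.
    for c_map, weight in zip(c_maps[-2::-1], lateral_weights[-2::-1]):
        top_down = upsample2x_nearest(merged[0])
        merged.insert(0, conv1x1(c_map, weight) + top_down)
    # Smooth every merged map with its own 3x3 convolution.
    return [conv3x3(m, w) for m, w in zip(merged, smooth_weights)]


rng = np.random.default_rng(0)
in_channels = [8, 16, 32, 64]   # toy widths; real ResNet stages are much wider
out_channels = 4                # 256 in the original paper
sides = [16, 8, 4, 2]           # strides 4, 8, 16, 32 on a 64 x 64 input

c_maps = [rng.standard_normal((c, s, s)) for c, s in zip(in_channels, sides)]
lateral_weights = [rng.standard_normal((out_channels, c)) for c in in_channels]
smooth_weights = [
    rng.standard_normal((out_channels, out_channels, 3, 3)) for _ in in_channels
]

pyramid = build_pyramid(c_maps, lateral_weights, smooth_weights)
for name, p in zip(["P2", "P3", "P4", "P5"], pyramid):
    print(name, p.shape)
```

Every output level ends up with the same channel count and the same spatial size as its bottom-up counterpart, which is the property the shared detection heads rely on.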
In the original paper, every pyramid level uses the same number of output channels, fixed at 256. Because the channel count is identical at every level, the detection heads (RPN, classification, regression, mask) can share their parameters across levels, which both reduces the parameter count and yields a single set of head weights that works at every scale. P6, an extra coarser level, is created by stride-2 max-pooling of P5 and is used as an additional anchor scale by the region proposal network in the original Faster R-CNN with FPN configuration. RetinaNet uses a different set of extra levels: it drops P2, computes P6 as a 3x3 stride-2 convolution on C5 rather than by pooling, and computes P7 as a 3x3 stride-2 convolution applied to ReLU(P6), giving five pyramid levels {P3, P4, P5, P6, P7} for that detector.
When FPN is used as the backbone for the region proposal network in Faster R-CNN, anchors of a single scale are placed on each pyramid level, with the scale chosen to match that level's stride. The original paper used anchor areas of 32², 64², 128², 256², and 512² pixels on {P2, P3, P4, P5, P6}, each with three aspect ratios (1:2, 1:1, 2:1). Different pyramid levels are therefore specialized for different object scales, which is what makes the multi-scale prediction effective.
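The anchor layout above can be written out as a short sketch. The names `ANCHOR_SIDE` and `anchor_shapes` are hypothetical, and reading each ratio as height divided by width is an assumption, since the text does not fix the convention.

```python
import math

# One anchor scale per pyramid level: side length of the square whose
# area the anchors at that level share (32^2 ... 512^2 pixels).
ANCHOR_SIDE = {"P2": 32, "P3": 64, "P4": 128, "P5": 256, "P6": 512}

# Aspect ratios 1:2, 1:1, 2:1, read here as height / width (an assumption).
ASPECT_RATIOS = (0.5, 1.0, 2.0)


def anchor_shapes(side, ratios=ASPECT_RATIOS):
    """Return (width, height) pairs with area side**2 for each ratio."""
    area = side * side
    shapes = []
    for ratio in ratios:
        width = math.sqrt(area / ratio)
        height = width * ratio
        shapes.append((width, height))
    return shapes


for level, side in ANCHOR_SIDE.items():
    formatted = ", ".join(f"{w:.1f} x {h:.1f}" for w, h in anchor_shapes(side))
    print(f"{level}: {formatted}")
```

Three shapes on each of five levels gives 15 anchor shapes over the whole pyramid, with every level covering exactly one scale.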
When training the second-stage Fast R-CNN head, each region proposal is assigned to a single pyramid level based on its size. A small proposal is processed at a higher-resolution level (lower index, smaller stride), while a large proposal is processed at a coarser level. The paper uses a heuristic mapping based on proposal area, which keeps the head computation cheap and makes each level responsible for proposals of an appropriate scale.
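The heuristic is usually stated as k = floor(k0 + log2(sqrt(w * h) / 224)), where w and h are the proposal's width and height, 224 is the canonical ImageNet pre-training size, and k0 = 4 is the level that a 224 x 224 proposal maps to. The sketch below follows that formula; the function name `roi_to_level` is hypothetical, and clamping the result to the available levels P2 through P5 is an assumption borrowed from common implementations rather than something stated in the text above.

```python
import math


def roi_to_level(width, height, k0=4, canonical=224, k_min=2, k_max=5):
    """Map a proposal of the given size (in input pixels) to a pyramid level.

    Returns k such that the proposal should be pooled from P_k.
    Assumption: results outside [k_min, k_max] are clamped to that range.
    """
    if width <= 0 or height <= 0:
        raise ValueError("proposal width and height must be positive")
    scale = math.sqrt(width * height)
    k = math.floor(k0 + math.log2(scale / canonical))
    return max(k_min, min(k_max, k))


for size in (32, 112, 224, 448):
    print(f"{size} x {size} proposal -> P{roi_to_level(size, size)}")
```

With these defaults a 224 x 224 proposal is pooled from P4 and a 112 x 112 proposal from P3, so halving the proposal's scale moves it one level toward higher resolution.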
FPN was designed as a generic feature extractor and was demonstrated on multiple detectors in the original paper and follow-up work. The table below summarizes the most influential systems that adopted it.
| System | Year | Authors | Role of FPN | Notes |
|---|---|---|---|---|
| Faster R-CNN with FPN | 2017 | Lin, Dollar, Girshick, He, Hariharan, Belongie | Backbone for both RPN and the second-stage head | Main system in the original FPN paper; substantial gains for small objects |
| Mask R-CNN | 2017 | He, Gkioxari, Dollar, Girshick | Default backbone for box, mask, and keypoint heads | Authors note FPN outperforms a single-scale C4 backbone |
| RetinaNet | 2017 | Lin, Goyal, Girshick, He, Dollar | FPN with levels P3 to P7 carries the dense classification and regression subnetworks | One-stage detector combined with focal loss; 39.1 AP on COCO test-dev with ResNet-101-FPN |
| Panoptic FPN | 2019 | Kirillov, Girshick, He, Dollar | Shared FPN backbone for instance and semantic segmentation branches | Extends Mask R-CNN to panoptic segmentation |
| FCOS | 2019 | Tian, Shen, Chen, He | FPN with levels P3 to P7 for anchor-free per-pixel classification | Each level handles a specific object-size range |
| ATSS | 2020 | Zhang, Chi, Yao, Lei, Li | FPN backbone with adaptive sample selection across pyramid levels | Bridges anchor-based and anchor-free detection |
| DetectoRS | 2020 | Qiao, Chen, Yuille | Recursive Feature Pyramid built on FPN | 55.7% box AP on COCO test-dev when combined with switchable atrous convolution |
The original paper reports careful ablations on the COCO detection benchmark. With a ResNet-50 backbone, replacing a single-scale Faster R-CNN baseline with the FPN-based variant raises the COCO-style Average Precision (AP) by 2.3 points and the PASCAL-style AP (AP at IoU 0.5) by 3.8 points. Small-object AP improves substantially because the high-resolution P2 and P3 levels now carry strong semantics. Used as the proposal generator (RPN), FPN raises Average Recall by 8.0 points over the single-scale ResNet baseline and by 12.9 points on small objects.
With ResNet-101 and the standard training schedule of the time, the single-model FPN-based Faster R-CNN reached 36.2% AP on COCO test-dev, beating the COCO 2016 challenge winners without the heavy ensembling those entries relied on. Inference time on a single NVIDIA M40 GPU was 0.148 seconds per image with ResNet-50, faster than the 0.32 seconds per image of the single-scale baseline. The paper attributes the difference to the lighter-weight second-stage head used with FPN, which more than offsets the small cost of the extra pyramid layers.
When FPN was combined with focal loss in RetinaNet, the ResNet-101-FPN variant reached 39.1% AP at 5 fps on COCO test-dev, surpassing all previously published one-stage and two-stage single-model results at the time of publication (ICCV 2017). Mask R-CNN with ResNet-101-FPN reached 35.7 mask AP on COCO instance segmentation in the original paper, again beating the 2016 challenge winner.
FPN is conceptually simple, and the years since 2017 have produced a long line of variants that change either the topology of the connections or the way features at different levels are fused. The table below collects the most important ones.
| Variant | Year / venue | Authors | Key change versus FPN |
|---|---|---|---|
| PANet (Path Aggregation Network) | 2018 / CVPR | Liu, Qi, Qin, Shi, Jia | Adds a bottom-up path on top of FPN, plus adaptive feature pooling; shortens the information path from low-level features to the head; won COCO 2017 Instance Segmentation |
| NAS-FPN | 2019 / CVPR | Ghiasi, Lin, Pang, Le | Uses neural architecture search to discover a feature-pyramid topology with both top-down and bottom-up connections; pushes RetinaNet to 48.3 AP at higher cost |
| BiFPN | 2020 / CVPR (EfficientDet) | Tan, Pang, Le | Bidirectional cross-scale connections with learnable weighted feature fusion; removes single-input nodes and stacks BiFPN modules; foundation of EfficientDet (up to 55.1 AP at 77M parameters) |
| Recursive Feature Pyramid (DetectoRS) | 2020 / CVPR 2021 | Qiao, Chen, Yuille | Adds feedback connections from FPN outputs back into the bottom-up backbone; combined with switchable atrous convolution reaches 55.7% box AP on COCO test-dev |
| Libra R-CNN BFP | 2019 / CVPR | Pang, Chen, Shi, Feng, Ouyang, Lin | Balanced Feature Pyramid; rescales and integrates all levels into a single balanced map before re-distributing |
| AC-FPN | 2020 / CVPR | Cao, Chen, Khosla et al. | Attention-guided context-aware FPN module |
| Adaptive Feature Pooling (PANet, 2018) | 2018 | Liu et al. | Pools each proposal from all pyramid levels rather than a single assigned level |
PANet is particularly notable because it was adopted as the neck of YOLOv4, and a similar PANet-style aggregation appears in YOLOv5 and YOLOv7. BiFPN sits at the heart of the EfficientDet family. NAS-FPN was important less for absolute performance than for showing that learned topologies could match or beat the hand-designed FPN.
The pyramid idea generalizes well beyond bounding-box detection. Panoptic FPN (Kirillov, Girshick, He, and Dollar, CVPR 2019) shares a single FPN backbone between an instance segmentation branch (the Mask R-CNN heads) and a semantic segmentation branch that fuses upsampled features from each pyramid level into a per-pixel prediction at 1/4 scale. This minimal extension provided a strong baseline for panoptic segmentation while costing very little memory or compute over Mask R-CNN.
FPN-style backbones are also widely used in semantic segmentation more broadly, in keypoint and human pose estimation (Mask R-CNN keypoint head, several top-down pose pipelines), and in dense regression tasks like depth estimation and optical flow. The HRNet family takes a different but related approach by maintaining multiple resolutions in parallel throughout the backbone instead of building the pyramid only at the end.
The original FPN configuration uses several specific design choices that have largely become defaults in subsequent work.
| Detail | Choice |
|---|---|
| Backbone stages used | C2, C3, C4, C5 (ResNet stages 2-5) |
| Pyramid output channels | 256 at every level |
| Lateral projection | 1x1 convolution to 256 channels |
| Top-down upsampling | Nearest-neighbor, factor 2 |
| Smoothing after merge | 3x3 convolution, 256 channels |
| Extra coarse level | P6 = stride-2 max-pool of P5 (Faster R-CNN); P6 = 3x3 stride-2 conv on C5 and P7 = ReLU then 3x3 stride-2 conv on P6 in RetinaNet |
| RPN anchor areas in pixels (Faster R-CNN with FPN) | 32², 64², 128², 256², 512² on {P2, P3, P4, P5, P6} |
| RPN anchor aspect ratios | 1:2, 1:1, 2:1 |
| Region-to-level assignment | Heuristic based on RoI area; small RoIs go to higher-resolution levels |
| Head weights across levels | Shared |
FPN is built into all major detection frameworks. Detectron2 ships an FPN module that wraps any backbone and exposes the standard {P2..P5} (and optionally P6, P7) interface. MMDetection includes implementations of FPN, PAFPN (PANet-style), BiFPN, NAS-FPN, and several other variants, all behind a common neck API. PyTorch's torchvision.models.detection package includes a ResNet50-FPN backbone for Faster R-CNN, Mask R-CNN, and RetinaNet.
Three properties make FPN effective in practice. First, every pyramid level inherits strong semantics through the top-down propagation, so the high-resolution levels needed to detect small objects are no longer semantically weak. Second, every level has spatial resolution appropriate to its responsible scale range, so there is no need to either downsample the input (slow) or detect from a deep, low-resolution map (poor for small objects). Third, the entire pyramid is produced by a single backbone forward pass, with only the lateral 1x1 convolutions, the upsampling operations, and the smoothing 3x3 convolutions added on top. The computational overhead over a plain backbone is small in comparison to the detection heads themselves.
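To give a rough sense of how little FPN adds on top of the backbone, the sketch below counts the parameters of the lateral and smoothing convolutions. It is a back-of-the-envelope estimate under stated assumptions: the helper `conv_params` is hypothetical, the C2 to C5 widths are the standard ResNet-50 bottleneck widths, every convolution is assumed to carry a bias, and parameter count is only a proxy for compute, not a FLOP measurement.

```python
def conv_params(c_in, c_out, kernel, bias=True):
    """Number of learnable parameters in a kernel x kernel convolution."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)


# Assumed ResNet-50 stage widths; the text only states 2048 for C5.
BACKBONE_CHANNELS = {"C2": 256, "C3": 512, "C4": 1024, "C5": 2048}
PYRAMID_CHANNELS = 256

# One 1x1 lateral projection per backbone stage.
lateral = sum(
    conv_params(c, PYRAMID_CHANNELS, 1) for c in BACKBONE_CHANNELS.values()
)
# One 3x3 smoothing convolution per pyramid level.
smoothing = len(BACKBONE_CHANNELS) * conv_params(
    PYRAMID_CHANNELS, PYRAMID_CHANNELS, 3
)

print(f"lateral 1x1 convolutions:   {lateral:,}")
print(f"smoothing 3x3 convolutions: {smoothing:,}")
print(f"total FPN parameters:       {lateral + smoothing:,}")
```

Under these assumptions the pyramid adds roughly 3.3 million parameters, and the upsampling itself adds none because nearest-neighbor interpolation has no weights.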
A fourth, more practical property is parameter sharing. Because every pyramid level has the same channel count, a single set of detection head weights can be applied at all levels. This is what allows RetinaNet to share its dense classification and regression subnetworks across P3 through P7, and what allows Faster R-CNN with FPN to share its RPN.
FPN is generic in the sense that it can be attached to almost any classification backbone with a few stages of downsampling. It is multi-scale by design, with each level specialized for an appropriate range of object sizes. It improves small-object detection meaningfully, which is precisely where pre-FPN single-scale detectors were weakest. It adds modest computational overhead, mostly the lateral and smoothing convolutions and the cheap upsampling operations. From 2017 onward it has been a standard component of essentially every competitive object detector built on convolutional backbones.
FPN keeps several feature maps at different resolutions, which raises memory consumption compared to a single-scale backbone. The fusion is a simple element-wise sum of an upsampled top-down feature and a 1x1-projected lateral feature; later work like BiFPN argued that learnable weighted fusion is meaningfully better. The connection topology is hand-designed, which NAS-FPN attempted to replace with a learned topology and found measurable gains. More recently, transformer-based detectors of the DETR family have shown that competitive detection is possible without an explicit top-down feature pyramid: the original DETR applies attention to a single low-resolution backbone feature map, and Deformable DETR attends directly over multi-scale backbone features instead of building an FPN. In practice these approaches now coexist, and many transformer-based detectors still use FPN-style multi-scale features.
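The difference between FPN's plain sum and a learnable weighted fusion can be illustrated on flattened feature values. The sketch below uses the "fast normalized fusion" form associated with BiFPN, in which each input gets a non-negative scalar weight and the weights are normalized by their sum; the function names, the per-input scalar weights, and the epsilon value are illustrative assumptions, and in a real network the weights would be learned rather than set by hand.

```python
def sum_fusion(features):
    """FPN-style fusion: element-wise sum of equally sized feature lists."""
    return [sum(values) for values in zip(*features)]


def fast_normalized_fusion(features, weights, eps=1e-4):
    """Weighted fusion with one non-negative scalar weight per input.

    Each weight is passed through ReLU, then divided by the sum of all
    weights plus a small eps so the result stays bounded.
    """
    positive = [max(0.0, w) for w in weights]
    total = sum(positive) + eps
    return [
        sum(w * v for w, v in zip(positive, values)) / total
        for values in zip(*features)
    ]


top_down = [1.0, 2.0]   # upsampled coarser-level feature (flattened)
lateral = [3.0, 4.0]    # 1x1-projected bottom-up feature (flattened)

print(sum_fusion([top_down, lateral]))
print(fast_normalized_fusion([top_down, lateral], weights=[1.0, 1.0]))
print(fast_normalized_fusion([top_down, lateral], weights=[1.0, 0.0]))
```

With equal weights the weighted form behaves like an average of the inputs, and driving one weight to zero lets the network ignore that input entirely, which the fixed sum cannot do.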
The original FPN paper has been cited tens of thousands of times and is one of the most-cited works in object detection from the 2010s. Most COCO leaderboard entries between 2017 and 2020 used FPN, a PANet-style variant of FPN, or a BiFPN. The pyramid design idea has spread well beyond detection, into segmentation, keypoint estimation, human-pose pipelines, and dense regression tasks. FPN remains the default neck for SSD-like, YOLO-like, and Faster R-CNN-like detectors, with PANet-style additions in many of the most recent variants.