Faster R-CNN
Last reviewed
Apr 30, 2026
Sources
20 citations
Review status
Source-backed
Revision
v1 · 4,086 words
Faster R-CNN is a two-stage object detection model introduced by Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun in the 2015 NeurIPS paper Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Its central contribution is the Region Proposal Network (RPN), a small fully convolutional sub-network that learns to propose candidate object boxes directly on a convolutional neural network feature map. By sharing convolutional features with the downstream detection head, the RPN turns region proposals into a near cost-free, GPU-resident step, replacing the slow CPU-bound Selective Search algorithm used by R-CNN and Fast R-CNN.
For about three years after its release Faster R-CNN and its direct descendants led the accuracy rankings on PASCAL VOC and Microsoft COCO. Even after one-stage detectors closed the speed gap, it remained the standard two-stage baseline in computer vision research. The original paper has been cited tens of thousands of times and the architecture is implemented in every major detection toolkit, including Detectron2, MMDetection, the TensorFlow Object Detection API, and torchvision. With minor changes it became Mask R-CNN; combined with a Feature Pyramid Network it became the standard configuration for two-stage detection through the late 2010s.
Faster R-CNN is the third paper in a tightly connected sequence by Ross Girshick and collaborators. Each entry shaved away one of the bottlenecks of the previous one. Understanding the family helps explain why Faster R-CNN is structured the way it is.
| Model | Year and venue | Region proposals | Feature extraction | Speed (test) | PASCAL VOC mAP | Key idea |
|---|---|---|---|---|---|---|
| R-CNN | CVPR 2014 (Girshick et al.) | Selective Search, ~2,000 per image, on CPU | Independent CNN forward pass per proposal | ~47 s/image with VGG-16 on a GPU | 53.3% on VOC 2012 (AlexNet); 66.0% on VOC 2007 (VGG-16) | Apply CNN features to Selective Search proposals; fine-tune from ImageNet pretraining. |
| SPPnet | ECCV 2014 (He et al.) | Selective Search | One CNN forward pass; spatial pyramid pooling per region | ~38x faster than R-CNN | Comparable to R-CNN | Share the CNN computation across all proposals via a spatial pyramid pool. |
| Fast R-CNN | ICCV 2015 (Girshick) | Selective Search (still external) | One CNN pass; RoI Pooling per proposal | ~213x faster than R-CNN at test time; ~0.3 s/image excluding the ~2 s proposal step | 70.0% on VOC 2007 with VGG-16 | End-to-end multi-task training; replace SVM with a softmax head; introduce RoI Pooling. |
| Faster R-CNN | NeurIPS 2015 (Ren et al.) | Region Proposal Network shares conv features | Same as Fast R-CNN | ~5 fps with VGG-16, ~17 fps with ZF on a K40 GPU | 73.2% on VOC 2007 with VGG-16 | Replace Selective Search with a learned RPN; share features. |
| Mask R-CNN | ICCV 2017 (He et al.) | RPN | Adds a parallel mask head, RoIAlign instead of RoI Pool | ~5 fps | n/a | Adds instance segmentation; fixes RoI misalignment with bilinear sampling. |
R-CNN (Regions with CNN features) by Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik appeared at CVPR 2014. The pipeline had four stages: (1) run Selective Search to produce roughly 2,000 category-agnostic region proposals; (2) warp each proposal to a fixed 227 by 227 pixel input; (3) push it through a CNN (originally AlexNet, later VGG-16) to extract a 4,096-dimensional feature vector; (4) classify each region with a per-class linear SVM and refine the box with a class-specific linear regressor. Non-maximum suppression deduplicated overlapping detections.
R-CNN delivered a large jump in accuracy (53.3% mAP on PASCAL VOC 2012) but it was painfully slow. With the deeper VGG-16 backbone, detection took roughly 47 seconds per image on a GPU because the CNN was run independently on every proposal. Training was a three-stage chore: fine-tune the CNN, train SVMs, then train bounding-box regressors. Storing the per-region feature vectors on disk for SVM training also took hundreds of gigabytes.
Ross Girshick's Fast R-CNN at ICCV 2015 fixed most of those problems with a single change: run the CNN once on the whole image, then crop region features from the shared feature map. The crop operation was called RoI Pooling: it took an arbitrary rectangular region on the feature map, divided it into a fixed grid (typically 7 by 7), and max-pooled the values in each cell to produce a fixed-size output. Region features were then fed through two fully connected layers and split into a softmax classifier (K+1 classes including background) and a per-class bounding-box regressor.
The whole network was trained end-to-end with a multi-task loss combining log-loss for classification and smooth-L1 loss for box regression. Fast R-CNN was 9x faster to train and 213x faster at test time than R-CNN with the same VGG-16 backbone, and it achieved higher mAP on PASCAL VOC 2007 and 2012. The remaining bottleneck was the proposal step: Selective Search ran on the CPU and took about 2 seconds per image, which dwarfed the 0.3 seconds the CNN now needed.
Ren, He, Girshick, and Sun's Faster R-CNN removed that last external step. Their insight was that the same convolutional feature map used by the detector could also drive a small Region Proposal Network trained to output candidate boxes. Because the RPN ran on the GPU and shared most of its computation with the detector, proposals became almost free: 10 ms per image instead of 2 seconds. The full system reached around 5 frames per second with the deep VGG-16 backbone on a single Nvidia K40 GPU, and 17 frames per second with the shallower ZF Net of Zeiler and Fergus. Accuracy went up at the same time: 73.2% mAP on PASCAL VOC 2007 test and 70.4% mAP on PASCAL VOC 2012 test, the state of the art at the time.
In 2017 Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick extended Faster R-CNN with a third output branch that predicted a binary segmentation mask for each region of interest. The new system, Mask R-CNN, also replaced RoI Pooling with RoIAlign, a bilinear-interpolation crop that avoided the integer rounding of RoI Pool. Mask R-CNN won the ICCV 2017 Best Paper Award and remained the de facto baseline for instance segmentation through the late 2010s.
A closely related follow-up was Tsung-Yi Lin et al.'s Feature Pyramid Network (FPN) at CVPR 2017. FPN added a top-down pathway with lateral connections to the ResNet backbone, producing a multi-scale feature pyramid that the RPN and detection heads could query at different levels. FPN became the default backbone for Faster R-CNN and Mask R-CNN, lifting Faster R-CNN with a ResNet-101 backbone from 34.9 to 36.2 AP on COCO test-dev.
A Faster R-CNN system has four main pieces: a backbone CNN, the Region Proposal Network, an RoI feature extractor, and a Fast R-CNN style detection head.
The backbone is a standard image classification CNN, pre-trained on ImageNet, with the final classification layers removed. The original paper experimented with two backbones: ZF (Zeiler and Fergus, 2013), a five-conv-layer network similar to AlexNet but with smaller filters, and VGG-16 (Simonyan and Zisserman, 2014), a much deeper 13-conv-layer network. Later implementations standardized on ResNet-50 and ResNet-101, often with FPN attached. The backbone outputs a feature map at a fixed stride (16 for VGG-16; 4, 8, 16, 32 across pyramid levels for ResNet-FPN). All subsequent operations work on this map rather than on raw pixels. Because the backbone and the RPN are convolutional and RoI pooling yields a fixed-size output for every region, the input image size can be arbitrary.
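The RPN is the heart of the contribution. It is a small fully convolutional network slid over the backbone's feature map with a 3 by 3 convolution, projecting each spatial location into a 256-d (ZF) or 512-d (VGG-16) intermediate vector. From that vector two sibling 1 by 1 convolutional layers predict, for each of the k reference boxes (anchors) at that location, two objectness scores (object versus not object, 2k outputs in total) and four box-regression offsets relative to the anchor (4k outputs in total).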
An anchor is a reference bounding box of a fixed scale and aspect ratio, centered at a given spatial location on the feature map. The original paper uses 3 scales {128 by 128, 256 by 256, 512 by 512} and 3 aspect ratios {1:1, 1:2, 2:1}, giving k = 9 anchors per location. With a typical input image of around 1000 by 600 pixels and stride 16, the RPN produces about 60 by 40 spatial locations and roughly 60 x 40 x 9 = 21,600 anchors before filtering.
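The anchor arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch, not the paper's code: the function names `anchor_shapes` and `generate_anchors` are made up for illustration, and placing anchor centers at the middle of each stride-16 cell is an assumption.

```python
import math

def anchor_shapes(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (width, height) for every scale/ratio pair.

    Each anchor keeps an area of scale**2; ratio is height / width.
    """
    shapes = []
    for scale in scales:
        area = float(scale * scale)
        for ratio in ratios:
            width = math.sqrt(area / ratio)
            height = width * ratio
            shapes.append((width, height))
    return shapes

def generate_anchors(feat_w, feat_h, stride=16):
    """Tile all anchor shapes over a feat_w x feat_h feature map.

    Boxes are (x1, y1, x2, y2) in image pixels. Centering anchors in the
    middle of each stride cell is an assumption of this sketch.
    """
    shapes = anchor_shapes()
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for width, height in shapes:
                anchors.append((cx - width / 2, cy - height / 2,
                                cx + width / 2, cy + height / 2))
    return anchors

anchors = generate_anchors(feat_w=60, feat_h=40)
print(len(anchor_shapes()), len(anchors))  # prints: 9 21600
```

Many of these boxes extend past the image border, which is why boundary handling matters during training and inference.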
At training time, an anchor is labeled positive if it has IoU greater than or equal to 0.7 with any ground-truth box, or if it has the highest IoU with a particular ground-truth (the second rule guarantees every ground truth has at least one positive anchor). It is labeled negative if its IoU with all ground-truth boxes is less than 0.3. Anchors with intermediate IoU are ignored. A mini-batch of 256 anchors per image is sampled with a roughly 1:1 ratio of positives to negatives (padded with negatives if not enough positives are available).
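The labeling rule can be sketched as follows. The helper names `iou` and `label_anchors` and the label encoding (1 positive, 0 negative, -1 ignored) are illustrative assumptions rather than anything fixed by the paper; mini-batch sampling is left out.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return one label per anchor: 1 positive, 0 negative, -1 ignored."""
    if not anchors:
        return []
    overlaps = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in overlaps:
        best = max(row) if row else 0.0
        if best >= pos_thresh:
            labels.append(1)
        elif best < neg_thresh:
            labels.append(0)
        else:
            labels.append(-1)
    # Second rule: the anchor with the highest IoU for each ground-truth
    # box is positive, even if that IoU is below pos_thresh.
    for g in range(len(gt_boxes)):
        column = [row[g] for row in overlaps]
        best_anchor = max(range(len(anchors)), key=lambda i: column[i])
        if column[best_anchor] > 0:
            labels[best_anchor] = 1
    return labels

gt = [(0, 0, 100, 100), (400, 400, 500, 500)]
anchors = [(0, 0, 100, 100),      # IoU 1.0 with the first box
           (0, 0, 100, 60),       # IoU 0.6: neither positive nor negative
           (200, 200, 300, 300),  # no overlap with any box
           (400, 400, 500, 450)]  # IoU 0.5, but best match for the second box
print(label_anchors(anchors, gt))  # prints: [1, -1, 0, 1]
```

The last anchor shows why the second rule exists: without it, the second ground-truth box would have no positive anchor at all.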
Top-scoring RPN proposals are passed to the second stage. The feature map is cropped to the region of each proposal and resized to a fixed-size feature (typically 7 by 7 with VGG-16, or 14 by 14 in the C4 head of ResNet variants) using RoI Pooling. RoI Pool divides the proposal into a fixed grid of cells, snaps cell boundaries to integer coordinates on the feature map, and max-pools within each cell. Mask R-CNN later replaced this with RoIAlign, which uses bilinear interpolation to read sub-pixel feature values without the rounding that distorts mask boundaries.
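A minimal sketch of RoI Pooling on a single-channel feature map illustrates the integer snapping described above. The function name `roi_pool`, the inclusive treatment of the region's far edge, and the use of nested lists instead of tensors are all assumptions made to keep the example self-contained; real implementations operate on batched multi-channel tensors.

```python
import math

def roi_pool(feature, roi, out_size=7):
    """Max-pool a region of a 2D feature map into out_size x out_size cells.

    feature: list of rows (one channel); roi: (x1, y1, x2, y2) in
    feature-map coordinates, with both corners treated as inclusive.
    """
    # Snap the region to integer feature-map coordinates.
    x1, y1, x2, y2 = (int(round(v)) for v in roi)
    roi_w = max(x2 - x1 + 1, 1)
    roi_h = max(y2 - y1 + 1, 1)
    rows, cols = len(feature), len(feature[0])
    pooled = []
    for py in range(out_size):
        # Cell boundaries are also rounded to integers (floor / ceil).
        y_start = y1 + math.floor(py * roi_h / out_size)
        y_end = y1 + math.ceil((py + 1) * roi_h / out_size)
        out_row = []
        for px in range(out_size):
            x_start = x1 + math.floor(px * roi_w / out_size)
            x_end = x1 + math.ceil((px + 1) * roi_w / out_size)
            cell = [feature[y][x]
                    for y in range(max(y_start, 0), min(y_end, rows))
                    for x in range(max(x_start, 0), min(x_end, cols))]
            out_row.append(max(cell) if cell else 0.0)
        pooled.append(out_row)
    return pooled

feature = [[y * 4 + x for x in range(4)] for y in range(4)]
print(roi_pool(feature, (0, 0, 3, 3), out_size=2))  # prints: [[5, 7], [13, 15]]
```

The two rounding steps are exactly what RoIAlign removes by sampling the feature map at fractional positions with bilinear interpolation.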
The pooled RoI feature is flattened and passed through two fully connected layers (4096-d each in VGG-16 Faster R-CNN), then split into two siblings: a softmax classifier over K+1 classes (the K target classes plus a background class) and a bounding-box regressor that produces 4 offsets per non-background class. This is exactly the Fast R-CNN head, unchanged.
The full loss has four terms summed across two stages. The RPN multi-task loss for one image is:
$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \, \frac{1}{N_{reg}} \sum_i p_i^* \, L_{reg}(t_i, t_i^*)$$
where i indexes anchors, p_i is the predicted objectness probability, p_i* is the binary ground-truth label, t_i is the predicted 4-vector of box parameters, and t_i* is the ground-truth box parameter for the matching anchor. L_cls is binary cross-entropy and L_reg is the smooth L1 (Huber) loss with the indicator p_i* zeroing out box loss for negative anchors. N_cls is the mini-batch size (256), N_reg is the number of anchor locations (about 2,400), and lambda balances the two terms (set to 10 in the paper, which makes the contributions roughly equal).
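The RPN loss can be written out directly from these definitions. The sketch below is plain Python over lists rather than a training-ready implementation; the function names and the small `eps` guard against `log(0)` are assumptions of the example.

```python
import math

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def rpn_loss(p, p_star, t, t_star, n_cls=256, n_reg=2400, lam=10.0):
    """RPN multi-task loss for one image.

    p: predicted objectness probabilities; p_star: 0/1 anchor labels;
    t, t_star: predicted and target 4-vectors of box parameters.
    """
    eps = 1e-12  # numerical guard, not part of the formula
    cls_sum = 0.0
    reg_sum = 0.0
    for p_i, label, t_i, target in zip(p, p_star, t, t_star):
        # Binary cross-entropy on objectness.
        cls_sum += -(label * math.log(p_i + eps)
                     + (1 - label) * math.log(1 - p_i + eps))
        # The label acts as the indicator: negatives add no box loss.
        reg_sum += label * sum(smooth_l1(a - b) for a, b in zip(t_i, target))
    return cls_sum / n_cls + lam * reg_sum / n_reg

loss = rpn_loss(p=[0.9, 0.2],
                p_star=[1, 0],
                t=[(0.1, 0.0, 0.2, 0.0), (3.0, 3.0, 3.0, 3.0)],
                t_star=[(0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)])
print(loss)
```

In the example call the second anchor is negative, so its large box error contributes nothing to the regression term.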
The detection head adds a second cross-entropy classification loss over K+1 categories and a smooth-L1 box loss for the predicted class. Concretely, the four loss components are: (1) RPN classification (binary cross-entropy on objectness), (2) RPN regression (smooth L1 on anchor deltas), (3) detection classification (multiclass cross-entropy), (4) detection regression (smooth L1 on per-class box deltas). Modern joint-training implementations sum all four into a single optimization objective.
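The "deltas" regressed by both stages are offsets relative to a reference box. The text above does not spell out the parameterization, so the sketch below assumes the form commonly used across the R-CNN family: center offsets normalized by the reference box size, and log-scale width and height ratios. The names `encode` and `decode` are illustrative.

```python
import math

def encode(box, anchor):
    """Turn a box into deltas relative to an anchor.

    Both boxes are (center_x, center_y, width, height).
    """
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha,
            math.log(w / wa), math.log(h / ha))

def decode(deltas, anchor):
    """Apply predicted deltas to an anchor to recover a box."""
    tx, ty, tw, th = deltas
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha,
            wa * math.exp(tw), ha * math.exp(th))

anchor = (100.0, 100.0, 128.0, 128.0)
box = (110.0, 90.0, 256.0, 64.0)
deltas = encode(box, anchor)
print(deltas)
print(decode(deltas, anchor))  # recovers the box up to floating-point rounding
```

Normalizing by the anchor size keeps regression targets in a similar numeric range for small and large anchors, which suits the smooth L1 loss.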
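The original paper trained the network with a four-step alternating schedule: (1) train the RPN, initialized from an ImageNet-pre-trained model, for the region proposal task; (2) train a separate Fast R-CNN detection network, also initialized from ImageNet, on the proposals produced by the step-1 RPN; (3) use the detector network to initialize RPN training, keeping the shared convolutional layers fixed and fine-tuning only the layers unique to the RPN; (4) keep the shared convolutional layers fixed and fine-tune only the layers unique to Fast R-CNN.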
The paper notes that joint optimization of all losses in one pass would be more elegant but harder to make stable; the alternating schedule converges reliably. Subsequent re-implementations (notably the official Python release and Detectron2) replaced this with approximate joint training, which sums all four loss terms and back-propagates through both heads in a single mini-batch. Joint training is simpler, slightly faster, and reaches similar accuracy.
Defaults from the original paper, widely re-used since:
| Hyperparameter | Value |
|---|---|
| Learning rate | 0.001 then 0.0001 (VOC) |
| Momentum / weight decay | 0.9 / 0.0005 |
| RPN mini-batch (anchors) | 256 per image, ~1:1 positive:negative |
| RoI mini-batch (proposals) | 128 per image, 25% positive |
| Anchor positive / negative IoU | >= 0.7 / < 0.3 |
| Loss balancing weight lambda (RPN) | 10 |
| NMS IoU threshold (RPN) | 0.7 |
| Proposals kept at training / inference | 2,000 / 300 |
At test time the input image is resized so its shorter side is 600 pixels (the standard PASCAL VOC setting; COCO usually uses 800). The backbone produces the feature map, and the RPN scores all anchors and applies its bounding-box regression. Anchors crossing image boundaries are removed during training but kept and clipped at inference. Non-maximum suppression with an IoU threshold of 0.7 reduces tens of thousands of anchors to the top-N proposals (typically N = 300 at inference). Each surviving proposal is then RoI-pooled, classified by the detection head, and refined by the per-class box regressor. A second per-class NMS (typical IoU threshold 0.3 to 0.5) removes overlapping detections.
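Greedy non-maximum suppression, used once on RPN proposals and again per class on final detections, can be sketched as below. The helper names and the plain-list representation are assumptions; toolkits use vectorized GPU kernels for the same procedure.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.7, top_n=300):
    """Return indices of kept boxes, highest score first.

    A box is dropped if it overlaps an already kept box by more than
    iou_thresh; at most top_n boxes are kept.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
            if len(keep) == top_n:
                break
    return keep

boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores, iou_thresh=0.7))  # prints: [0, 2]
```

The second box overlaps the first with IoU of about 0.82, so it is suppressed at the RPN threshold of 0.7; with a looser threshold of 0.9 all three would survive.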
With VGG-16 the original implementation reaches about 5 fps end-to-end on a single Nvidia K40 GPU; with ZF it reaches about 17 fps. Modern implementations on top of ResNet-50-FPN run at roughly 0.04 seconds per image (around 25 fps) on contemporary GPUs.
The headline numbers in the 2015 paper are mean average precision at IoU 0.5 on the 4,952-image PASCAL VOC 2007 test set and the VOC 2012 test set. The two backbones are ZF (5 conv layers, fast) and VGG-16 (13 conv layers, accurate). All results below use 300 RPN proposals per image.
| Backbone | Train data | VOC 2007 test mAP@0.5 | VOC 2012 test mAP@0.5 |
|---|---|---|---|
| ZF | VOC 2007 trainval | 59.9% | n/a |
| VGG-16 | VOC 2007 trainval | 69.9% | n/a |
| VGG-16 | VOC 2007 trainval + VOC 2012 trainval | 73.2% | 70.4% |
| VGG-16 | VOC 07 + VOC 12 + COCO | 78.8% | 75.9% |
For comparison, R-CNN with VGG-16 reached 66.0% on VOC 2007, and Fast R-CNN with VGG-16 reached 70.0%. Faster R-CNN therefore added 3.2 mAP over its immediate predecessor while running roughly an order of magnitude faster end-to-end.
Faster R-CNN was also one of the first detectors evaluated on the much harder Microsoft COCO benchmark. COCO uses a stricter metric: the primary number is mean average precision averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05, written AP or AP@[.5:.95]. The original paper and the follow-up ResNet paper (He et al., 2016) report the following on COCO test-dev:
| Backbone | COCO AP@[.5:.95] | COCO AP@0.5 |
|---|---|---|
| VGG-16 (Faster R-CNN, 2015 paper) | 21.9% | 42.7% |
| ResNet-101 (He et al. 2016, ILSVRC/COCO 2015 winner) | 34.9% | 55.7% |
The ResNet-101 system behind the single-model number above was the basis of the MSRA team's winning entry in the COCO 2015 detection challenge. With FPN added on top, Lin et al. (2017) reported:
| Backbone | COCO test-dev AP | AP@0.5 | APs / APm / APl |
|---|---|---|---|
| Faster R-CNN, ResNet-101-FPN | 36.2% | 59.1% | 18.2 / 39.0 / 48.2 |
Later Detectron2 implementations push this higher with longer training schedules and stronger backbones. Representative results from the Detectron2 model zoo with COCO instances train2017:
| Configuration | Schedule | COCO box AP | Inference (s/img) |
|---|---|---|---|
| Faster R-CNN R50-C4 | 1x | 35.7 | 0.102 |
| Faster R-CNN R50-DC5 | 1x | 37.3 | 0.068 |
| Faster R-CNN R50-FPN | 1x | 37.9 | 0.038 |
| Faster R-CNN R50-FPN | 3x | 40.2 | 0.038 |
| Faster R-CNN R101-FPN | 3x | 42.0 | 0.051 |
| Faster R-CNN X101-FPN | 3x | 43.0 | 0.098 |
For reference, the original Faster R-CNN VGG-16 number of 21.9 AP would today be considered a starting point rather than a competitive baseline.
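The COCO AP figures in the tables above are averages over ten IoU thresholds. A small sketch of that averaging step follows; the function names are illustrative and the per-threshold values in the example are made-up placeholders, not measured results.

```python
def coco_iou_thresholds():
    """IoU thresholds 0.50, 0.55, ..., 0.95 used for AP@[.5:.95]."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]

def coco_ap(ap_at_threshold):
    """Average per-threshold AP values (a dict keyed by IoU threshold)."""
    thresholds = coco_iou_thresholds()
    return sum(ap_at_threshold[t] for t in thresholds) / len(thresholds)

thresholds = coco_iou_thresholds()
# Hypothetical AP values that fall as the IoU requirement tightens.
example = {t: 1.0 - t for t in thresholds}
print(thresholds)
print(coco_ap(example))
```

Because stricter thresholds reward tight localization, AP@[.5:.95] is always at most AP@0.5 for the same detector, which is why the two columns in the COCO tables differ so widely.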
Faster R-CNN became the trunk of a large branch of subsequent work. Some of the most important variants are listed below.
| Method | Year | Change relative to Faster R-CNN | Headline COCO AP |
|---|---|---|---|
| R-FCN (Dai et al.) | NeurIPS 2016 | Position-sensitive score maps; no per-RoI fully connected layers | ~31.5 (ResNet-101) |
| Faster R-CNN + FPN | CVPR 2017 | Adds top-down feature pyramid for multi-scale | 36.2 (ResNet-101) |
| Mask R-CNN | ICCV 2017 | Adds mask head; RoIAlign instead of RoI Pool | 38.2 box / 35.7 mask (R-101-FPN) |
| Light-Head R-CNN (Li et al.) | 2017 | Replaces heavy two-fc head with a thin head | ~41.5 (ResNet-101) |
| Cascade R-CNN (Cai and Vasconcelos) | CVPR 2018 | Sequence of detection heads at increasing IoU thresholds | ~42.8 (ResNet-101-FPN) |
| HTC (Hybrid Task Cascade, Chen et al.) | CVPR 2019 | Cascaded detection + segmentation with semantic context | ~47.1 (ResNeXt-101-FPN) |
| Sparse R-CNN (Sun et al.) | CVPR 2021 | Replaces RPN with a fixed set of learned proposal boxes and queries | ~46.4 (ResNet-101-FPN, longer schedule) |
| Dynamic Head (Dai et al.) | CVPR 2021 | Adds attention-based dynamic head on top of FPN | 47.0+ (ResNet-101) |
Faster R-CNN's two-stage pipeline trades speed for accuracy. From around 2016 onward a parallel line of one-stage detectors traded a few mAP points for substantially higher frame rates: SSD (Liu et al., ECCV 2016), YOLO and YOLOv2/v3/v4/v5/v8 (Redmon and others), and RetinaNet (Lin et al., ICCV 2017). RetinaNet narrowed the accuracy gap with focal loss and reached 39.1 AP on COCO with ResNet-101-FPN, comparable to Faster R-CNN at similar settings. The YOLO family later overtook two-stage detectors on the speed-accuracy frontier for many practical workloads.
In 2020 Carion et al. at Facebook AI introduced DETR, the first end-to-end transformer detector. DETR replaces the RPN, anchors, and NMS with a fixed set of learned object queries and a Hungarian bipartite matching loss. Subsequent transformer detectors (Deformable DETR, DINO, DN-DETR, Mask DINO) improved training speed and accuracy and have largely become the new state of the art on COCO. Faster R-CNN remains common as a baseline and as a starting point for transfer learning, but cutting-edge research increasingly uses transformer detectors.
Reference implementations of Faster R-CNN are easy to find and well-tested.
| Toolkit | Maintainer | Notes |
|---|---|---|
| Original MATLAB code | Shaoqing Ren (github.com/ShaoqingRen/faster_rcnn) | The Caffe + MATLAB implementation accompanying the 2015 paper. |
| py-faster-rcnn | Ross Girshick (github.com/rbgirshick/py-faster-rcnn) | The widely used Python/Caffe port; the de facto reference implementation through 2017. |
| Detectron / Detectron2 | Facebook AI Research / Meta | Detectron2 is the modern PyTorch rewrite; provides Faster R-CNN with C4, DC5, and FPN heads on ResNet/ResNeXt backbones. |
| MMDetection | OpenMMLab | A modular PyTorch detection framework; ships dozens of Faster R-CNN configurations. |
| TensorFlow Object Detection API | Google | Provides Faster R-CNN with Inception, ResNet, and other backbones; widely used for transfer learning. |
| torchvision | PyTorch | The torchvision.models.detection.fasterrcnn_resnet50_fpn model gives a one-line ResNet-50-FPN Faster R-CNN. |
A typical transfer-learning workflow uses one of the above pretrained models, replaces the final classification and box-regression layers with new ones for the target classes, and fine-tunes on a smaller dataset. Faster R-CNN tends to do well in this regime even with relatively few labeled images per class, which is why it is still a popular baseline in industrial and academic projects on the COCO dataset, PASCAL VOC, and custom domains.
Strengths. Faster R-CNN delivers strong accuracy on the standard benchmarks of its era and remains competitive when paired with modern backbones. The two-stage pipeline cleanly separates where objects might be from what they are, which makes the network easier to debug and tune than dense one-stage detectors. The architecture is modular: swap the backbone, attach FPN, replace RoI Pool with RoIAlign, add a mask head, and the rest of the pipeline keeps working. Pretrained Faster R-CNN models transfer well to small custom datasets, which is why it remained a default starting point in many production CV pipelines well into the 2020s.
Limitations. The two-stage pipeline is inherently slower than dense one-stage detectors. The hand-engineered anchor design is sensitive: in domains with very small or very elongated objects the default anchors do poorly without retuning. RoI processing is per-region, which is hard to batch efficiently on the GPU and contributes to latency at high proposal counts. Because the RPN sees only one feature scale in the original formulation, very small objects are often missed; FPN largely resolves this but adds computation. The RPN also depends on a non-maximum suppression step that is non-differentiable and can suppress true positives in crowded scenes. End-to-end transformer detectors like DETR address several of these issues by removing anchors and NMS entirely.
Faster R-CNN has been cited well over 60,000 times on Google Scholar and is part of every survey of object detection published since 2016. Its specific contributions, the RPN, the anchor-and-deltas formulation, and the multi-task RPN + Fast R-CNN loss, all became standard vocabulary. The same architecture, extended with a mask head and RoIAlign, won the ICCV 2017 best paper award as Mask R-CNN; with FPN it formed the backbone of the COCO 2017 winning entries. The extended journal version of the 2015 paper appeared in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) in June 2017 and is the most commonly cited version today.