Faster R-CNN

Updated Jun 23, 2026

Faster R-CNN is a two-stage object detection model introduced by Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun in the 2015 NeurIPS paper Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Its central contribution is the Region Proposal Network (RPN), a small fully convolutional sub-network that learns to propose candidate object boxes directly on a convolutional neural network feature map, sharing convolutional features with the downstream detector so that region proposals become, in the authors' words, "nearly cost-free." ^[1] On the very deep VGG-16 backbone the full system runs at 5 frames per second on a single GPU while reaching state-of-the-art accuracy on PASCAL VOC 2007, VOC 2012, and Microsoft COCO with only 300 proposals per image. ^[1]

The RPN replaced the slow CPU-bound Selective Search algorithm used by R-CNN and Fast R-CNN, turning proposal generation into a GPU-resident step that takes about 10 milliseconds per image instead of roughly 2 seconds. ^[1] With VGG-16 the detector reaches 73.2% mean average precision (mAP) on PASCAL VOC 2007 test and 70.4% on VOC 2012 test, the best published results at the time. ^[1] The authors frame the RPN as an attention mechanism: "the RPN component tells the unified network where to look." ^[1] In the ILSVRC and COCO 2015 competitions, Faster R-CNN and the RPN were the foundations of the first-place winning entries in several tracks. ^[1]

For about three years after its release, Faster R-CNN was the accuracy leader on PASCAL VOC and Microsoft COCO. Even after one-stage detectors narrowed the accuracy gap while running faster, it remained the standard two-stage baseline in computer vision research. The original paper has been cited tens of thousands of times, and the architecture is implemented in every major detection toolkit, including Detectron2, MMDetection, the TensorFlow Object Detection API, and torchvision. With minor changes it became Mask R-CNN; paired with a Feature Pyramid Network backbone, it became the standard two-stage detector through the late 2010s.

What is the R-CNN family lineage?

Faster R-CNN is the third paper in a tightly connected sequence by Ross Girshick and collaborators. Each entry shaved away one of the bottlenecks of the previous one. Understanding the family helps explain why Faster R-CNN is structured the way it is.

Model	Year and venue	Region proposals	Feature extraction	Speed (test)	PASCAL VOC mAP	Key idea
R-CNN	CVPR 2014 (Girshick et al.)	Selective Search, ~2,000 per image, on CPU	Independent CNN forward pass per proposal	~47 s/image with VGG-16 on a GPU	53.3% on VOC 2012 (AlexNet); 66.0% on VOC 2007 (VGG-16)	Apply CNN features to selective search proposals; fine-tune from ImageNet pretraining.
SPPnet	ECCV 2014 (He et al.)	Selective Search	One CNN forward pass; spatial pyramid pooling per region	~38x faster than R-CNN	Comparable to R-CNN	Share the CNN computation across all proposals via a spatial pyramid pool.
Fast R-CNN	ICCV 2015 (Girshick)	Selective Search (still external)	One CNN pass; RoI Pooling per proposal	~213x faster than R-CNN at test time; ~0.3 s/image excluding proposals, ~2 s/image including Selective Search	70.0% on VOC 2007 with VGG-16	End-to-end multi-task training; replace SVM with a softmax head; introduce RoI Pooling.
Faster R-CNN	NeurIPS 2015 (Ren et al.)	Region Proposal Network shares conv features	Same as Fast R-CNN	~5 fps with VGG-16, ~17 fps with ZF on a K40 GPU	73.2% on VOC 2007 with VGG-16	Replace Selective Search with a learned RPN; share features.
Mask R-CNN	ICCV 2017 (He et al.)	RPN	Adds a parallel mask head, RoIAlign instead of RoI Pool	~5 fps	n/a	Adds instance segmentation; fixes RoI misalignment with bilinear sampling.

R-CNN (2014)

R-CNN (Regions with CNN features) by Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik appeared at CVPR 2014. ^[3] The pipeline had four stages: (1) run Selective Search to produce roughly 2,000 category-agnostic region proposals; (2) warp each proposal to a fixed 227 by 227 pixel input; (3) push it through a CNN (originally AlexNet, later VGG-16) to extract a 4,096-dimensional feature vector; (4) classify each region with a per-class linear SVM and refine the box with a class-specific linear regressor. Non-maximum suppression deduplicated overlapping detections.

R-CNN delivered a large jump in accuracy (53.3% mAP on PASCAL VOC 2012) but it was painfully slow. ^[3] With the deeper VGG-16 backbone, detection took roughly 47 seconds per image on a GPU because the CNN was run independently on every proposal. Training was a three-stage chore: fine-tune the CNN, train SVMs, then train bounding-box regressors. Storing the per-region feature vectors on disk for SVM training also took hundreds of gigabytes.

Fast R-CNN (2015)

Ross Girshick's Fast R-CNN at ICCV 2015 fixed most of those problems with a single change: run the CNN once on the whole image, then crop region features from the shared feature map. ^[4] The crop operation was called RoI Pooling: it took an arbitrary rectangular region on the feature map, divided it into a fixed grid (typically 7 by 7), and max-pooled the values in each cell to produce a fixed-size output. Region features were then fed through two fully connected layers and split into a softmax classifier (K+1 classes including background) and a per-class bounding-box regressor.
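The RoI Pooling step described above can be sketched in a few lines of plain Python. This is a minimal illustration on a single-channel feature map, assuming the RoI is already expressed in integer feature-map coordinates; the function name `roi_max_pool` and the floor/ceil rule for cell boundaries are illustrative assumptions, not the reference implementation.

```python
import math

def roi_max_pool(feature, roi, out_h=2, out_w=2):
    """Max-pool a rectangular region of a 2-D feature map into out_h x out_w cells.

    feature: list of rows (one channel).
    roi: (x1, y1, x2, y2), inclusive, in feature-map coordinates.
    """
    x1, y1, x2, y2 = roi
    roi_h = y2 - y1 + 1
    roi_w = x2 - x1 + 1
    pooled = []
    for i in range(out_h):
        # Cell boundaries are snapped to integers (assumed floor/ceil rule).
        y_start = y1 + math.floor(i * roi_h / out_h)
        y_end = y1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            x_start = x1 + math.floor(j * roi_w / out_w)
            x_end = x1 + math.ceil((j + 1) * roi_w / out_w)
            cell = [feature[y][x]
                    for y in range(y_start, y_end)
                    for x in range(x_start, x_end)]
            row.append(max(cell))  # max-pool the values inside this cell
        pooled.append(row)
    return pooled

feature_map = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
]
# A 4 x 4 region (columns 1..4, rows 0..3) pooled to a fixed 2 x 2 output.
pooled = roi_max_pool(feature_map, roi=(1, 0, 4, 3), out_h=2, out_w=2)
print(pooled)  # [[8, 10], [18, 20]]
```

Whatever the size of the input region, the output grid has the same shape, which is what lets the fully connected layers that follow accept every proposal.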

The whole network was trained end-to-end with a multi-task loss combining log-loss for classification and smooth-L1 loss for box regression. Fast R-CNN was 9x faster to train and 213x faster at test time than R-CNN with the same VGG-16 backbone, and it achieved higher mAP on PASCAL VOC 2007 and 2012. ^[4] The remaining bottleneck was the proposal step: Selective Search ran on the CPU and took about 2 seconds per image, which dwarfed the 0.3 seconds the CNN now needed. ^[1]

Faster R-CNN (2015)

Ren, He, Girshick, and Sun's Faster R-CNN removed that last external step. Their insight was that the same convolutional feature map used by the detector could also drive a small Region Proposal Network trained to output candidate boxes. Because the RPN ran on the GPU and shared most of its computation with the detector, proposals became almost free: about 10 ms per image instead of 2 seconds. ^[1] The full system reached around 5 frames per second with the deep VGG-16 backbone on a single Nvidia K40 GPU, and 17 frames per second with the shallower ZF Net of Zeiler and Fergus. ^[1] Accuracy went up at the same time: 73.2% mAP on PASCAL VOC 2007 test and 70.4% mAP on PASCAL VOC 2012 test, the state of the art at the time. ^[1]

Mask R-CNN and beyond

In 2017 Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick extended Faster R-CNN with a third output branch that predicted a binary segmentation mask for each region of interest. The new system, Mask R-CNN, also replaced RoI Pooling with RoIAlign, a bilinear-interpolation crop that avoided the integer rounding of RoI Pool. ^[6] Mask R-CNN won the ICCV 2017 Best Paper Award and remained the de facto baseline for instance segmentation through the late 2010s. ^[6]

A closely related follow-up was Tsung-Yi Lin et al.'s Feature Pyramid Network (FPN) at CVPR 2017. ^[7] FPN added a top-down pathway with lateral connections to the ResNet backbone, producing a multi-scale feature pyramid that the RPN and detection heads could query at different levels. FPN became the default backbone for Faster R-CNN and Mask R-CNN, lifting Faster R-CNN with a ResNet-101 backbone from roughly 34 to 36+ mAP on COCO. ^[7]

How is Faster R-CNN structured?

A Faster R-CNN system has four main pieces: a backbone CNN, the Region Proposal Network, an RoI feature extractor, and a Fast R-CNN style detection head.

Backbone

The backbone is a standard image classification CNN, pre-trained on ImageNet, with the final classification layers removed. The original paper experimented with two backbones: ZF (Zeiler and Fergus, 2013), a five-conv-layer network similar to AlexNet but with smaller filters, and VGG-16 (Simonyan and Zisserman, 2014), a much deeper 13-conv-layer network. ^[1] Later implementations standardized on ResNet-50 and ResNet-101, often with FPN attached. The backbone outputs a feature map at a fixed stride (16 for VGG-16; 4, 8, 16, 32 across pyramid levels for ResNet-FPN). All subsequent operations work on this map rather than on raw pixels, so the input image size can be arbitrary.

How does the Region Proposal Network (RPN) work?

The RPN is the heart of the contribution. It is a small fully convolutional network slid over the backbone's feature map with a 3 by 3 convolution, projecting each spatial location into a 256-d (ZF) or 512-d (VGG-16) intermediate vector. ^[1] From that vector two 1 by 1 convolutional sibling layers predict, at each location:

Objectness scores for k anchor boxes: a 2-class softmax (object vs background), so 2k outputs per location.
Box regression deltas for the same k anchor boxes: 4k outputs per location, encoding refined (x, y, w, h) relative to each anchor.
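The box deltas can be illustrated with the parameterization used in the paper: center offsets normalized by the anchor size, and log-scaled width and height ratios. The sketch below is a minimal illustration; the helper names `encode` and `decode` and the example boxes are hypothetical.

```python
import math

def encode(box, anchor):
    """Return (t_x, t_y, t_w, t_h) for a box relative to an anchor.

    Both boxes are (center_x, center_y, width, height).
    """
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode(deltas, anchor):
    """Invert encode(): apply predicted deltas to an anchor."""
    tx, ty, tw, th = deltas
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))

k = 9                              # anchors per location
outputs_per_location = (2 * k, 4 * k)  # (objectness scores, box deltas)

anchor = (100.0, 100.0, 128.0, 128.0)  # hypothetical anchor
gt_box = (116.0, 92.0, 160.0, 96.0)    # hypothetical ground-truth box
deltas = encode(gt_box, anchor)
recovered = decode(deltas, anchor)
```

Here `deltas` starts with 0.125 and -0.0625 (a shift of 16 and -8 pixels relative to a 128-pixel anchor), and `decode` recovers the original box up to floating-point error.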

An anchor is a reference bounding box of a fixed scale and aspect ratio, centered at a given spatial location on the feature map. The original paper uses 3 scales {128 by 128, 256 by 256, 512 by 512} and 3 aspect ratios {1:1, 1:2, 2:1}, giving k = 9 anchors per location. ^[1] With a typical input image of around 1000 by 600 pixels and stride 16, the RPN produces about 60 by 40 spatial locations and roughly 60 x 40 x 9 = 21,600 anchors before filtering. ^[1]
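The anchor arithmetic above can be reproduced with a short sketch. It assumes each scale denotes the square root of the anchor area and that the aspect ratio is height divided by width; the function name `make_anchors` is hypothetical, and floor division by the stride is only a rough approximation of the real feature-map size.

```python
import math

def make_anchors(scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return (width, height) for every scale/ratio pair.

    ratio = height / width (assumed convention); area is kept at scale ** 2.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((w, h))
    return anchors

anchors = make_anchors()
k = len(anchors)  # 9 anchors per location

stride = 16
feat_w, feat_h = 1000 // stride, 600 // stride  # 62 x 37 by floor division
total_floor = feat_w * feat_h * k               # 20,646 anchors
total_rounded = 60 * 40 * k                     # 21,600, the rounded figure in the text
```

Both counts land near the "roughly 20,000 anchors" order of magnitude, which is why the later filtering and NMS steps matter.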

At training time, an anchor is labeled positive if it has IoU greater than or equal to 0.7 with any ground-truth box, or if it has the highest IoU with a particular ground-truth (the second rule guarantees every ground truth has at least one positive anchor). It is labeled negative if its IoU with all ground-truth boxes is less than 0.3. Anchors with intermediate IoU are ignored. A mini-batch of 256 anchors per image is sampled with a roughly 1:1 ratio of positives to negatives (padded with negatives if not enough positives are available). ^[1]
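The labeling rules above can be sketched as follows. This is a simplified illustration rather than the paper's code: boxes are `(x1, y1, x2, y2)` tuples, the function names `iou` and `label_anchors` are hypothetical, and ties for the highest IoU go to the first anchor.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Return one label per anchor: 1 = positive, 0 = negative, -1 = ignored."""
    ious = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in ious:
        best = max(row)
        if best >= pos_thr:
            labels.append(1)
        elif best < neg_thr:
            labels.append(0)
        else:
            labels.append(-1)
    # Second rule: the highest-IoU anchor for each ground truth is positive.
    for g in range(len(gt_boxes)):
        best_anchor = max(range(len(anchors)), key=lambda i: ious[i][g])
        if ious[best_anchor][g] > 0:
            labels[best_anchor] = 1
    return labels

gt = [(0, 0, 100, 100), (300, 300, 340, 340)]
anchors = [
    (0, 0, 100, 110),      # IoU ~0.91 with gt[0]: positive by threshold
    (50, 0, 150, 100),     # IoU ~0.33 with gt[0]: ignored
    (500, 500, 600, 600),  # no overlap: negative
    (290, 290, 350, 350),  # IoU ~0.44 with gt[1]: positive by the second rule
]
labels = label_anchors(anchors, gt)
print(labels)  # [1, -1, 0, 1]
```

The last anchor shows why the second rule exists: without it, the small ground-truth box would have no positive anchor at all.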

RoI feature extraction

Top-scoring RPN proposals are passed to the second stage. The feature map is cropped to the region of each proposal and resized to a fixed-size feature (typically 7 by 7 with VGG-16, or 14 by 14 in the C4 head of ResNet variants) using RoI Pooling. RoI Pool divides the proposal into a fixed grid of cells, snaps cell boundaries to integer coordinates on the feature map, and max-pools within each cell. Mask R-CNN later replaced this with RoIAlign, which uses bilinear interpolation to read sub-pixel feature values without the rounding that distorts mask boundaries. ^[6]
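The difference between integer snapping and bilinear sampling can be seen on a tiny feature map. The sketch below is a minimal illustration of a single sample point, not a full RoIAlign; the function name `bilinear_sample` is hypothetical, and it assumes the sample point lies inside the map.

```python
import math

def bilinear_sample(feature, x, y):
    """Read a single-channel feature map at a fractional (x, y) location.

    Assumes 0 <= x <= width - 1 and 0 <= y <= height - 1.
    """
    x0, y0 = math.floor(x), math.floor(y)
    x1 = min(x0 + 1, len(feature[0]) - 1)
    y1 = min(y0 + 1, len(feature) - 1)
    dx, dy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

feature_map = [
    [0.0, 10.0],
    [20.0, 30.0],
]
aligned = bilinear_sample(feature_map, 0.5, 0.5)         # blends all four neighbours
snapped = feature_map[math.floor(0.5)][math.floor(0.5)]  # integer snapping reads one cell
print(aligned, snapped)  # 15.0 0.0
```

Snapping the same location to integer coordinates discards the contribution of three of the four neighbouring cells, which is the misalignment RoIAlign was designed to avoid.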

Detection head

The pooled RoI feature is flattened and passed through two fully connected layers (4096-d each in VGG-16 Faster R-CNN), then split into two siblings: a softmax classifier over K+1 classes (the K target classes plus a background class) and a bounding-box regressor that produces 4 offsets per non-background class. This is exactly the Fast R-CNN head, unchanged. ^[4]

How is Faster R-CNN trained?

Loss function

The full loss has four terms summed across two stages. The RPN multi-task loss for one image is:

L({p_i}, {t_i}) = (1 / N_cls) sum_i L_cls(p_i, p_i*) + lambda * (1 / N_reg) sum_i p_i* * L_reg(t_i, t_i*)

where i indexes anchors, p_i is the predicted objectness probability, p_i* is the binary ground-truth label, t_i is the predicted 4-vector of box parameters, and t_i* is the ground-truth box parameter for the matching anchor. L_cls is binary cross-entropy and L_reg is the smooth L1 (Huber) loss with the indicator p_i* zeroing out box loss for negative anchors. N_cls is the mini-batch size (256), N_reg is the number of anchor locations (about 2,400), and lambda balances the two terms (set to 10 in the paper, which makes the contributions roughly equal). ^[1]
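The RPN loss can be evaluated by hand on a toy mini-batch. The sketch below follows the formula term by term, using the normalizers and lambda quoted above as defaults; the function names and the two example anchors are hypothetical, and the smooth L1 transition point of 1 is the usual Fast R-CNN definition.

```python
import math

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def rpn_loss(p, p_star, t, t_star, n_cls=256, n_reg=2400, lam=10.0):
    """RPN multi-task loss for one image.

    p: predicted objectness probabilities; p_star: 0/1 labels;
    t, t_star: predicted and target 4-vectors of box parameters.
    """
    cls_sum = 0.0
    reg_sum = 0.0
    for p_i, label, t_i, ts_i in zip(p, p_star, t, t_star):
        # Log loss over object vs. background.
        cls_sum += -math.log(p_i) if label == 1 else -math.log(1.0 - p_i)
        # The label acts as the indicator: negatives add no box loss.
        reg_sum += label * sum(smooth_l1(a - b) for a, b in zip(t_i, ts_i))
    return cls_sum / n_cls + lam * reg_sum / n_reg

# Two toy anchors: one positive, one negative.
p = [0.5, 0.5]
p_star = [1, 0]
t = [(0.5, 0.0, 0.0, 2.0), (9.0, 9.0, 9.0, 9.0)]
t_star = [(0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)]
loss = rpn_loss(p, p_star, t, t_star)
```

With these defaults the classification term is weighted by 1/256 and the regression term by 10/2400 = 1/240, which is the sense in which lambda = 10 makes the two contributions roughly equal.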

The detection head adds a second cross-entropy classification loss over K+1 categories and a smooth-L1 box loss for the predicted class. Concretely, the four loss components are: (1) RPN classification (binary cross-entropy on objectness), (2) RPN regression (smooth L1 on anchor deltas), (3) detection classification (multiclass cross-entropy), (4) detection regression (smooth L1 on per-class box deltas). Modern joint-training implementations sum all four into a single optimization objective.

4-step alternating training

The original paper trained the network in four sequential stages: ^[1]

1. Initialize the backbone from ImageNet and train the RPN end-to-end.
2. Use the RPN's proposals to train a separate Fast R-CNN detector, also initialized from ImageNet. This detector does not yet share features with the RPN.
3. Initialize a fresh RPN from the detector's backbone, freeze the shared convolutional layers, and fine-tune only the RPN-specific layers. After this step the two networks share the backbone.
4. Keep the shared backbone frozen and fine-tune only the Fast R-CNN-specific layers using the RPN proposals from step 3.

The paper notes that joint optimization of all losses in one pass would be more elegant but harder to make stable; the alternating schedule converges reliably. ^[1] Subsequent re-implementations (notably the official Python release and Detectron2) replaced this with approximate joint training, which sums all four loss terms and back-propagates through both heads in a single mini-batch. Joint training is simpler, slightly faster, and reaches similar accuracy. ^[14]

Hyperparameters

Defaults from the original paper, widely re-used since: ^[1]

Hyperparameter	Value
Learning rate	0.001 then 0.0001 (VOC)
Momentum / weight decay	0.9 / 0.0005
RPN mini-batch (anchors)	256 per image, ~1:1 positive:negative
RoI mini-batch (proposals)	128 per image, 25% positive
Anchor positive / negative IoU	>= 0.7 / < 0.3
RPN loss balancing weight (lambda)	10
NMS IoU threshold (RPN)	0.7
Proposals kept at training / inference	2,000 / 300
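The RPN mini-batch row in the table can be made concrete with a small sampling sketch. It assumes anchors have already been labeled 1 (positive), 0 (negative), or -1 (ignored); the function name `sample_rpn_minibatch` and the fixed seed are illustrative assumptions.

```python
import random

def sample_rpn_minibatch(labels, batch_size=256, pos_fraction=0.5, seed=0):
    """Pick anchor indices for one image: up to half positives, rest negatives."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(labels) if label == 1]
    neg = [i for i, label in enumerate(labels) if label == 0]
    n_pos = min(len(pos), int(batch_size * pos_fraction))
    # Pad with negatives when there are too few positives.
    n_neg = min(len(neg), batch_size - n_pos)
    return rng.sample(pos, n_pos), rng.sample(neg, n_neg)

# Toy labels: 20 positives, 1,000 negatives, 50 ignored anchors.
labels = [1] * 20 + [0] * 1000 + [-1] * 50
pos_idx, neg_idx = sample_rpn_minibatch(labels)
print(len(pos_idx), len(neg_idx))  # 20 236
```

Because this toy image has only 20 positive anchors, the batch is filled with 236 negatives, so the 1:1 ratio is an upper bound rather than a guarantee.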

How fast is Faster R-CNN at inference?

At test time the input image is resized so its shorter side is 600 pixels (the standard PASCAL VOC setting; COCO usually uses 800). The backbone produces the feature map, and the RPN scores all anchors and applies its bounding-box regression. Anchors crossing image boundaries are removed during training but kept and clipped at inference. Non-maximum suppression with an IoU threshold of 0.7 reduces tens of thousands of anchors to the top-N proposals (typically N = 300 at inference). ^[1] Each surviving proposal is then RoI-pooled, classified by the detection head, and refined by the per-class box regressor. A second per-class NMS (typical IoU threshold 0.3 to 0.5) removes overlapping detections.
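The proposal filtering step can be sketched as greedy non-maximum suppression. This is a plain-Python illustration rather than the GPU implementation used in practice; the function names and example boxes are hypothetical, and boxes are `(x1, y1, x2, y2)` tuples.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thr=0.7, top_n=300):
    """Greedy NMS: keep the best box, drop boxes overlapping it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
            if len(keep) == top_n:  # stop once top-N proposals are collected
                break
    return keep

boxes = [
    (0, 0, 100, 100),
    (5, 5, 105, 105),      # IoU ~0.82 with the first box: suppressed
    (200, 200, 300, 300),  # far away: kept
    (0, 0, 100, 50),       # IoU 0.5 with the first box: kept at threshold 0.7
]
scores = [0.9, 0.8, 0.7, 0.6]
kept = nms(boxes, scores)
print(kept)  # [0, 2, 3]
```

The same routine with a lower threshold corresponds to the second, per-class NMS applied to the final detections.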

With VGG-16 the original implementation reaches about 5 fps end-to-end on a single Nvidia K40 GPU; with ZF it reaches about 17 fps. ^[1] Modern implementations on top of ResNet-50-FPN run at roughly 0.04 seconds per image (around 25 fps) on contemporary GPUs. ^[14]

What benchmark results did Faster R-CNN achieve?

PASCAL VOC

The headline numbers in the 2015 paper are mean average precision at IoU 0.5 on the 4,952-image PASCAL VOC 2007 test set and the VOC 2012 test set. ^[13] The two backbones are ZF (5 conv layers, fast) and VGG-16 (13 conv layers, accurate). All results below use 300 RPN proposals per image. ^[1]

Backbone	Train data	VOC 2007 test mAP@0.5	VOC 2012 test mAP@0.5
ZF	VOC 2007 trainval	59.9%	n/a
VGG-16	VOC 2007 trainval	69.9%	n/a
VGG-16	VOC 2007 trainval + VOC 2012 trainval	73.2%	70.4%
VGG-16	VOC 07 + VOC 12 + COCO	78.8%	75.9%

For comparison, R-CNN with VGG-16 reached 66.0% on VOC 2007, and Fast R-CNN with VGG-16 reached 70.0%. Faster R-CNN therefore added 3.2 mAP over its immediate predecessor while running roughly an order of magnitude faster end-to-end. ^[1]

Microsoft COCO

Faster R-CNN was also one of the first detectors evaluated on the much harder Microsoft COCO benchmark. ^[12] COCO uses a stricter metric: the primary number is mean average precision averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05, written AP or AP@[.5:.95]. The original 2015 paper and the follow-up ResNet work by He et al. report the following on COCO test-dev: ^[1]

Backbone	COCO AP@[.5:.95]	COCO AP@0.5
VGG-16 (Faster R-CNN, 2015 paper)	21.9%	42.7%
ResNet-101 (He et al. 2016, ILSVRC/COCO 2015 winner)	34.9%	55.7%

The MSRA team's ResNet-101 Faster R-CNN system won the COCO 2015 detection challenge; the 34.9% figure above is its single-model result. With FPN added on top, Lin et al. (2017) reported: ^[7]

Backbone	COCO test-dev AP	AP@0.5	APs / APm / APl
Faster R-CNN, ResNet-101-FPN	36.2%	59.1%	18.2 / 39.0 / 48.2

Later Detectron2 implementations push this higher with longer training schedules and stronger backbones. Representative results from the Detectron2 model zoo with COCO instances train2017: ^[14]

Configuration	Schedule	COCO box AP	Inference (s/img)
Faster R-CNN R50-C4	1x	35.7	0.102
Faster R-CNN R50-DC5	1x	37.3	0.068
Faster R-CNN R50-FPN	1x	37.9	0.038
Faster R-CNN R50-FPN	3x	40.2	0.038
Faster R-CNN R101-FPN	3x	42.0	0.051
Faster R-CNN X101-FPN	3x	43.0	0.098

For reference, the original Faster R-CNN VGG-16 number of 21.9 AP would today be considered a starting point rather than a competitive baseline.
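The COCO AP metric used in the tables above averages per-threshold AP over ten IoU thresholds. The sketch below shows only that final averaging step; the per-threshold values are made-up placeholders for illustration, not results from any detector, and the function name `coco_ap` is hypothetical.

```python
# IoU thresholds 0.50, 0.55, ..., 0.95
thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def coco_ap(ap_at_threshold):
    """Average AP over the ten COCO IoU thresholds."""
    return sum(ap_at_threshold[t] for t in thresholds) / len(thresholds)

# Placeholder values only: AP falls linearly as the IoU threshold tightens.
toy_ap = {t: 1.0 - t for t in thresholds}
ap = coco_ap(toy_ap)      # mean over all ten thresholds
ap50 = toy_ap[0.5]        # the single-threshold number, analogous to AP@0.5
```

Because stricter thresholds pull the average down, AP@[.5:.95] is always at or below AP@0.5 for the same detector, which is why the two columns in the COCO tables differ so much.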

What are the main variants and successors?

Faster R-CNN became the trunk of a large branch of subsequent work. Some of the most important variants are listed below.

Method	Venue and year	Change relative to Faster R-CNN	Headline COCO AP
R-FCN (Dai et al.)	NeurIPS 2016	Position-sensitive score maps; no per-RoI fully connected layers	~31.5 (ResNet-101)
Faster R-CNN + FPN	CVPR 2017	Adds top-down feature pyramid for multi-scale	36.2 (ResNet-101)
Mask R-CNN	ICCV 2017	Adds mask head; RoIAlign instead of RoI Pool	38.2 box / 35.7 mask (R-101-FPN)
Light-Head R-CNN (Li et al.)	2017	Replaces heavy two-fc head with a thin head	~41.5 (ResNet-101)
Cascade R-CNN (Cai and Vasconcelos)	CVPR 2018	Sequence of detection heads at increasing IoU thresholds	~42.8 (ResNet-101-FPN)
HTC (Hybrid Task Cascade, Chen et al.)	CVPR 2019	Cascaded detection + segmentation with semantic context	~47.1 (ResNeXt-101-FPN)
Sparse R-CNN (Sun et al.)	CVPR 2021	Replaces RPN with a fixed set of learned proposal boxes and queries	~46.4 (ResNet-101-FPN, longer schedule)
Dynamic Head (Dai et al.)	CVPR 2021	Adds attention-based dynamic head on top of FPN	47.0+ (ResNet-101)

How does Faster R-CNN compare to one-stage detectors?

Faster R-CNN's two-stage pipeline trades speed for accuracy. From around 2016 onward a parallel line of one-stage detectors traded a few mAP points for substantially higher frame rates: SSD (Liu et al., ECCV 2016), YOLO and YOLOv2/v3/v4/v5/v8 (Redmon and others), and RetinaNet (Lin et al., ICCV 2017). RetinaNet narrowed the accuracy gap with focal loss and reached 39.1 AP on COCO with ResNet-101-FPN, comparable to Faster R-CNN at similar settings. ^[9] The YOLO family later overtook two-stage detectors on the speed-accuracy frontier for many practical workloads.

Transformer-based detectors

In 2020 Carion et al. at Facebook AI introduced DETR, the first end-to-end transformer detector. ^[11] DETR replaces the RPN, anchors, and NMS with a fixed set of learned object queries and a Hungarian bipartite matching loss. Subsequent transformer detectors (Deformable DETR, DINO, DN-DETR, Mask DINO) improved training speed and accuracy and have largely become the new state of the art on COCO. Faster R-CNN remains common as a baseline and as a starting point for transfer learning, but cutting-edge research increasingly uses transformer detectors.

How is Faster R-CNN implemented in practice?

Reference implementations of Faster R-CNN are easy to find and well-tested.

Toolkit	Maintainer	Notes
Original MATLAB code	Shaoqing Ren (github.com/ShaoqingRen/faster_rcnn)	The Caffe + MATLAB implementation accompanying the 2015 paper.
py-faster-rcnn	Ross Girshick (github.com/rbgirshick/py-faster-rcnn)	The widely used Python/Caffe port; the de facto reference implementation through 2017.
Detectron / Detectron2	Facebook AI Research / Meta	Detectron2 is the modern PyTorch rewrite; provides Faster R-CNN with C4, DC5, and FPN heads on ResNet/ResNeXt backbones.
MMDetection	OpenMMLab	A modular PyTorch detection framework; ships dozens of Faster R-CNN configurations.
TensorFlow Object Detection API	Google	Provides Faster R-CNN with Inception, ResNet, and other backbones; widely used for transfer learning.
torchvision	PyTorch	The `torchvision.models.detection.fasterrcnn_resnet50_fpn` model gives a one-line ResNet-50-FPN Faster R-CNN.

A typical transfer-learning workflow uses one of the above pretrained models, replaces the final classification and box-regression layers with new ones for the target classes, and fine-tunes on a smaller dataset. Faster R-CNN tends to do well in this regime even with relatively few labeled images per class, which is why it is still a popular baseline in industrial and academic projects on the COCO dataset, PASCAL VOC, and custom domains.

What are the strengths and limitations of Faster R-CNN?

Strengths. Faster R-CNN delivers strong accuracy on the standard benchmarks of its era and remains competitive when paired with modern backbones. The two-stage pipeline cleanly separates where objects might be from what they are, which makes the network easier to debug and tune than dense one-stage detectors. The architecture is modular: swap the backbone, attach FPN, replace RoI Pool with RoIAlign, add a mask head, and the rest of the pipeline keeps working. Pretrained Faster R-CNN models transfer well to small custom datasets, which is why it remained a default starting point in many production CV pipelines well into the 2020s.

Limitations. The two-stage pipeline is inherently slower than dense one-stage detectors. The hand-engineered anchor design is sensitive: in domains with very small or very elongated objects the default anchors do poorly without retuning. RoI processing is per-region, which is hard to batch efficiently on the GPU and contributes to latency at high proposal counts. Because the RPN sees only one feature scale in the original formulation, very small objects are often missed; FPN largely resolves this but adds computation. ^[7] The RPN also depends on a non-maximum suppression step that is non-differentiable and can suppress true positives in crowded scenes. End-to-end transformer detectors like DETR address several of these issues by removing anchors and NMS entirely. ^[11]

Influence

Faster R-CNN has been cited well over 60,000 times on Google Scholar and is part of every survey of object detection published since 2016. Its specific contributions, the RPN, the anchor-and-deltas formulation, and the multi-task RPN + Fast R-CNN loss, all became standard vocabulary. The same architecture, extended with a mask head and RoIAlign, won the ICCV 2017 best paper award as Mask R-CNN; with FPN it formed the backbone of the COCO 2017 winning entries. The extended journal version of the 2015 paper appeared in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) in June 2017 and is the most commonly cited version today. ^[2]

References

1. Ren, S., He, K., Girshick, R., Sun, J. (2015). "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks". *Advances in Neural Information Processing Systems* 28: 91-99. arXiv:1506.01497. https://arxiv.org/abs/1506.01497.
2. Ren, S., He, K., Girshick, R., Sun, J. (2017). "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks". *IEEE Transactions on Pattern Analysis and Machine Intelligence* 39(6): 1137-1149. The extended journal version of the 2015 paper.
3. Girshick, R., Donahue, J., Darrell, T., Malik, J. (2014). "Rich feature hierarchies for accurate object detection and semantic segmentation". CVPR 2014. arXiv:1311.2524.
4. Girshick, R. (2015). "Fast R-CNN". ICCV 2015. arXiv:1504.08083.
5. He, K., Zhang, X., Ren, S., Sun, J. (2014). "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition" (SPPnet). ECCV 2014. arXiv:1406.4729.
6. He, K., Gkioxari, G., Dollar, P., Girshick, R. (2017). "Mask R-CNN". ICCV 2017 Best Paper. arXiv:1703.06870.
7. Lin, T.-Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S. (2017). "Feature Pyramid Networks for Object Detection". CVPR 2017. arXiv:1612.03144.
8. Cai, Z., Vasconcelos, N. (2018). "Cascade R-CNN: Delving into High Quality Object Detection". CVPR 2018. arXiv:1712.00726.
9. Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollar, P. (2017). "Focal Loss for Dense Object Detection" (RetinaNet). ICCV 2017. arXiv:1708.02002.
10. Dai, J., Li, Y., He, K., Sun, J. (2016). "R-FCN: Object Detection via Region-based Fully Convolutional Networks". NeurIPS 2016. arXiv:1605.06409.
11. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S. (2020). "End-to-End Object Detection with Transformers" (DETR). ECCV 2020. arXiv:2005.12872.
12. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C. L. (2014). "Microsoft COCO: Common Objects in Context". ECCV 2014. arXiv:1405.0312.
13. Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., Zisserman, A. (2010). "The PASCAL Visual Object Classes (VOC) Challenge". *International Journal of Computer Vision* 88(2): 303-338.
14. Detectron2 Model Zoo. Facebook AI Research. https://github.com/facebookresearch/detectron2/blob/main/MODEL_ZOO.md.
15. MMDetection documentation. OpenMMLab. https://mmdetection.readthedocs.io.
16. torchvision detection models documentation. https://docs.pytorch.org/vision/main/models/faster_rcnn.html.
17. Ren, S. Faster R-CNN MATLAB implementation. https://github.com/ShaoqingRen/faster_rcnn.
18. Girshick, R. py-faster-rcnn. https://github.com/rbgirshick/py-faster-rcnn.
19. Region Based Convolutional Neural Networks. Wikipedia. https://en.wikipedia.org/wiki/Region_Based_Convolutional_Neural_Networks.
20. Weng, L. (2017). "Object Detection for Dummies Part 3: R-CNN Family". https://lilianweng.github.io/posts/2017-12-31-object-recognition-part-3/.
