R-CNN (Regions with CNN features)
Last reviewed
May 1, 2026
Sources
21 citations
Review status
Source-backed
Revision
v1 · 3,999 words
R-CNN (short for Regions with CNN features) is a two-stage object detection method introduced by Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik at UC Berkeley. It was first posted to arXiv as technical report 1311.2524 on 11 November 2013 and presented as an oral at CVPR 2014 under the title "Rich feature hierarchies for accurate object detection and semantic segmentation". R-CNN combined bottom-up region proposals (Selective Search) with high-capacity convolutional neural network features and was the first deep-learning approach to convincingly beat hand-crafted feature methods on PASCAL VOC, raising mean average precision (mAP) on PASCAL VOC 2012 from the prior best of 40.4% to 53.3%, a relative improvement of more than 30%.
The paper is widely considered the moment when deep learning took over object detection. It launched the "R-CNN family" of two-stage detectors, including SPPnet, Fast R-CNN, Faster R-CNN, Mask R-CNN, and Cascade R-CNN, and has been cited over 30,000 times on Google Scholar. The original R-CNN paper received the 2024 Longuet-Higgins Prize from the IEEE Computer Society, awarded for fundamental contributions in computer vision that have stood the test of time. Although R-CNN itself has been superseded for production use, every modern two-stage detector traces its lineage to this 2014 system.
In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton's AlexNet won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), dropping top-5 image-classification error from roughly 26% to 15.3%. That result revolutionized image classification overnight, but its impact on related tasks was less obvious. Object detection in particular was still dominated by hand-crafted features. The reigning approach was Pedro Felzenszwalb's Deformable Parts Model (DPM), combining Histogram of Oriented Gradients (HOG) descriptors with a star-structured part-based model trained by latent SVM. DPM had won the PASCAL VOC "Lifetime Achievement" prize in 2010, and HOG/DPM detectors had crept up to 33-40% mAP after several years of incremental work.
The central question in late 2012 and early 2013 was whether the deep features that had transformed classification could also transform detection. OverFeat, by Pierre Sermanet and others at NYU, was an early attempt that ran a CNN as a sliding-window detector. R-CNN took a different route: rather than slide a CNN over every possible window, let an inexpensive bottom-up segmentation algorithm propose a few thousand candidate regions and then score each one with a deep CNN. The proposal algorithm of choice was Selective Search, introduced by Jasper Uijlings, Koen van de Sande, Theo Gevers, and Arnold Smeulders in their 2013 International Journal of Computer Vision paper. Selective Search produced about 2,000 category-independent proposals per image with around 99% recall on PASCAL VOC. The combination, region proposals plus CNN features, worked shockingly well: the first arXiv version of R-CNN appeared 11 November 2013, and by CVPR 2014 the entire research field was pivoting to deep learning for detection.
All four authors of the R-CNN paper were at UC Berkeley at the time the work was done.
| Author | Role | Affiliation in 2013-2014 |
|---|---|---|
| Ross B. Girshick | Lead author | Postdoctoral researcher, EECS, UC Berkeley (joined Microsoft Research in 2014) |
| Jeff Donahue | Co-author, also led the parallel DeCAF features work | PhD student, EECS, UC Berkeley |
| Trevor Darrell | Advisor | Professor of EECS, UC Berkeley |
| Jitendra Malik | Advisor | Arthur J. Chick Professor of EECS, UC Berkeley |
Girshick had completed his PhD at the University of Chicago under Pedro Felzenszwalb, working on the very DPM detector that R-CNN would supplant. He went on to lead Fast R-CNN at Microsoft Research, then to join Facebook AI Research in 2015 and the Allen Institute for AI in 2023. Donahue's contemporaneous DeCAF paper showed that AlexNet features transferred well to many downstream tasks, providing much of the conceptual basis for R-CNN's transfer-learning step. Malik in particular had been pushing his students toward CNN-based detection in late 2012 and 2013.
R-CNN is a four-module pipeline plus a non-maximum suppression post-processing step.
For each input image, Selective Search produces about 2,000 category-independent rectangular proposals. The algorithm first oversegments the image into many small regions using the graph-based method of Felzenszwalb and Huttenlocher, then iteratively merges adjacent regions that are similar in color, texture, and size, or that fit well into each other (the "fill" measure). Each merged region in the hierarchy contributes a candidate bounding box. R-CNN uses Selective Search in its "fast mode"; it runs on the CPU and takes around 2 seconds per image, which becomes a dominant fixed cost of the system.
Each proposal is cropped and warped to a fixed 227 by 227 pixel input, regardless of its original aspect ratio. The paper experimented with several warping strategies and settled on anisotropic warp with 16 pixels of context padding around the original proposal. The warped crop is pushed through a CNN that produces a 4,096-dimensional feature vector at the second-to-last fully connected layer (fc7). The original R-CNN used Krizhevsky et al.'s AlexNet architecture: 5 convolutional layers, 2 fully connected layers (fc6 and fc7, each 4,096-d), and a final 1,000-way ImageNet classification layer that is discarded for detection. The paper also explored Matthew Zeiler and Rob Fergus's improved CNN. In follow-on work using the deeper VGG-16 architecture by Simonyan and Zisserman, the same pipeline reached substantially higher accuracy at a much higher compute cost.
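The context-padding step can be made concrete with a little geometry. The sketch below is an illustration rather than the authors' code: it assumes boxes are given as `(x1, y1, x2, y2)` in image coordinates, and the function name `dilate_for_warp` is hypothetical. It enlarges a proposal so that, once the crop is warped to 227 by 227, the original box is surrounded by 16 warped pixels of context. Handling of crops that extend past the image border is omitted.

```python
def dilate_for_warp(box, out_size=227, pad=16):
    """Expand a proposal so the warped crop has `pad` pixels of context.

    box is (x1, y1, x2, y2) in image coordinates (an assumed convention).
    Returns the dilated box and the anisotropic (sx, sy) warp scale factors.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    if w <= 0 or h <= 0:
        raise ValueError("box must have positive width and height")

    # After warping, the original box should fill out_size - 2 * pad pixels.
    inner = out_size - 2 * pad

    # Context measured in image pixels; it differs per axis because the
    # warp is anisotropic (aspect ratio is not preserved).
    pad_x = pad * w / inner
    pad_y = pad * h / inner

    dilated = (x1 - pad_x, y1 - pad_y, x2 + pad_x, y2 + pad_y)
    sx = out_size / (w + 2 * pad_x)
    sy = out_size / (h + 2 * pad_y)
    return dilated, (sx, sy)


if __name__ == "__main__":
    # A proposal twice as wide as it is tall is squeezed more in x than in y.
    print(dilate_for_warp((100, 50, 490, 245)))
```

Because the two scale factors differ whenever the proposal is not square, objects appear stretched or squashed in the network input, which is the price of the fixed-size input.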
For each object category there is a separate binary linear SVM that takes the 4,096-d feature vector and outputs a class confidence score. With 20 categories on PASCAL VOC, that is 20 SVMs. The SVMs are trained on features extracted from all proposals in the training set, with hard negative mining.
A per-class linear bounding-box regressor takes the pool5 features of a proposal and predicts four offsets (dx, dy, dw, dh) refining the proposal to better match the ground-truth box. Bounding-box regression added a substantial chunk of mAP, lifting AlexNet R-CNN from 50.2% to 53.7% on PASCAL VOC 2010 and from 54.2% to 58.5% on PASCAL VOC 2007. Finally, non-maximum suppression is applied per class: detections are sorted by SVM score and any detection with IoU greater than a threshold (typically 0.3) against a higher-scoring detection of the same class is suppressed.
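The per-class suppression step described above can be sketched in a few lines. This is a minimal greedy implementation in plain Python, assuming boxes in `(x1, y1, x2, y2)` form and scores for a single class; the helper names `iou` and `nms` are illustrative and not taken from the R-CNN code base.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.3):
    """Greedy non-maximum suppression for one class.

    Returns the indices of the kept boxes, highest score first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any
        # higher-scoring box that has already been kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))
```

In a full detector this routine would be run once per class, since a box suppressed for one category may still be the best detection for another.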
R-CNN's training is famously a multi-stage pipeline, in contrast to the end-to-end training of its successors.
The CNN is pre-trained on the ImageNet ILSVRC 2012 classification task using image-level labels only (1.2 million training images, 1,000 classes). The authors trained the network themselves with the open-source Caffe library, following Krizhevsky et al.'s architecture and coming close to its published classification accuracy. The transfer-learning insight, that a CNN trained on a generic large image-classification task could be adapted to detection with much less labeled data, was one of the paper's most influential contributions. The paper notes that the "supervised pre-training/domain-specific fine-tuning paradigm will be highly effective for a variety of data-scarce vision problems." Nearly every CNN-based vision system released between 2014 and 2020 followed the same recipe.
The pre-trained CNN is then fine-tuned on the target detection dataset (e.g., PASCAL VOC trainval). The 1,000-way ImageNet classifier head is replaced with a randomly initialized (N+1)-way classifier (N object classes plus background; N = 20 for PASCAL VOC, so the new head is 21-way). Fine-tuning treats Selective Search proposals as training images. A proposal with intersection-over-union (IoU) of at least 0.5 with any ground-truth box of class C is labeled as a positive for class C; proposals with IoU less than 0.5 with all ground-truth boxes are labeled as background. Mini-batches of size 128 are built from 32 positive windows (across all classes) and 96 background windows. Stochastic gradient descent runs at a learning rate of 0.001, one tenth of the initial pre-training rate, to avoid overwriting the pre-trained weights.
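The labeling rule and the 32/96 mini-batch composition can be sketched as follows. This is an illustrative sketch, not the original training code: it assumes the maximum IoU of each proposal with the ground truth has already been computed, uses class indices 1..N with 0 for background, and falls back to sampling with replacement when a pool is too small (a detail the text above does not specify). The function names are hypothetical.

```python
import random


def finetune_label(max_iou, gt_class, pos_threshold=0.5):
    """Class index (1..N) if the proposal overlaps a ground-truth box of
    that class by at least pos_threshold, otherwise 0 (background)."""
    return gt_class if max_iou >= pos_threshold else 0


def _draw(pool, k, rng):
    """Draw k indices from pool; with replacement only if pool is too small
    (an assumption made for this sketch)."""
    if not pool:
        raise ValueError("cannot sample from an empty pool")
    if len(pool) >= k:
        return rng.sample(pool, k)
    return rng.choices(pool, k=k)


def sample_minibatch(labels, n_pos=32, n_bg=96, seed=0):
    """Return 128 proposal indices: 32 positives followed by 96 background."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y > 0]
    bg = [i for i, y in enumerate(labels) if y == 0]
    return _draw(pos, n_pos, rng) + _draw(bg, n_bg, rng)


if __name__ == "__main__":
    # 50 proposals overlapping class 3 well, 500 that barely overlap anything.
    labels = [finetune_label(0.7, 3)] * 50 + [finetune_label(0.2, 3)] * 500
    batch = sample_minibatch(labels)
    print(len(batch), sum(labels[i] > 0 for i in batch))
```

Fixing the positive fraction at one quarter biases each batch toward positives, which are rare compared with background windows among Selective Search proposals.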
After fine-tuning, the (N+1)-way classifier is discarded and the 4,096-d fc7 features become the SVM inputs. SVM training uses a different labeling rule than fine-tuning, which the paper acknowledges as a quirk: positives are only the ground-truth boxes themselves; negatives are proposals with IoU less than 0.3 to every ground-truth box of the class; all remaining proposals (IoU of 0.3 or more, but not themselves ground truth) are ignored. The 0.3 threshold was chosen by grid search; setting it to 0.5 reduced mAP by 5 points and setting it to 0 reduced mAP by 4 points. To handle the very large number of negatives, the paper uses hard negative mining: a few rounds of training and re-evaluation, each round adding the most-confidently-misclassified negatives to the training set. Hard negative mining converges quickly; the paper notes that mAP stops increasing after a single pass over all images.
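The SVM labeling rule and one round of hard-negative selection can be sketched as below. The function names are hypothetical, and the selection criterion (keep negatives that score above the SVM margin of -1) is a common convention assumed here for illustration; the text above only says that the most confidently misclassified negatives are added.

```python
def svm_label(max_iou, is_ground_truth, neg_threshold=0.3):
    """Label a proposal for one class's SVM.

    Returns +1 for ground-truth boxes, -1 for clear negatives, and None
    for proposals that are left out of SVM training.
    """
    if is_ground_truth:
        return 1
    if max_iou < neg_threshold:
        return -1
    return None


def select_hard_negatives(scores, margin=-1.0):
    """Indices of negatives whose current SVM score exceeds the margin,
    hardest (highest score) first. The margin test is an assumption."""
    hard = [i for i, s in enumerate(scores) if s > margin]
    return sorted(hard, key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    print(svm_label(1.0, True), svm_label(0.1, False), svm_label(0.4, False))
    print(select_hard_negatives([-2.0, -0.5, 0.3, -1.0]))
```

A mining round would score every cached negative with the current SVM, append the selected indices to the training set, and retrain.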
A per-class linear regressor is trained on pool5 features to predict the four bounding-box offsets. The regressor uses ridge regression with a regularization strength of 1,000 and only sees proposals with IoU of at least 0.6 with a ground-truth box.
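The four offsets are defined relative to the proposal's center and size, with width and height handled in log space so that the targets are scale-invariant. The sketch below shows how training targets could be computed from a (proposal, ground-truth) pair and how predicted offsets are applied at test time. It assumes `(x1, y1, x2, y2)` boxes and uses hypothetical function names; the ridge-regression fit that maps pool5 features to these targets is omitted.

```python
import math


def to_center(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + w / 2, y1 + h / 2, w, h


def regression_targets(proposal, ground_truth):
    """Offsets (dx, dy, dw, dh) that move the proposal onto the ground truth."""
    px, py, pw, ph = to_center(proposal)
    gx, gy, gw, gh = to_center(ground_truth)
    return (
        (gx - px) / pw,        # center shift, in units of proposal width
        (gy - py) / ph,        # center shift, in units of proposal height
        math.log(gw / pw),     # log width ratio
        math.log(gh / ph),     # log height ratio
    )


def apply_offsets(proposal, offsets):
    """Refine a proposal with predicted (dx, dy, dw, dh) offsets."""
    px, py, pw, ph = to_center(proposal)
    dx, dy, dw, dh = offsets
    cx, cy = px + pw * dx, py + ph * dy
    w, h = pw * math.exp(dw), ph * math.exp(dh)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


if __name__ == "__main__":
    proposal, truth = (10, 20, 50, 100), (14, 28, 62, 92)
    t = regression_targets(proposal, truth)
    print(t)
    print(apply_offsets(proposal, t))
```

Applying the exact targets to the proposal recovers the ground-truth box, which is a convenient sanity check for any implementation of this parameterization.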
A practical curiosity of R-CNN training is the disk-space cost. Because the CNN is treated as a fixed feature extractor for SVM and regressor training, the 4,096-d features for every proposal in the training set must be stored. For PASCAL VOC trainval the cache occupies tens of gigabytes; for the full ILSVRC 2013 detection set, it can reach hundreds of gigabytes. This was one of the explicit motivations for the end-to-end design of Fast R-CNN, which never materializes per-region features outside the GPU.
The R-CNN paper reported state-of-the-art numbers on three benchmarks: PASCAL VOC 2010, PASCAL VOC 2012, and ILSVRC 2013 detection. It also used PASCAL VOC 2007 for its ablation studies. Exact numbers depend on the CNN backbone (AlexNet vs. VGG-16) and on whether bounding-box regression is used.
| Dataset | Method (backbone) | mAP@0.5 |
|---|---|---|
| VOC 2007 test | DPM v5 (HOG, no CNN) | 33.7% |
| VOC 2007 test | R-CNN (AlexNet + bbox reg.) | 58.5% |
| VOC 2007 test | R-CNN (VGG-16 + bbox reg.) | 66.0% |
| VOC 2010 test | DPM (Felzenszwalb et al.) | 33.4% |
| VOC 2010 test | UVA (Selective Search + Bag of Words) | 35.1% |
| VOC 2010 test | R-CNN (AlexNet, no bbox reg.) | 50.2% |
| VOC 2010 test | R-CNN (AlexNet + bbox reg.) | 53.7% |
| VOC 2012 test | DPM (prior published best) | 40.4% |
| VOC 2012 test | R-CNN (AlexNet + bbox reg.) | 53.3% |
| VOC 2012 test | R-CNN (VGG-16 + bbox reg.) | 62.4% |
The headline 53.3% mAP on PASCAL VOC 2012 was a more than 30% relative improvement over the published state of the art. With the deeper VGG-16 backbone, the same pipeline reached 66.0% on VOC 2007, a number that would have seemed implausible only a year earlier.
R-CNN also reported the best result on the ILSVRC 2013 200-class detection dataset by a large margin.
| Method | ILSVRC 2013 detection mAP |
|---|---|
| OverFeat (NYU, sliding-window CNN) | 24.3% |
| UvA-Euvision (van de Sande et al.) | 22.6% |
| R-CNN (Girshick et al., AlexNet) | 31.4% |
R-CNN's 31.4% mAP was 7.1 points ahead of OverFeat, the strongest competing result.
R-CNN's accuracy gains came at a steep computational cost. Test-time numbers per image (including Selective Search and CNN feature extraction):
| Backbone | Hardware | Time per image |
|---|---|---|
| AlexNet | NVIDIA Tesla K20 GPU | ~13 seconds |
| AlexNet | CPU only | ~53 seconds |
| VGG-16 | NVIDIA K40 GPU | ~47 seconds |
The per-class SVM and regressor steps are negligible compared with the CNN forward pass. The cost is dominated by running the CNN once per proposal, roughly 2,000 times per image, with no shared computation. Every successor in the R-CNN family attacked this bottleneck.
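R-CNN was a research breakthrough but a pragmatic mess, and the paper itself is candid about the limitations. Training is a multi-stage pipeline (fine-tuning, then SVMs, then bounding-box regressors) rather than a single optimization; the cached features consume large amounts of disk space; inference runs a full CNN forward pass for each of roughly 2,000 proposals with no shared computation; every proposal is warped to a fixed input size regardless of its aspect ratio; and the proposals come from a separate, CPU-bound Selective Search step that is not learned.
Every one of these limitations was addressed by a subsequent paper in the R-CNN family.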
R-CNN started a research lineage in which each follow-up removed a bottleneck while preserving the two-stage structure of "propose, then classify and regress."
| Model | Year, venue | Region proposals | Test speed | Headline result | Key idea |
|---|---|---|---|---|---|
| R-CNN | CVPR 2014, Girshick et al. | Selective Search (~2,000/image, CPU) | ~13 s/image (AlexNet, GPU); ~47 s/image (VGG-16, K40 GPU) | 53.3% mAP on VOC 2012 (AlexNet) | CNN features on Selective Search proposals; pretrain on ImageNet, fine-tune on detection. |
| SPPnet | ECCV 2014, He et al. | Selective Search | ~24-102x faster than R-CNN | Comparable to R-CNN on VOC 2007 | Share CNN computation across all proposals via spatial pyramid pooling. |
| Fast R-CNN | ICCV 2015, Girshick | Selective Search | 0.3 s/image (excluding proposal time); 213x faster than R-CNN | 70.0% mAP on VOC 2007 (VGG-16) | End-to-end multi-task training; softmax replaces SVM; RoI Pooling. |
| Faster R-CNN | NeurIPS 2015, Ren et al. | Region Proposal Network shares conv features | ~5 fps (VGG-16), ~17 fps (ZF) | 73.2% mAP on VOC 2007 (VGG-16) | Replace Selective Search with a learned RPN; share features. |
| Mask R-CNN | ICCV 2017, He et al. | RPN | ~5 fps | 38.2 box / 35.7 mask AP on COCO (R-101-FPN) | Adds instance segmentation; RoIAlign instead of RoI Pool. |
| Cascade R-CNN | CVPR 2018, Cai & Vasconcelos | RPN | similar to Faster R-CNN | ~42.8 AP on COCO (R-101-FPN) | Cascade of heads trained at progressively higher IoU thresholds. |
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun's SPPnet at ECCV 2014 asked the obvious question: why run the CNN 2,000 times when you could run it once? Their answer was a spatial pyramid pooling layer that pooled features from arbitrary rectangular regions of a single shared feature map at multiple scales, producing a fixed-length descriptor per proposal. SPPnet was 24 to 102 times faster than R-CNN at test time with similar or better accuracy on PASCAL VOC 2007.
Girshick's Fast R-CNN at ICCV 2015 generalized SPPnet and made the whole detector end-to-end trainable. RoI Pooling divided each proposal's feature region into a fixed grid (typically 7 by 7) and max-pooled within each cell. The pooled feature was fed through two fully connected layers, then split into a softmax classifier (K+1 classes) and a per-class bounding-box regressor, all trained with a single multi-task loss. Fast R-CNN was 9x faster to train and 213x faster at test time than R-CNN with the same VGG-16 backbone. The remaining bottleneck was Selective Search, which still ran on the CPU.
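RoI Pooling as described above can be illustrated on a single-channel feature map. The sketch below is a simplified, plain-Python version: it assumes the RoI is already expressed in integer feature-map coordinates with inclusive corners, and the name `roi_max_pool` is hypothetical. Real implementations operate on multi-channel tensors and also project the RoI from image space onto the feature map.

```python
def roi_max_pool(feature_map, roi, out_h=7, out_w=7):
    """Max-pool one RoI of a 2-D feature map into an out_h x out_w grid.

    feature_map is a list of rows; roi is (x1, y1, x2, y2) with inclusive
    integer feature-map coordinates (an assumed convention).
    """
    x1, y1, x2, y2 = roi
    roi_h = y2 - y1 + 1
    roi_w = x2 - x1 + 1
    pooled = []
    for i in range(out_h):
        # Integer floor / ceil bin edges; bins may overlap by one cell
        # when the RoI size is not divisible by the grid size.
        ys = y1 + (i * roi_h) // out_h
        ye = y1 + -(-((i + 1) * roi_h) // out_h)
        row = []
        for j in range(out_w):
            xs = x1 + (j * roi_w) // out_w
            xe = x1 + -(-((j + 1) * roi_w) // out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


if __name__ == "__main__":
    fmap = [[4 * y + x for x in range(4)] for y in range(4)]
    print(roi_max_pool(fmap, (0, 0, 3, 3), out_h=2, out_w=2))
```

Whatever the RoI's size, the output grid has a fixed shape, which is what allows the fully connected layers to follow a shared convolutional feature map.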
Faster R-CNN by Ren, He, Girshick, and Sun (NeurIPS 2015) removed the external Selective Search step with a Region Proposal Network (RPN), a small fully convolutional sub-network that ran on the shared feature map and predicted candidate boxes directly using anchors of multiple scales and aspect ratios. Because the RPN ran on the GPU and shared most of its computation with the detector, proposals became almost free: about 10 ms per image instead of 2 seconds. The full system reached 5 fps with VGG-16 on a K40 GPU and 73.2% mAP on PASCAL VOC 2007.
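The anchor mechanism can be sketched by enumerating reference boxes at every feature-map location. The values below (stride 16, scales 128/256/512, aspect ratios 1:2, 1:1, 2:1) are the commonly cited defaults and are stated here as assumptions, as is the choice to center anchors on the middle of each stride cell; the function name `make_anchors` is hypothetical.

```python
import math


def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Enumerate anchors as (x1, y1, x2, y2) in image coordinates.

    Each anchor has area scale**2 and height/width ratio `ratio`.
    """
    anchors = []
    for iy in range(feat_h):
        for ix in range(feat_w):
            # Center of this feature-map cell in image pixels (assumed).
            cx = (ix + 0.5) * stride
            cy = (iy + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s / math.sqrt(r)
                    h = s * math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors


if __name__ == "__main__":
    anchors = make_anchors(2, 3)
    print(len(anchors))
    print(anchors[0])
```

The RPN then predicts, for each anchor, an objectness score and four box offsets, so the number of candidate boxes grows with the feature-map size rather than with a separate segmentation pass.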
Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick extended Faster R-CNN in 2017 with a third output branch that produced a binary segmentation mask for each region of interest. Mask R-CNN also replaced RoI Pooling with RoIAlign, a bilinear-interpolation crop that avoided the integer rounding of RoI Pool and substantially improved mask quality. It won the ICCV 2017 Best Paper Award.

Other refinements followed. R-FCN (Dai et al., NeurIPS 2016) used position-sensitive score maps to remove the per-RoI fully connected layers. Cascade R-CNN (Cai and Vasconcelos, CVPR 2018) chained multiple heads trained at progressively higher IoU thresholds, lifting COCO AP by about 4 points. Sparse R-CNN, Dynamic Head, HTC, and Light-Head R-CNN continued to refine the architecture into the early 2020s.
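The difference between RoIAlign and RoI Pooling comes down to sampling the feature map at fractional coordinates instead of rounding them. The sketch below illustrates this on a single-channel map with one bilinear sample at the center of each output bin. That single-sample choice is a simplification assumed here for brevity, and the function names are hypothetical.

```python
import math


def bilinear(feature_map, x, y):
    """Bilinearly interpolate a 2-D feature map at fractional (x, y)."""
    h, w = len(feature_map), len(feature_map[0])
    x = min(max(x, 0.0), w - 1.0)   # clamp to the valid range
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = feature_map[y0][x0] * (1 - ax) + feature_map[y0][x1] * ax
    bottom = feature_map[y1][x0] * (1 - ax) + feature_map[y1][x1] * ax
    return top * (1 - ay) + bottom * ay


def roi_align(feature_map, roi, out_h=7, out_w=7):
    """Sample one value per output bin at the bin center.

    roi is (x1, y1, x2, y2) in continuous feature-map coordinates; no
    coordinate is rounded at any point.
    """
    x1, y1, x2, y2 = roi
    bin_w = (x2 - x1) / out_w
    bin_h = (y2 - y1) / out_h
    return [[bilinear(feature_map,
                      x1 + (j + 0.5) * bin_w,
                      y1 + (i + 0.5) * bin_h)
             for j in range(out_w)]
            for i in range(out_h)]


if __name__ == "__main__":
    fmap = [[4 * y + x for x in range(4)] for y in range(4)]
    print(roi_align(fmap, (0.0, 0.0, 3.0, 3.0), out_h=2, out_w=2))
```

Because the sample positions vary smoothly with the RoI coordinates, small shifts of the box produce small changes in the pooled features, which matters most for pixel-accurate outputs such as masks.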
From around 2016 onward a parallel line of one-stage detectors emerged, trading a few mAP points for higher frame rates by skipping the explicit proposal step.
| Family | Lead papers | Approach | Trade-off vs. R-CNN family |
|---|---|---|---|
| YOLO | Redmon et al., CVPR 2016 (v1) | Single CNN predicts boxes and class scores from a fixed grid in one pass | Much faster (45+ fps in v1); lower accuracy initially, competitive later |
| SSD | Liu et al., ECCV 2016 | Single CNN with multi-scale feature maps and default boxes per location | Faster than Faster R-CNN; competitive on PASCAL VOC |
| RetinaNet | Lin, Goyal, Girshick, He, Dollár, ICCV 2017 | Single-stage detector with focal loss | Closes accuracy gap with two-stage detectors; 39.1 AP on COCO (R-101-FPN) |
| DETR | Carion et al., ECCV 2020 | End-to-end transformer with object queries; no anchors, no NMS | Removes proposal-and-NMS machinery; later transformer detectors became state of the art from 2021 |
Two-stage detectors of the R-CNN family generally win on accuracy in cluttered or small-object regimes; one-stage detectors win on speed and simplicity. From 2020 the transformer-based DETR family began to displace both for cutting-edge research, although Faster R-CNN and Mask R-CNN remain widely used as transfer-learning starting points.
The original R-CNN code was released by Ross Girshick on GitHub in May 2014 under the Simplified BSD License at github.com/rbgirshick/rcnn. It is implemented in MATLAB on top of an early version of Caffe, the deep learning framework developed at UC Berkeley by Yangqing Jia and others. The repository requires Caffe v0.999 and is no longer actively maintained, but it is preserved as the historical artifact for the CVPR 2014 paper and the extended TPAMI 2016 journal version.
| Toolkit | Maintainer | Coverage |
|---|---|---|
| Original R-CNN | Ross Girshick (github.com/rbgirshick/rcnn) | Reference MATLAB + Caffe code. |
| py-faster-rcnn | Ross Girshick | Python/Caffe port of Faster R-CNN; reference through 2017. |
| Detectron / Detectron2 | Facebook AI Research / Meta | Modern PyTorch implementation of Fast / Faster / Mask / Cascade R-CNN. |
| MMDetection | OpenMMLab | Modular PyTorch detection framework. |
| TensorFlow Object Detection API | Google | Faster R-CNN and Mask R-CNN. |
| torchvision | PyTorch | One-line fasterrcnn_resnet50_fpn model. |
For practitioners who want R-CNN-style object detection today, Faster R-CNN with a ResNet-50-FPN backbone in Detectron2 or torchvision is the closest spiritual descendant of the 2014 system, and it runs hundreds of times faster while being substantially more accurate.
R-CNN had an immediate and lasting effect on computer vision research. It was the first paper to convincingly demonstrate that the CNN revolution from ImageNet 2012 would carry over to object detection, and it set the template, backbone CNN + region proposals + per-class classifier head, that defined two-stage detection for the rest of the decade. The paper has over 30,000 Google Scholar citations, won the 2024 Longuet-Higgins Prize from the IEEE Computer Society, and inspired at least five major direct-descendant papers (SPPnet, Fast R-CNN, Faster R-CNN, Mask R-CNN, Cascade R-CNN). It also established the "pretrain on ImageNet, fine-tune on the target task" transfer-learning recipe as a default for vision tasks for nearly a decade.
By 2017 R-CNN itself had largely been displaced in production by Faster R-CNN and Mask R-CNN, which were faster and more accurate. From around 2020 the transformer-based DETR family began to take over the cutting-edge benchmarks. Even so, R-CNN remains the historical reference point for deep-learning object detection. Surveys routinely begin with R-CNN as the dividing line between the hand-crafted-feature era (HOG, DPM, Selective Search + Bag of Words) and the modern CNN-based era. The two key insights of the paper, that bottom-up region proposals plus high-capacity CNN features could outperform sliding-window methods, and that ImageNet pre-training plus task-specific fine-tuning could close the data gap for detection, are still widely used over a decade later.