Mask R-CNN
Last reviewed
Apr 30, 2026
Sources
22 citations
Review status
Source-backed
Revision
v1 · 3,580 words
Mask R-CNN is a deep convolutional neural network for instance segmentation introduced by Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick at Facebook AI Research. The paper appeared at the 2017 IEEE International Conference on Computer Vision (ICCV) in Venice and won the Marr Prize for Best Paper. The method extends the Faster R-CNN two-stage detector by adding a third parallel head that predicts a binary segmentation mask for every region of interest, alongside the existing classification and bounding-box regression branches. The result is a single network that performs object detection and pixel-level instance segmentation together. The same architecture, with the mask head replaced by a heatmap head, also achieved state-of-the-art human keypoint detection at the time of publication.
Facebook AI Research released Mask R-CNN as open source, first inside the Detectron framework (Caffe2) and later in Detectron2 (PyTorch). It quickly became the default baseline for instance segmentation research and held that position for years. The paper has been cited tens of thousands of times, and its central technical contribution, RoIAlign, is now standard machinery in two-stage detectors.
Mask R-CNN is the fourth model in a lineage of region-based convolutional networks from Ross Girshick and collaborators. Each step in the family addresses a bottleneck in the previous one.
| Model | Year | Authors | Key idea | Main limitation |
|---|---|---|---|---|
| R-CNN | 2014 | Girshick, Donahue, Darrell, Malik | Run a CNN on each of ~2,000 selective-search proposals; SVMs classify | Per-region forward pass |
| SPPnet | 2014 | He, Zhang, Ren, Sun | Spatial pyramid pooling shares conv features across proposals | Multi-stage training |
| Fast R-CNN | 2015 | Girshick | Single shared CNN, RoI pooling, single-stage training, end-to-end except proposals | Selective search proposals |
| Faster R-CNN | 2015 | Ren, He, Girshick, Sun | Region Proposal Network (RPN) replaces selective search; trained jointly | Coarse RoIPool quantisation |
| Mask R-CNN | 2017 | He, Gkioxari, Dollár, Girshick | Adds parallel mask branch; replaces RoIPool with RoIAlign | Two-stage, anchor-based |
Faster R-CNN provided the detection scaffolding. The contribution of Mask R-CNN was to show that pixel-level instance segmentation could be added cheaply, by reusing the same RoI features the box head already consumed, and that doing so even improved bounding-box accuracy. The paper also showed that the architecture generalises beyond masks: replacing the mask branch with a keypoint heatmap head yields a competitive human pose estimator (Keypoint R-CNN). Later work replaced it with a dense surface coordinate predictor to obtain DensePose.
Mask R-CNN keeps the overall Faster R-CNN design, replaces RoIPool with RoIAlign, and bolts on a small fully convolutional mask head.
The backbone extracts features from the input image. The original paper experiments with three options.
| Backbone | Notes |
|---|---|
| ResNet-50 / ResNet-101 (C4) | Faster R-CNN style, head at the C4 layer |
| ResNet-50-FPN / ResNet-101-FPN | Multi-scale features via Feature Pyramid Network |
| ResNeXt-101-FPN | Best accuracy in the paper, used for headline numbers |
The Feature Pyramid Network (FPN) of Lin et al. (CVPR 2017) is the recommended backbone. FPN fuses high-resolution low-level features with semantically rich high-level features through a top-down pathway with lateral connections, giving the detector a feature pyramid that copes with a wide range of object sizes. Backbones are pre-trained on ImageNet classification before being fine-tuned on COCO.
The RPN is identical to the one in Faster R-CNN. It slides a small network over the convolutional feature map, predicting at each anchor whether the location contains an object and a coarse bounding box adjustment. With FPN, the RPN runs at every pyramid level, and anchors of different scales are assigned to the level whose stride matches their size. Roughly 1,000 to 2,000 proposals survive non-maximum suppression and are fed to the heads.
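The level-by-level anchor assignment can be illustrated with a small sketch. The strides, anchor sizes, and aspect ratios below are assumptions taken from the configuration commonly used with FPN (one scale per level, three aspect ratios), and the cell-centre convention is a simplification; actual implementations differ in these details.

```python
import math

# Assumed configuration: (stride, anchor side length in pixels) per pyramid level.
LEVELS = {"P2": (4, 32), "P3": (8, 64), "P4": (16, 128), "P5": (32, 256), "P6": (64, 512)}
ASPECT_RATIOS = (0.5, 1.0, 2.0)  # height / width, also an assumption


def anchors_at(cx, cy, size, ratios=ASPECT_RATIOS):
    """Return (x1, y1, x2, y2) anchors centred at (cx, cy), each with area size**2."""
    boxes = []
    for r in ratios:
        w = size / math.sqrt(r)
        h = size * math.sqrt(r)
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


def anchors_for_cell(level, col, row):
    """Anchors for one feature-map cell, expressed in image pixels."""
    stride, size = LEVELS[level]
    cx = (col + 0.5) * stride  # centre of the cell in the image
    cy = (row + 0.5) * stride
    return anchors_at(cx, cy, size)


if __name__ == "__main__":
    for box in anchors_for_cell("P4", col=2, row=1):
        print(tuple(round(v, 1) for v in box))
```

Coarser levels have larger strides and larger anchors, so each level is responsible for objects of roughly one size range.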
For every proposal that survives the RPN, three heads run in parallel: a classification head, a bounding-box regression head, and the new mask head that predicts a binary mask for the region.
This parallel structure differs from earlier instance segmentation systems (DeepMask, MNC, FCIS), which produced masks first and classified them afterwards. Decoupling the two tasks is one of the keys to the method working as well as it does.
The mask head is a small fully convolutional network. The standard configuration on the FPN variant is four 3x3 convolutional layers with 256 channels, followed by a 2x2 transposed convolution that upsamples to 28x28, and a 1x1 convolution that produces K binary masks (one per class). The output is treated independently per class. There is no inter-class competition because the loss applies a per-pixel sigmoid rather than a softmax across classes.
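The layer sequence above can be checked with a short shape walk-through. This is a sketch, not the reference implementation: it assumes a 14x14 RoI feature map as input (the size that a 2x upsampling turns into 28x28) and padding of 1 on the 3x3 convolutions, and the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a standard convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride):
    """Spatial output size of a transposed convolution without padding."""
    return (size - 1) * stride + kernel


def mask_head_shapes(num_classes, roi_size=14, channels=256):
    """Return (layer name, (C, H, W)) for each stage of the FPN-style mask head."""
    shapes = [("roi_features", (channels, roi_size, roi_size))]
    size = roi_size
    for i in range(4):  # four 3x3 convolutions, 256 channels each
        size = conv_out(size, kernel=3, padding=1)
        shapes.append((f"conv{i + 1}_3x3", (channels, size, size)))
    size = deconv_out(size, kernel=2, stride=2)  # 2x2 transposed conv, stride 2
    shapes.append(("deconv_2x2", (channels, size, size)))
    size = conv_out(size, kernel=1)  # 1x1 conv: one mask channel per class
    shapes.append(("mask_logits_1x1", (num_classes, size, size)))
    return shapes


if __name__ == "__main__":
    for name, shape in mask_head_shapes(num_classes=80):
        print(f"{name:>16}: {shape}")
```

With 80 COCO classes the final tensor has 80 channels of 28x28 logits per RoI, and a per-pixel sigmoid is applied to each channel independently.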
A mask resolution of 28x28 sounds tiny, but the mask is defined inside the RoI, then bilinearly upsampled and pasted back into the image during inference. For typical COCO objects the resolution is sufficient. Fine boundary details, however, are limited by this design, and later methods like PointRend specifically address it.
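A minimal sketch of the paste-back step follows. It assumes an integer pixel box, a half-pixel-centre sampling convention, and a binarisation threshold of 0.5; the function names are hypothetical and real implementations vectorise this on the GPU.

```python
def bilinear_sample(grid, y, x):
    """Sample a 2-D list at continuous (y, x), clamping to the border."""
    h, w = len(grid), len(grid[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = grid[y0][x0] * (1 - dx) + grid[y0][x1] * dx
    bottom = grid[y1][x0] * (1 - dx) + grid[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def paste_mask(mask_probs, box, image_h, image_w, threshold=0.5):
    """Resize an m x m probability mask to `box` and write it into a binary image."""
    x1, y1, x2, y2 = box  # integer pixel coordinates, x2 and y2 exclusive
    m = len(mask_probs)
    box_w, box_h = x2 - x1, y2 - y1
    out = [[0] * image_w for _ in range(image_h)]
    for py in range(max(y1, 0), min(y2, image_h)):
        for px in range(max(x1, 0), min(x2, image_w)):
            # Map the pixel centre into mask coordinates.
            my = (py + 0.5 - y1) / box_h * m - 0.5
            mx = (px + 0.5 - x1) / box_w * m - 0.5
            if bilinear_sample(mask_probs, my, mx) >= threshold:
                out[py][px] = 1
    return out


if __name__ == "__main__":
    # Toy 28x28 mask: left half foreground, right half background.
    toy = [[1.0] * 14 + [0.0] * 14 for _ in range(28)]
    image = paste_mask(toy, box=(10, 20, 66, 48), image_h=60, image_w=80)
    print(sum(map(sum, image)), "foreground pixels")
```

Because every 28x28 cell is stretched over the whole box, large objects get coarse boundaries, which is the limitation noted above.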
Faster R-CNN extracts a fixed-size feature map for every proposal using RoIPool. RoIPool quantises the floating-point RoI coordinates twice. First, the RoI itself is rounded to integer feature-map grid cells. Second, the RoI is divided into bins (typically 7x7) and the bin boundaries are again rounded to integer positions before max pooling. Each rounding can shift the pooled features by a fraction of a feature-map cell, and at a feature stride of 16 one cell spans 16 pixels in the original image. For classification this is mostly harmless. For pixel-level mask prediction it is fatal.
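The two rounding steps can be made concrete along one axis. The sketch below loosely follows the arithmetic of the original RoIPool layer (round the RoI ends, then floor and ceil the bin edges); the function names and the example coordinates are hypothetical.

```python
import math


def round_half_up(v):
    """Round to the nearest integer, with .5 going up."""
    return math.floor(v + 0.5)


def roipool_bins(x1, x2, stride=16, bins=7):
    """Quantised bin edges, in feature-map cells, for an RoI spanning [x1, x2] pixels."""
    start = round_half_up(x1 / stride)  # first quantisation: RoI ends
    end = round_half_up(x2 / stride)
    width = max(end - start + 1, 1)
    edges = []
    for i in range(bins):  # second quantisation: bin boundaries
        lo = start + math.floor(i * width / bins)
        hi = start + math.ceil((i + 1) * width / bins)
        edges.append((lo, hi))
    return start, edges


def shift_in_pixels(x1, stride=16):
    """Image-space displacement introduced by rounding the left RoI edge."""
    return abs(x1 / stride - round_half_up(x1 / stride)) * stride


if __name__ == "__main__":
    start, edges = roipool_bins(100, 300)
    print("exact left edge:", 100 / 16, "quantised:", start)
    print("bin edges:", edges)
    print("shift in image pixels:", shift_in_pixels(100))
```

For an RoI starting at pixel 100, the exact feature coordinate 6.25 becomes 6, so the pooled features are displaced by 4 pixels before the bin rounding adds its own error.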
RoIAlign removes both quantisations. The RoI coordinates are kept as floating-point values. Each bin is sampled at four regularly spaced points, and each sample is computed by bilinear interpolation from the four nearest feature-map cells. The four samples in a bin are then aggregated by max or average pooling. The operation is fully differentiable.
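A single-channel sketch of this procedure is given below. It assumes that feature values sit at integer coordinates and ignores the half-pixel offset that some libraries apply, so it illustrates the idea and is not a drop-in replacement for any library's operator.

```python
def bilinear(feat, y, x):
    """Bilinear interpolation from the four nearest cells of a 2-D list."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0][x0] * (1 - dy) * (1 - dx) + feat[y0][x1] * (1 - dy) * dx
            + feat[y1][x0] * dy * (1 - dx) + feat[y1][x1] * dy * dx)


def roi_align(feat, roi, out_size=2, samples=2, stride=1.0, reduce="avg"):
    """Pool `roi` (x1, y1, x2, y2 in image pixels) into an out_size x out_size grid."""
    x1, y1, x2, y2 = (v / stride for v in roi)  # kept as floats, no rounding
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    out = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            vals = []
            # samples x samples regularly spaced points per bin (2 x 2 = four)
            for sy in range(samples):
                for sx in range(samples):
                    y = y1 + (by + (sy + 0.5) / samples) * bin_h
                    x = x1 + (bx + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear(feat, y, x))
            row.append(max(vals) if reduce == "max" else sum(vals) / len(vals))
        out.append(row)
    return out


if __name__ == "__main__":
    ramp = [[float(x) for x in range(8)] for _ in range(8)]  # value equals x
    print(roi_align(ramp, roi=(1.2, 1.2, 5.2, 5.2)))
```

On the ramp feature map each output equals the x-coordinate of its bin centre (about 2.2 and 4.2), showing that the fractional RoI offset is preserved instead of being rounded away.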
The table below summarises the two operations.
| Operation | RoI coordinates | Bin boundaries | Sampling | Effect on masks |
|---|---|---|---|---|
| RoIPool | Quantised to grid | Quantised to grid | Max over integer cells | Sub-pixel misalignment |
| RoIAlign | Floating point | Floating point | Bilinear interpolation at sampled points | Aligned features |
In the ablations of the original paper, swapping RoIPool for RoIAlign improves mask AP by about 3 points on stride-16 (C4) features, with most of the gain at the strict IoU 0.75 threshold. On stride-32 (C5) features the gain is larger: 7.3 points of mask AP and 10.5 points of AP at IoU 0.75, about 50% relative. RoIAlign also helps box AP by roughly 1 point even though boxes were the existing strength of Faster R-CNN. The change is conceptually small and computationally cheap, yet it accounts for a large share of the method's mask accuracy.
The network is trained with a multi-task loss that sums three terms:
L = L_cls + L_box + L_mask
The classification loss L_cls is the standard log loss over object classes. The box regression loss L_box is the smooth L1 loss inherited from Fast R-CNN. The mask loss L_mask is the average binary cross-entropy applied per pixel. Crucially, L_mask is computed only on the predicted mask channel that corresponds to the ground-truth class. The other K-1 mask channels do not contribute to the loss for that RoI.
This decoupling of mask and class prediction is one of the design decisions the paper highlights. If a softmax across classes were used at the mask output, the model would be forced to choose between classes during mask prediction, even though the classification head is already deciding that. Removing the inter-class competition gives the classifier room to drive class assignment while the mask head focuses on shape. The ablation in the paper shows about a 5.5 AP gain from this choice over a per-pixel softmax equivalent.
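The per-class selection in the mask loss can be sketched in a few lines. The code below is an illustration for a single RoI using plain Python lists; the function names are hypothetical, and the numerically stable form of binary cross-entropy on logits is a standard identity, not something specific to the paper.

```python
import math


def bce_with_logit(z, t):
    """Binary cross-entropy between sigmoid(z) and target t, in a stable form."""
    return max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))


def mask_loss(mask_logits, gt_class, gt_mask):
    """Average per-pixel BCE on the ground-truth class channel only.

    mask_logits: K x m x m nested lists of logits for one RoI.
    gt_mask:     m x m nested list of 0/1 targets.
    """
    logits = mask_logits[gt_class]  # the other K-1 channels are ignored
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits, gt_mask):
        for z, t in zip(row_logits, row_targets):
            total += bce_with_logit(z, t)
            count += 1
    return total / count


if __name__ == "__main__":
    gt = [[1, 0], [0, 1]]
    logits = [[[0.0, 0.0], [0.0, 0.0]],    # channel for class 0
              [[5.0, -5.0], [3.0, 3.0]]]   # channel for class 1
    print(mask_loss(logits, gt_class=0, gt_mask=gt))
```

Changing the logits of channel 1 leaves the loss for a class-0 RoI untouched, which is exactly the absence of inter-class competition described above.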
The COCO training recipe in the original paper is straightforward.
| Setting | Value |
|---|---|
| Optimiser | SGD with momentum 0.9 |
| Weight decay | 1e-4 |
| Batch size | 16 (8 GPUs, 2 images per GPU) |
| Base learning rate | 0.02 |
| LR schedule | Divided by 10 at 120k iterations |
| Total iterations | 160k on COCO trainval35k |
| Image scale | Shorter side 800 px |
| Augmentation | Horizontal flip only |
| Backbone | ImageNet pre-trained, BN frozen |
Batch normalisation statistics in the backbone are frozen because the per-GPU batch size is too small to estimate them reliably; later codebases like Detectron2 add Group Normalisation or Synchronised BatchNorm options to allow training from scratch. Mask R-CNN runs at roughly 5 frames per second at inference on a Tesla M40 GPU with a ResNet-101-FPN backbone, which made it tractable for research experimentation but not real-time.
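The learning-rate schedule from the recipe table can be written as a small function. This is a sketch: the decay factor of 0.1 is stated here as an assumption about the step size, and warmup, which later codebases add, is deliberately left out.

```python
def learning_rate(iteration, base_lr=0.02, decay_at=120_000,
                  total=160_000, gamma=0.1):
    """Step schedule: base_lr until `decay_at`, then base_lr * gamma."""
    if not 0 <= iteration < total:
        raise ValueError("iteration outside the training schedule")
    return base_lr * gamma if iteration >= decay_at else base_lr


def images_seen(total=160_000, batch_size=16):
    """Number of training images processed over the whole schedule."""
    return total * batch_size


if __name__ == "__main__":
    for it in (0, 119_999, 120_000, 159_999):
        print(it, learning_rate(it))
    print("images seen:", images_seen())
```

With a batch size of 16, the 160k-iteration schedule processes 2.56 million training images in total.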
The original ICCV 2017 paper reports the following on COCO test-dev for instance segmentation. These are the canonical numbers cited in follow-up work.
| Backbone | Mask AP | AP_50 | AP_75 | Box AP |
|---|---|---|---|---|
| ResNet-101-C4 | 33.1 | 54.9 | 34.8 | - |
| ResNet-101-FPN | 35.7 | 58.0 | 37.8 | 38.2 |
| ResNeXt-101-FPN | 37.1 | 60.0 | 39.4 | 39.8 |
At release these numbers were state of the art on COCO instance segmentation, beating the COCO 2016 challenge winners by clear margins. The bounding-box numbers were also state of the art for object detection at the time, even though Mask R-CNN was not designed as a pure detector. The paper attributes part of the box gain to multi-task training with the mask branch.
On the Cityscapes instance segmentation benchmark for the person and vehicle categories, Mask R-CNN reaches 26.2 AP using fine annotations only and 32.0 AP after pre-training on COCO, exceeding the previous best by about 5.5 AP. For the person category specifically, the paper reports 30.5 AP with fine data only and 34.8 AP with COCO pre-training.
Detectron2 has since reproduced and improved these numbers using better training schedules, learning-rate warmup, and updated backbones. The Detectron2 model zoo currently lists the following baselines on COCO val2017.
| Model | Schedule | Box AP | Mask AP |
|---|---|---|---|
| Mask R-CNN R50-FPN | 1x | 38.6 | 35.2 |
| Mask R-CNN R50-FPN | 3x | 41.0 | 37.2 |
| Mask R-CNN R101-FPN | 3x | 42.9 | 38.6 |
| Mask R-CNN X101-32x8d-FPN | 3x | 44.3 | 39.5 |
| Cascade Mask R-CNN R50-FPN | 3x | 44.3 | 38.5 |
These are higher than the original paper because of longer schedules, BN tweaks, and small architectural updates rather than a fundamental change.
The same paper also presents Keypoint R-CNN, where the mask head is replaced with a head that predicts K one-hot heatmaps, one per body keypoint. Each heatmap is a 56x56 grid where the model is trained with a softmax to localise the keypoint at exactly one cell. On COCO test-dev for human keypoints, the model reaches 62.7 keypoint AP using ResNet-50-FPN, which was state of the art at the time and outperformed the COCO 2016 keypoint challenge winner.
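The one-hot formulation amounts to a softmax classification over all cells of the heatmap. The sketch below computes that cross-entropy for a single keypoint; the function name is hypothetical and the code handles one heatmap at a time for clarity.

```python
import math


def keypoint_loss(logits, target_yx):
    """Cross-entropy of a softmax over all m x m cells against a one-hot target.

    logits:    m x m nested list of heatmap logits for one keypoint.
    target_yx: (row, col) of the cell containing the ground-truth keypoint.
    """
    width = len(logits[0])
    flat = [z for row in logits for z in row]
    target_index = target_yx[0] * width + target_yx[1]
    peak = max(flat)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in flat))
    return log_sum_exp - flat[target_index]


if __name__ == "__main__":
    uniform = [[0.0] * 56 for _ in range(56)]
    print(keypoint_loss(uniform, (10, 20)))  # log(56 * 56) for a flat heatmap
```

A flat 56x56 heatmap gives a loss of log(3136), and the loss approaches zero as the probability mass concentrates on the target cell.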
Mask R-CNN turned out to be a productive base architecture. Many of the most cited instance-segmentation papers from 2018 to 2021 are direct extensions.
Cai and Vasconcelos extended their Cascade R-CNN (CVPR 2018) to instance segmentation, producing Cascade Mask R-CNN. The single detection head is replaced by a sequence of three detection heads trained with progressively higher IoU thresholds (0.5, 0.6, 0.7). The output of one stage seeds the next, and at inference the predictions are averaged. This typically adds 2 to 3 box AP and 1 to 2 mask AP for a moderate compute cost.
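The effect of the rising thresholds on training labels can be sketched as follows. The code only shows how a fixed box would be labelled at each stage; in the actual cascade the box is refined between stages, so its IoU generally increases. The function names are hypothetical.

```python
STAGE_THRESHOLDS = (0.5, 0.6, 0.7)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def stage_labels(proposal, gt_box, thresholds=STAGE_THRESHOLDS):
    """True where the proposal counts as a positive example for that stage."""
    overlap = iou(proposal, gt_box)
    return [overlap >= t for t in thresholds]


if __name__ == "__main__":
    gt = (0.0, 0.0, 10.0, 10.0)
    proposal = (0.0, 0.0, 10.0, 6.5)  # IoU with gt is 0.65
    print(iou(proposal, gt), stage_labels(proposal, gt))
```

A proposal with IoU 0.65 is a positive for the first two heads but a negative for the third, which is why later stages specialise in higher-quality boxes.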
Chen et al. (CVPR 2019) proposed Hybrid Task Cascade (HTC), which improves on Cascade Mask R-CNN in two ways. First, the detection and segmentation cascades are interleaved rather than treated as separate parallel cascades; the mask features at stage t flow into stage t+1. Second, an auxiliary semantic segmentation branch provides global context. HTC was the basis of the COCO 2018 winning entry and reaches 48.6 mask AP on COCO test-challenge with appropriate backbones. It later served as a strong starting point for the LVIS Challenge 2019.
Huang et al. (CVPR 2019) noted that COCO AP rewards calibration, and Mask R-CNN scores its masks with the classification confidence rather than a measure of mask quality. Mask Scoring R-CNN adds a small head that predicts the IoU between the predicted mask and the ground truth. The IoU prediction is multiplied with the class score at inference, producing better-calibrated rankings and a consistent 1 to 2 AP gain.
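The rescoring step is simple enough to sketch directly. In the code below the dictionary keys `cls_score` and `pred_iou` are hypothetical names for the classification confidence and the MaskIoU head output, and `mask_iou` shows the quantity that head is trained to regress.

```python
def mask_iou(pred, gt):
    """IoU between two binary masks given as nested lists of 0/1."""
    inter = sum(p & g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    union = sum(p | g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    return inter / union if union else 0.0


def rescore(detections):
    """Multiply class score by predicted mask IoU and sort by the product."""
    rescored = [dict(d, mask_score=d["cls_score"] * d["pred_iou"])
                for d in detections]
    return sorted(rescored, key=lambda d: d["mask_score"], reverse=True)


if __name__ == "__main__":
    detections = [
        {"name": "A", "cls_score": 0.9, "pred_iou": 0.5},  # confident, poor mask
        {"name": "B", "cls_score": 0.8, "pred_iou": 0.9},  # less confident, good mask
    ]
    for d in rescore(detections):
        print(d["name"], round(d["mask_score"], 2))
```

Detection B overtakes A after rescoring, so the ranking used by the AP metric now reflects mask quality as well as class confidence.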
Kirillov, Wu, He, and Girshick (CVPR 2020) framed segmentation as a rendering problem. PointRend replaces the 28x28 mask head with an iterative point-based predictor that subdivides the mask, picks ambiguous boundary points, and re-evaluates them at higher resolution. The result is much sharper boundaries with little extra compute, and PointRend has become a common drop-in replacement for the standard mask head in Detectron2 pipelines.
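The point-selection step can be illustrated with a toy example. The sketch assumes that "ambiguous" means a foreground probability close to 0.5 and shows only the selection, not the upsampling or the re-prediction with the point head; the function name is hypothetical.

```python
def most_uncertain_points(probs, n):
    """Return the (row, col) positions of the n probabilities closest to 0.5."""
    scored = [(abs(p - 0.5), y, x)
              for y, row in enumerate(probs)
              for x, p in enumerate(row)]
    scored.sort()  # smallest distance to 0.5 first
    return [(y, x) for _, y, x in scored[:n]]


if __name__ == "__main__":
    coarse = [
        [0.01, 0.45, 0.99],
        [0.02, 0.60, 0.97],
    ]
    print(most_uncertain_points(coarse, n=2))
```

Only the selected cells, which tend to lie along the object boundary, are re-evaluated at the next resolution, which keeps the extra compute small.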
Liu et al. (CVPR 2018) proposed the Path Aggregation Network (PANet), which adds a bottom-up path on top of FPN to shorten the information path from low-level features to RoI features. It improved instance segmentation by about 2 mask AP and won the COCO 2017 instance-segmentation challenge.
The Mask R-CNN architecture is modular, so the community has published several head variants.
| Head | Output | Paper |
|---|---|---|
| Mask | 28x28 binary mask per class | He et al. 2017 |
| Keypoint | 56x56 one-hot heatmap per keypoint | He et al. 2017 |
| DensePose | Dense surface coordinate prediction | Güler et al. 2018 |
| PointRend | High-resolution iterative mask | Kirillov et al. 2020 |
| Boundary-aware | Mask plus boundary supervision | Various |
Several open-source implementations are available.
| Implementation | Framework | Maintained by | Notes |
|---|---|---|---|
| Detectron2 | PyTorch | Meta AI / FAIR | Reference implementation, model zoo with 1x, 3x, Cascade, PointRend |
| Detectron (legacy) | Caffe2 | Facebook AI Research | Original public release; deprecated in favour of Detectron2 |
| MMDetection | PyTorch | OpenMMLab | Wide model coverage, configs for HTC, Cascade Mask R-CNN, Mask Scoring R-CNN |
| torchvision.models.detection.maskrcnn_resnet50_fpn | PyTorch | PyTorch / Meta | One-line load with COCO pretrained weights |
| Matterport Mask R-CNN | TensorFlow / Keras | Community | Popular early TF1 reimplementation, no longer maintained |
| TensorFlow Object Detection API | TensorFlow | Google | Includes Mask R-CNN configs |
The torchvision wrapper exposes the simplest API. The model takes a list of float tensors in the [0, 1] range and returns dictionaries with boxes, labels, scores, and masks per image, and it can be exported to ONNX for fixed input sizes.
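For all its impact, Mask R-CNN has known weaknesses. It is not real-time, running at roughly 5 frames per second in the original setup. The 28x28 mask resolution limits fine boundary detail. Masks are ranked by classification confidence rather than by mask quality. And the two-stage, anchor-based pipeline relies on hand-designed anchors and non-maximum suppression.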
For most of 2017 to 2021, Mask R-CNN was the reference baseline for instance segmentation. Almost every new method reported improvements relative to it. RoIAlign in particular was adopted across the field, including in object detection systems that did not perform mask prediction.
The set-prediction wave that followed eventually displaced two-stage detectors at the top of the COCO leaderboard. DETR (Carion et al., ECCV 2020) reframed detection as direct set prediction with a transformer. MaskFormer (Cheng et al., 2021) and Mask2Former (Cheng et al., CVPR 2022) extended this to instance, panoptic, and semantic segmentation under a single mask-classification framework. Mask2Former reaches 50.1 mask AP on COCO, well above the 37 to 39 range of standard Mask R-CNN. Foundation models like Segment Anything (Kirillov et al., 2023) added prompt-conditioned segmentation trained on roughly a billion masks, expanding the task in a different direction.
Despite these advances, Mask R-CNN remains common in production and in domain-specific applications. It is well understood, robust to fine-tuning on small datasets, and its training and deployment cost is predictable. For researchers entering instance segmentation it is still the typical first model to run.
| Method | Year | Paradigm | Backbone (typical) | COCO mask AP | Notes |
|---|---|---|---|---|---|
| Mask R-CNN | 2017 | Two-stage, anchor-based | ResNeXt-101-FPN | 37.1 (orig.) / ~39.5 (D2) | RoIAlign, parallel mask head |
| YOLACT | 2019 | Single-stage, prototype masks | ResNet-101 | 29.8 at 33 fps | Real-time, prototype + coefficients |
| Mask Scoring R-CNN | 2019 | Two-stage, calibrated | ResNet-101-FPN | ~38.3 | Adds MaskIoU head |
| Cascade Mask R-CNN | 2018 | Two-stage cascade | ResNet-101-FPN | ~38.5 | Multi-stage IoU thresholds |
| HTC | 2019 | Two-stage cascade, interleaved | ResNeXt-101-FPN | 48.6 (test-challenge) | COCO 2018 winning baseline |
| SOLOv2 | 2020 | Single-stage, dynamic conv | ResNet-101 | 39.7 | Light version 37.1 at 31 fps |
| DETR (panoptic) | 2020 | Set prediction, transformer | ResNet-101 | ~36 mask | Removes anchors and NMS |
| MaskFormer | 2021 | Set prediction, mask classification | Swin-L | 47.2 | Unified instance + semantic |
| Mask2Former | 2022 | Masked-attention transformer | Swin-L | 50.1 | State of the art on multiple tasks |
| Segment Anything (SAM) | 2023 | Prompt-conditioned, foundation model | ViT-H | Promptable, not directly comparable | Trained on 1B masks |