Mask R-CNN

Mask R-CNN is a deep convolutional neural network for instance segmentation, introduced in 2017 by Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick at Facebook AI Research (FAIR).^[1] It extends the Faster R-CNN two-stage object detector by adding a third parallel head that predicts a binary segmentation mask for each region of interest, so a single network simultaneously detects objects and produces a pixel-accurate mask for every instance.^[1]^[3] Its two defining contributions are the parallel mask branch and RoIAlign, a quantization-free feature-sampling layer that aligns extracted features with the input, improving mask accuracy by a relative 10% to 50%.^[1] The paper describes the method as "a conceptually simple, flexible, and general framework for object instance segmentation."^[1]

Presented at the 2017 IEEE International Conference on Computer Vision (ICCV) in Venice (October 22-29, 2017), Mask R-CNN won the Best Paper Award, also known as the Marr Prize.^[1]^[23] It adds only a small overhead to Faster R-CNN, runs at about 5 frames per second, and at publication achieved top results across all three tracks of the COCO challenge suite: instance segmentation, bounding-box object detection, and person keypoint detection.^[1] The same architecture, with the mask head swapped for a heatmap head, also produced state-of-the-art human keypoint detection.^[1] FAIR released Mask R-CNN as open source, first in the Detectron framework (Caffe2) and later in Detectron2 (PyTorch), where it served for years as the default baseline for instance segmentation research. Its central contribution, RoIAlign, is now standard machinery in two-stage detectors.^[20]

How is Mask R-CNN related to R-CNN and Faster R-CNN?

Mask R-CNN is the fourth model in a lineage of region-based convolutional networks from Ross Girshick and collaborators. Each step in the family addresses a bottleneck in the previous one.

Model	Year	Authors	Key idea	Remaining bottleneck
R-CNN	2014	Girshick, Donahue, Darrell, Malik	Run a CNN on each of ~2,000 selective-search proposals; SVMs classify	Per-region forward pass
SPPnet	2014	He, Zhang, Ren, Sun	Spatial pyramid pooling shares conv features across proposals	Multi-stage training
Fast R-CNN	2015	Girshick	Single shared CNN, RoI pooling, single-stage training, end-to-end except proposals	Selective search proposals
Faster R-CNN	2015	Ren, He, Girshick, Sun	Region Proposal Network (RPN) replaces selective search; trained jointly	Coarse RoIPool quantisation
Mask R-CNN	2017	He, Gkioxari, Dollar, Girshick	Adds parallel mask branch; replaces RoIPool with RoIAlign	Two-stage, anchor-based

Faster R-CNN provided the detection scaffolding.^[3] The contribution of Mask R-CNN was to show that pixel-level instance segmentation could be added cheaply, by reusing the same RoI features the box head already consumed, and that doing so even improved bounding-box accuracy.^[1] The paper also showed that the architecture generalises beyond masks: replacing the mask branch with a keypoint heatmap head yields a competitive human pose estimator (Keypoint R-CNN),^[1] and replacing it with a dense surface-coordinate predictor yields DensePose.^[19]

How does the Mask R-CNN architecture work?

Mask R-CNN keeps every component of Faster R-CNN intact and bolts on a small fully convolutional mask head.^[1]^[3]

Backbone

The backbone extracts features from the input image. The original paper experiments with three options.^[1]

Backbone	Notes
ResNet-50 / ResNet-101 (C4)	Faster R-CNN style, head at the C4 layer
ResNet-50-FPN / ResNet-101-FPN	Multi-scale features via Feature Pyramid Network
ResNeXt-101-FPN	Best accuracy in the paper, used for headline numbers

The Feature Pyramid Network (FPN) of Lin et al. (CVPR 2017) is the recommended backbone.^[6] FPN fuses high-resolution low-level features with semantically rich high-level features through a top-down pathway with lateral connections, giving the detector a feature pyramid in which both small and large objects are represented at a suitable resolution.^[6] Backbones are pre-trained on ImageNet classification before being fine-tuned on COCO.^[1]^[7]

Region Proposal Network

The RPN is identical to the one in Faster R-CNN.^[3] It slides a small network over the convolutional feature map, predicting at each anchor whether the location contains an object and a coarse bounding box adjustment.^[3] With FPN, the RPN runs at every pyramid level, and anchors of different scales are assigned to the level whose stride matches their size.^[6] Roughly 1,000 to 2,000 proposals survive non-maximum suppression and are fed to the heads.^[1]
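As a rough illustration of the scale-to-level assignment, the sketch below enumerates one anchor size per pyramid level with three aspect ratios. The specific sizes (32 to 512 pixels on levels P2 to P6), the strides, and the aspect ratios are the defaults reported in the FPN paper rather than values given in the text above, so treat them as assumptions; the function names are illustrative.

```python
import math

# Assumed FPN defaults (from the FPN paper, not stated in the text above):
# one anchor scale per pyramid level, three aspect ratios at each location.
LEVEL_STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32, "P6": 64}
LEVEL_SIZES = {"P2": 32, "P3": 64, "P4": 128, "P5": 256, "P6": 512}
ASPECT_RATIOS = (0.5, 1.0, 2.0)  # height / width


def anchors_for_level(level):
    """Return (width, height) pairs that all have area size**2."""
    size = LEVEL_SIZES[level]
    shapes = []
    for ratio in ASPECT_RATIOS:
        width = size / math.sqrt(ratio)
        height = size * math.sqrt(ratio)
        shapes.append((width, height))
    return shapes


def num_anchors(level, image_h, image_w):
    """Count anchors on one level for an image of the given size."""
    stride = LEVEL_STRIDES[level]
    cells = math.ceil(image_h / stride) * math.ceil(image_w / stride)
    return cells * len(ASPECT_RATIOS)


for level in LEVEL_STRIDES:
    print(level, anchors_for_level(level)[1], num_anchors(level, 800, 1344))
```

Coarser levels carry larger anchors on fewer grid cells, which is why most anchors come from the finest level.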

Three parallel heads

For every proposal that survives the RPN, three heads run in parallel:

A classification head that outputs a softmax over object classes plus background.
A class-specific bounding-box regression head that refines the coordinates.
A mask head that outputs a small binary mask for each class.

This parallel structure differs from earlier instance segmentation systems such as DeepMask, MNC, and FCIS, in which classification depended on the predicted masks. Decoupling the two tasks is one of the keys to the method working as well as it does.^[1]
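A minimal sketch of how the three outputs might be combined for a single RoI at inference time: the classification head picks the class, and only that class's box refinement and mask channel are kept. The toy shapes (three classes, 2x2 masks) and the name `select_outputs` are illustrative assumptions, not part of any library.

```python
def select_outputs(class_scores, box_deltas, mask_probs):
    """Combine the three head outputs for a single RoI.

    class_scores: list of K + 1 softmax scores, index 0 is background.
    box_deltas:   list of K four-element refinements, one per class.
    mask_probs:   list of K masks (each a 2-D list of probabilities).
    """
    # Pick the best foreground class; index 0 (background) is skipped.
    best = max(range(1, len(class_scores)), key=lambda c: class_scores[c])
    k = best - 1  # per-class outputs have no background slot
    return {
        "label": best,
        "score": class_scores[best],
        "box_delta": box_deltas[k],
        "mask": mask_probs[k],
    }


scores = [0.05, 0.10, 0.80, 0.05]
deltas = [[0.0] * 4, [0.1, -0.2, 0.0, 0.3], [0.0] * 4]
masks = [
    [[0.1, 0.1], [0.1, 0.1]],
    [[0.9, 0.8], [0.2, 0.1]],
    [[0.5, 0.5], [0.5, 0.5]],
]
result = select_outputs(scores, deltas, masks)
print(result["label"], result["mask"])
```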

Mask head

The mask head is a small fully convolutional network. The standard configuration on the FPN variant is four 3x3 convolutional layers with 256 channels, followed by a 2x2 transposed convolution that upsamples to 28x28, and a 1x1 convolution that produces K binary masks (one per class).^[1] The output is treated independently per class. There is no inter-class competition because the loss applies a per-pixel sigmoid rather than a softmax across classes.^[1]
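The layer sequence can be checked with a small shape calculation. This is a sketch under stated assumptions: the 14x14 RoIAlign input size and the padding of 1 on the 3x3 convolutions come from the paper's FPN head rather than the text above, and K = 80 is the number of COCO classes.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a standard convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a transposed convolution."""
    return (size - 1) * stride - 2 * padding + kernel


def mask_head_shapes(roi_size=14, channels=256, num_classes=80):
    """Trace (channels, height, width) through the FPN mask head."""
    shapes = [(channels, roi_size, roi_size)]
    size = roi_size
    for _ in range(4):  # four 3x3 convs, padding 1 keeps the size
        size = conv_out(size, kernel=3, padding=1)
        shapes.append((channels, size, size))
    size = deconv_out(size, kernel=2, stride=2)  # 2x2 transposed conv
    shapes.append((channels, size, size))
    size = conv_out(size, kernel=1)  # 1x1 conv to K mask logits
    shapes.append((num_classes, size, size))
    return shapes


for shape in mask_head_shapes():
    print(shape)
```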

A mask resolution of 28x28 sounds tiny, but the mask is defined inside the RoI, then bilinearly upsampled and pasted back into the image during inference.^[1] For typical COCO objects the resolution is sufficient. Fine boundary details, however, are limited by this design, and later methods like PointRend specifically address it.^[12]
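The resize-and-paste step can be sketched in plain Python. The 0.5 binarisation threshold follows the paper; the pixel-centre alignment used for the bilinear resize and the helper names are assumptions made for illustration, and production code handles sub-pixel box coordinates more carefully.

```python
def bilinear_resize(mask, out_h, out_w):
    """Resize a 2-D list of floats using pixel-centre alignment."""
    in_h, in_w = len(mask), len(mask[0])
    out = []
    for i in range(out_h):
        # Map the output pixel centre back into input coordinates.
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = mask[y0][x0] * (1 - wx) + mask[y0][x1] * wx
            bottom = mask[y1][x0] * (1 - wx) + mask[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def paste_mask(mask, box, image_h, image_w, threshold=0.5):
    """Paste a small RoI mask into a full-size binary image mask.

    box is (x0, y0, x1, y1) in integer pixel coordinates.
    """
    x0, y0, x1, y1 = box
    resized = bilinear_resize(mask, y1 - y0, x1 - x0)
    canvas = [[0] * image_w for _ in range(image_h)]
    for i, row in enumerate(resized):
        for j, prob in enumerate(row):
            yy, xx = y0 + i, x0 + j
            if 0 <= yy < image_h and 0 <= xx < image_w:
                canvas[yy][xx] = int(prob >= threshold)
    return canvas


# Toy example: a 2x2 "mask" whose left column is foreground.
small = [[1.0, 0.0], [1.0, 0.0]]
full = paste_mask(small, box=(1, 1, 5, 3), image_h=4, image_w=6)
for row in full:
    print(row)
```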

What is RoIAlign and why does it matter?

RoIAlign is the key technical contribution of Mask R-CNN. Faster R-CNN extracts a fixed-size feature map for every proposal using RoIPool.^[3] RoIPool quantises the floating-point RoI coordinates twice. First, the RoI itself is rounded to integer feature-map grid cells. Second, the RoI is divided into bins (typically 7x7) and the bin boundaries are again rounded to integer positions before max pooling.^[1] On a stride-16 feature map one grid cell covers 16 image pixels, so each rounding step can shift the pooled features by several pixels in the original image. Classification is largely robust to such small shifts, but pixel-accurate mask prediction is hurt badly by them.^[1]
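The size of the misalignment from the first rounding step alone can be worked out directly. The sketch below assumes a stride of 16 and rounding to the nearest grid index; the helper names are hypothetical.

```python
def roipool_quantise(coord, stride=16):
    """Map an image coordinate to a feature-map index the RoIPool way."""
    return round(coord / stride)


def misalignment_px(coord, stride=16):
    """Image-space offset introduced by rounding to the feature grid."""
    return roipool_quantise(coord, stride) * stride - coord


for x in (100.0, 107.0, 125.0):
    # RoIAlign would keep the continuous value x / 16 instead of rounding.
    print(x, x / 16, roipool_quantise(x), misalignment_px(x))
```

A box edge at x = 107 lands on grid index 7, which corresponds to pixel 112, so the pooled features are displaced by 5 pixels before the bin boundaries are rounded a second time.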

RoIAlign removes both quantisations. The RoI coordinates are kept as floating-point values. Each bin is sampled at four regularly spaced points, and each sample is computed by bilinear interpolation from the four nearest feature-map cells. The four samples in a bin are then aggregated by max or average pooling. The operation is fully differentiable.^[1] The authors describe RoIAlign as "a simple, quantization-free layer that faithfully preserves exact spatial locations."^[1]

The table below summarises the two operations.

Operation	RoI coordinates	Bin boundaries	Sampling	Effect on masks
RoIPool	Quantised to grid	Quantised to grid	Max over integer cells	Sub-pixel misalignment
RoIAlign	Floating point	Floating point	Bilinear interpolation at sampled points	Aligned features
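The RoIAlign sampling scheme can be sketched for a single-channel feature map in plain Python. This is a simplified illustration, not a library implementation: it assumes that feature cell (i, j) sits at continuous coordinate (i, j), uses average pooling over 2x2 = 4 sample points per bin, and ignores the half-pixel offset conventions that real implementations differ on.

```python
def bilinear(feature, y, x):
    """Bilinearly interpolate a 2-D feature map at a continuous (y, x)."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = feature[y0][x0] * (1 - wx) + feature[y0][x1] * wx
    bottom = feature[y1][x0] * (1 - wx) + feature[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def roi_align(feature, roi, out_size=2, samples=2, stride=16):
    """RoIAlign for one RoI on a single-channel feature map.

    roi is (x0, y0, x1, y1) in image pixels; it is scaled by 1/stride
    but never rounded. Each bin averages samples x samples points.
    """
    x0, y0, x1, y1 = (c / stride for c in roi)
    bin_h = (y1 - y0) / out_size
    bin_w = (x1 - x0) / out_size
    output = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    # Regularly spaced sample points inside the bin.
                    y = y0 + (by + (sy + 0.5) / samples) * bin_h
                    x = x0 + (bx + (sx + 0.5) / samples) * bin_w
                    total += bilinear(feature, y, x)
            row.append(total / (samples * samples))
        output.append(row)
    return output


# A horizontal ramp: the value of each cell equals its column index.
ramp = [[float(j) for j in range(5)] for _ in range(5)]
aligned = roi_align(ramp, roi=(24, 24, 56, 56))
print(aligned)
```

Because the ramp's value equals the x coordinate, each output bin recovers the exact centre of the bin on the feature map (2.0 and 3.0), even though the RoI edges fall at the fractional positions 1.5 and 3.5.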

In the ablations of the original paper, swapping RoIPool for RoIAlign improves mask AP by about 3 points, and the stricter AP at IoU 0.75 improves by considerably more, around 50% relative on stride-32 features; overall the paper reports RoIAlign raising mask accuracy by a relative 10% to 50%.^[1] RoIAlign also helps box AP by roughly 1 point even though boxes were the existing strength of Faster R-CNN.^[1] The change is conceptually small and computationally cheap, yet it accounts for a large share of Mask R-CNN's mask-accuracy gain.

How is Mask R-CNN trained?

The network is trained with a multi-task loss that sums three terms:

L = L_cls + L_box + L_mask

The classification loss L_cls is the standard log loss over object classes. The box regression loss L_box is the smooth L1 loss inherited from Fast R-CNN.^[4] The mask loss L_mask is the average binary cross-entropy applied per pixel. Crucially, L_mask is computed only on the predicted mask channel that corresponds to the ground-truth class. The other K-1 mask channels do not contribute to the loss for that RoI.^[1]

This decoupling of mask and class prediction is one of the design decisions the paper highlights.^[1] If a softmax across classes were used at the mask output, the model would be forced to choose between classes during mask prediction, even though the classification head is already deciding that. Removing the inter-class competition gives the classifier room to drive class assignment while the mask head focuses on shape. The ablation in the paper shows about a 5.5 AP gain from this choice over a per-pixel softmax equivalent.^[1]
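The per-class sigmoid loss can be sketched as follows. Only L_mask is shown; the function names and the toy 2x2 masks are illustrative assumptions.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy for one pixel."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def mask_loss(mask_logits, gt_class, gt_mask):
    """Average per-pixel BCE on the ground-truth class channel only.

    mask_logits: K masks of raw logits, each a 2-D list (m x m).
    gt_class:    index in [0, K) of the ground-truth class.
    gt_mask:     2-D list of 0/1 targets with the same m x m shape.
    """
    channel = mask_logits[gt_class]  # the other K - 1 channels are ignored
    total, count = 0.0, 0
    for logit_row, target_row in zip(channel, gt_mask):
        for logit, target in zip(logit_row, target_row):
            total += bce_with_logits(logit, target)
            count += 1
    return total / count


target = [[1, 0], [1, 0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
other_a = [[5.0, -5.0], [5.0, -5.0]]
other_b = [[-9.0, 9.0], [-9.0, 9.0]]

# Changing the non-ground-truth channel leaves the loss untouched.
loss_a = mask_loss([zeros, other_a], gt_class=0, gt_mask=target)
loss_b = mask_loss([zeros, other_b], gt_class=0, gt_mask=target)
print(loss_a, loss_b)
```

With all-zero logits every pixel has probability 0.5, so both calls return log 2, regardless of what the other class channel predicts.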

The COCO training recipe in the original paper is straightforward.^[1]

Setting	Value
Optimiser	SGD with momentum 0.9
Weight decay	1e-4
Batch size	16 (8 GPUs, 2 images per GPU)
Base learning rate	0.02
LR schedule	Divided by 10 at 120k iterations
Total iterations	160k on COCO trainval35k
Image scale	Shorter side 800 px
Augmentation	Horizontal flip only
Backbone	ImageNet pre-trained, BN frozen
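The schedule in the table amounts to a single step function. In the sketch below the decay factor of 0.1 is taken from the paper's description (the rate is divided by 10 at 120k iterations); the function and constant names are illustrative.

```python
def learning_rate(iteration, base_lr=0.02, decay_step=120_000, gamma=0.1):
    """Step schedule: base_lr until decay_step, then base_lr * gamma."""
    return base_lr * gamma if iteration >= decay_step else base_lr


TOTAL_ITERATIONS = 160_000
IMAGES_PER_BATCH = 8 * 2  # 8 GPUs x 2 images per GPU

print(learning_rate(0), learning_rate(119_999), learning_rate(120_000))
print("images seen:", TOTAL_ITERATIONS * IMAGES_PER_BATCH)
```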

Batch normalisation statistics in the backbone are frozen because the per-GPU batch size is too small to estimate them reliably; later codebases like Detectron2 add Group Normalisation or Synchronised BatchNorm options to allow training from scratch.^[20] Mask R-CNN runs at roughly 5 frames per second at inference on a Tesla M40 GPU with a ResNet-101-FPN backbone, which made it tractable for research experimentation but not real-time.^[1]

What accuracy does Mask R-CNN achieve on COCO?

The original ICCV 2017 paper reports the following on COCO test-dev for instance segmentation. These are the canonical numbers cited in follow-up work.^[1]^[7]

Backbone	Mask AP	AP_50	AP_75	Box AP
ResNet-101-C4	33.1	54.9	34.8	-
ResNet-101-FPN	35.7	58.0	37.8	38.2
ResNeXt-101-FPN	37.1	60.0	39.4	39.8

At release these numbers were state of the art on COCO instance segmentation, beating the COCO 2016 challenge winners by clear margins.^[1] The bounding-box numbers were also state of the art for object detection at the time, even though Mask R-CNN was not designed as a pure detector. The paper attributes part of this box gain to multi-task training: supervising the mask branch also improved the box predictions.^[1]

On the Cityscapes instance segmentation benchmark for the person and vehicle categories, Mask R-CNN reaches 26.2 AP using fine annotations only and 32.0 AP after pre-training on COCO, exceeding the previous best by about 5.5 AP.^[1] For the person category specifically, the paper reports 30.5 AP with fine data only and 34.8 AP with COCO pre-training.^[1]

Detectron2 has since reproduced and improved these numbers using better training schedules, learning-rate warmup, and updated backbones.^[20] The Detectron2 model zoo currently lists the following baselines on COCO val2017.^[20]

Model	Schedule	Box AP	Mask AP
Mask R-CNN R50-FPN	1x	38.6	35.2
Mask R-CNN R50-FPN	3x	41.0	37.2
Mask R-CNN R101-FPN	3x	42.9	38.6
Mask R-CNN X101-32x8d-FPN	3x	44.3	39.5
Cascade Mask R-CNN R50-FPN	3x	44.3	38.5

These are higher than the original paper because of longer schedules, BN tweaks, and small architectural updates rather than a fundamental change.

Keypoint R-CNN

The same paper also presents Keypoint R-CNN, where the mask head is replaced with a head that predicts K one-hot heatmaps, one per body keypoint.^[1] Each heatmap is a 56x56 grid where the model is trained with a softmax to localise the keypoint at exactly one cell.^[1] On COCO test-dev for human keypoints, the model reaches 62.7 keypoint AP using ResNet-50-FPN, which was state of the art at the time and outperformed the COCO 2016 keypoint challenge winner.^[1]
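The keypoint objective differs from the mask objective in that the softmax runs over all cells of one heatmap rather than independently per pixel. A minimal sketch for a single keypoint, with hypothetical function names:

```python
import math


def keypoint_loss(heatmap_logits, target_yx):
    """Cross-entropy of a softmax taken over every cell of one heatmap.

    heatmap_logits: 2-D list of raw scores (56 x 56 in the paper).
    target_yx:      (row, col) of the single ground-truth cell.
    """
    flat = [v for row in heatmap_logits for v in row]
    peak = max(flat)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in flat))
    ty, tx = target_yx
    return log_norm - heatmap_logits[ty][tx]


def predicted_keypoint(heatmap_logits):
    """Return the (row, col) of the highest-scoring cell."""
    width = len(heatmap_logits[0])
    flat = [v for row in heatmap_logits for v in row]
    best = max(range(len(flat)), key=flat.__getitem__)
    return divmod(best, width)


size = 56
uniform = [[0.0] * size for _ in range(size)]
print(keypoint_loss(uniform, (10, 20)))  # equals log(56 * 56)
```

An untrained, uniform heatmap therefore starts at a loss of log(3136), and the loss falls as probability mass concentrates on the target cell.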

What are the main variants and follow-ups?

Mask R-CNN turned out to be a productive base architecture. Many of the most cited instance-segmentation papers from 2018 to 2021 are direct extensions.

Cascade Mask R-CNN

Cai and Vasconcelos extended their Cascade R-CNN (CVPR 2018) to instance segmentation, producing Cascade Mask R-CNN.^[8] The detector is replaced by a sequence of three detection heads trained with progressively higher IoU thresholds (0.5, 0.6, 0.7).^[8] The output of one stage seeds the next, and at inference the predictions are averaged.^[8] This typically adds 2 to 3 box AP and 1 to 2 mask AP for a moderate compute cost.
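The effect of the rising thresholds on training labels can be sketched with a plain IoU function. This shows only the labelling rule; in the actual cascade each stage receives boxes already refined by the previous stage, which is not modelled here. The function names are illustrative.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


CASCADE_THRESHOLDS = (0.5, 0.6, 0.7)


def positive_at_stages(proposal, gt_box):
    """For each cascade stage, is this proposal a positive training sample?"""
    iou = box_iou(proposal, gt_box)
    return [iou >= t for t in CASCADE_THRESHOLDS]


gt = (0.0, 0.0, 100.0, 100.0)
loose = (0.0, 0.0, 100.0, 55.0)  # IoU 0.55 with gt
tight = (0.0, 0.0, 100.0, 80.0)  # IoU 0.80 with gt
print(positive_at_stages(loose, gt))
print(positive_at_stages(tight, gt))
```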

Hybrid Task Cascade

Chen et al. (CVPR 2019) proposed Hybrid Task Cascade (HTC), which improves on Cascade Mask R-CNN in two ways.^[9] First, the detection and segmentation cascades are interleaved rather than treated as separate parallel cascades; the mask features at stage t flow into stage t+1. Second, an auxiliary semantic segmentation branch provides global context.^[9] HTC was the basis of the COCO 2018 winning entry and reaches 48.6 mask AP on COCO test-challenge with appropriate backbones.^[9] It later served as a strong starting point for the LVIS Challenge 2019.

Mask Scoring R-CNN

Huang et al. (CVPR 2019) noted that COCO AP rewards calibration, and Mask R-CNN scores its masks with the classification confidence rather than a measure of mask quality.^[10] Mask Scoring R-CNN adds a small head that predicts the IoU between the predicted mask and the ground truth.^[10] The IoU prediction is multiplied with the class score at inference, producing better-calibrated rankings and a consistent 1 to 2 AP gain.^[10]
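The rescoring rule is a single multiplication, and the quantity the extra head learns to predict is the ordinary mask IoU. A minimal sketch with hypothetical helper names:

```python
def mask_iou(pred, target):
    """IoU between two binary masks given as 2-D lists of 0/1."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += p & t
            union += p | t
    return inter / union if union else 0.0


def mask_score(class_score, predicted_iou):
    """Rescore a detection: class confidence times predicted mask IoU."""
    return class_score * predicted_iou


# Two detections with the same class confidence but different mask quality.
print(mask_score(0.9, 0.9), mask_score(0.9, 0.5))
print(mask_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]]))
```

After rescoring, the detection with the poorer mask ranks lower even though the classifier was equally confident about both.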

PointRend

Kirillov, Wu, He, and Girshick (CVPR 2020) framed segmentation as a rendering problem.^[12] PointRend replaces the 28x28 mask head with an iterative point-based predictor that subdivides the mask, picks ambiguous boundary points, and re-evaluates them at higher resolution.^[12] The result is much sharper boundaries with little extra compute, and PointRend has become a common drop-in replacement for the standard mask head in Detectron2 pipelines.^[12]^[20]
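The point-selection step can be illustrated by ranking cells of a coarse probability mask by how close they are to 0.5. This is a simplified sketch of the idea only: it omits the subdivision and the per-point re-prediction, and the function name is hypothetical.

```python
def most_uncertain_points(prob_mask, num_points):
    """Return the (row, col) cells whose probability is closest to 0.5."""
    scored = []
    for i, row in enumerate(prob_mask):
        for j, prob in enumerate(row):
            scored.append((abs(prob - 0.5), i, j))
    scored.sort()  # smallest distance to 0.5 first
    return [(i, j) for _, i, j in scored[:num_points]]


coarse = [
    [0.02, 0.10, 0.45],
    [0.05, 0.55, 0.95],
    [0.40, 0.90, 0.99],
]
print(most_uncertain_points(coarse, 3))
```

The selected cells lie along the object boundary, where the coarse prediction is least certain and extra resolution helps most.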

Path Aggregation Network (PANet)

Liu et al. (CVPR 2018) added a bottom-up path on top of FPN to shorten the information path from low-level features to RoI features and improved instance segmentation by about 2 mask AP.^[11] PANet won the COCO 2017 instance-segmentation challenge.^[11]

Other heads on the same body

The Mask R-CNN architecture is modular, so the community has published several head variants.

Head	Output	Paper
Mask	28x28 binary mask per class	He et al. 2017
Keypoint	56x56 one-hot heatmap per keypoint	He et al. 2017
DensePose	Dense surface coordinate prediction	Guler et al. 2018
PointRend	High-resolution iterative mask	Kirillov et al. 2020
Boundary-aware	Mask plus boundary supervision	Various

What implementations of Mask R-CNN are available?

Implementation	Framework	Maintained by	Notes
Detectron2	PyTorch	Meta AI / FAIR	Reference implementation, model zoo with 1x, 3x, Cascade, PointRend
Detectron (legacy)	Caffe2	Facebook AI Research	Original public release; deprecated in favour of Detectron2
MMDetection	PyTorch	OpenMMLab	Wide model coverage, configs for HTC, Cascade Mask R-CNN, Mask Scoring R-CNN
torchvision.models.detection.maskrcnn_resnet50_fpn	PyTorch	PyTorch / Meta	One-line load with COCO pretrained weights
Matterport Mask R-CNN	TensorFlow / Keras	Community	Popular early TF1 reimplementation, no longer maintained
TensorFlow Object Detection API	TensorFlow	Google	Includes Mask R-CNN configs

The torchvision wrapper exposes the simplest API. The model takes a list of float tensors in the [0, 1] range and returns dictionaries with boxes, labels, scores, and masks per image, and it can be exported to ONNX for fixed input sizes.^[22]

What are the limitations of Mask R-CNN?

For all its impact, Mask R-CNN has known weaknesses.

It is a two-stage detector. The RPN proposes regions, then the heads refine them. This is slower than single-stage methods like YOLACT, SOLOv2, or YOLO-based instance segmenters, and it complicates batching because the number of proposals per image varies.^[13]^[14]
The RPN is anchor-based. Anchor sizes and aspect ratios must be tuned per dataset, and the matching of anchors to ground truth uses fixed IoU thresholds.^[3]
The 28x28 mask resolution loses fine boundary detail. Hair, thin structures, and small objects suffer. PointRend largely fixed this for users who want sharper masks.^[12]
Per-region processing is not GPU-friendly at the head. The box and mask heads run once per RoI, so their cost grows with the number of proposals, which limits batch efficiency.
The heads allocate one output slot per class label, so the architecture does not naturally handle large, long-tail vocabularies. The LVIS benchmark exposed this and motivated specialised long-tail variants.
The mask output is class-conditional. At inference only the mask channel of the class chosen by the classifier is kept; if the classifier is wrong, the mask is produced for the wrong category.^[1]

Why is Mask R-CNN still influential?

For most of 2017 to 2021, Mask R-CNN was the reference baseline for instance segmentation. Almost every new method reported improvements relative to it. RoIAlign in particular was adopted across the field, including in object detection systems that did not perform mask prediction.^[1]

The set-prediction wave that followed eventually displaced two-stage detectors at the top of the COCO leaderboard. DETR (Carion et al., ECCV 2020) reframed detection as direct set prediction with a transformer.^[15] MaskFormer (Cheng et al., 2021) and Mask2Former (Cheng et al., CVPR 2022) extended this to instance, panoptic, and semantic segmentation under a single mask-classification framework.^[16]^[17] Mask2Former reaches 50.1 mask AP on COCO, well above the 37 to 39 range of standard Mask R-CNN.^[17] Foundation models like Segment Anything (Kirillov et al., 2023), trained on about a billion masks, introduced prompt-conditioned segmentation and took the broader task of image segmentation in a different direction.^[18]

Despite these advances, Mask R-CNN remains common in production and in domain-specific applications. It is well understood, robust to fine-tuning on small datasets, and its training and deployment cost is predictable. For researchers entering instance segmentation it is still the typical first model to run.

How does Mask R-CNN compare to other instance-segmentation methods?

Method	Year	Paradigm	Backbone (typical)	COCO mask AP	Notes
Mask R-CNN	2017	Two-stage, anchor-based	ResNeXt-101-FPN	37.1 (orig.) / ~39.5 (D2)	RoIAlign, parallel mask head
YOLACT	2019	Single-stage, prototype masks	ResNet-101	29.8 at 33 fps	Real-time, prototype + coefficients
Mask Scoring R-CNN	2019	Two-stage, calibrated	ResNet-101-FPN	~38.3	Adds MaskIoU head
Cascade Mask R-CNN	2018	Two-stage cascade	ResNet-101-FPN	~38.5	Multi-stage IoU thresholds
HTC	2019	Two-stage cascade, interleaved	ResNeXt-101-FPN	48.6 (test-challenge)	COCO 2018 winning baseline
SOLOv2	2020	Single-stage, dynamic conv	ResNet-101	39.7	Light version 37.1 at 31 fps
DETR (panoptic)	2020	Set prediction, transformer	ResNet-101	~36 mask	Removes anchors and NMS
MaskFormer	2021	Set prediction, mask classification	Swin-L	47.2	Unified instance + semantic
Mask2Former	2022	Masked-attention transformer	Swin-L	50.1	State of the art on multiple tasks
Segment Anything (SAM)	2023	Prompt-conditioned, foundation model	ViT-H	Promptable, not directly comparable	Trained on 1B masks

References

[1] He, K., Gkioxari, G., Dollar, P., and Girshick, R. "Mask R-CNN." Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2961-2969. arXiv:1703.06870.
[2] He, K., Gkioxari, G., Dollar, P., and Girshick, R. "Mask R-CNN." IEEE Transactions on Pattern Analysis and Machine Intelligence (extended journal version), 2020.
[3] Ren, S., He, K., Girshick, R., and Sun, J. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." Advances in Neural Information Processing Systems 28 (NeurIPS 2015). arXiv:1506.01497.
[4] Girshick, R. "Fast R-CNN." Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015. arXiv:1504.08083.
[5] Girshick, R., Donahue, J., Darrell, T., and Malik, J. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014. arXiv:1311.2524.
[6] Lin, T.-Y., Dollar, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. "Feature Pyramid Networks for Object Detection." CVPR 2017. arXiv:1612.03144.
[7] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., and Zitnick, C. L. "Microsoft COCO: Common Objects in Context." ECCV 2014. arXiv:1405.0312.
[8] Cai, Z., and Vasconcelos, N. "Cascade R-CNN: Delving into High Quality Object Detection." CVPR 2018. arXiv:1712.00726.
[9] Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Shi, J., Ouyang, W., Loy, C. C., and Lin, D. "Hybrid Task Cascade for Instance Segmentation." CVPR 2019. arXiv:1901.07518.
[10] Huang, Z., Huang, L., Gong, Y., Huang, C., and Wang, X. "Mask Scoring R-CNN." CVPR 2019. arXiv:1903.00241.
[11] Liu, S., Qi, L., Qin, H., Shi, J., and Jia, J. "Path Aggregation Network for Instance Segmentation." CVPR 2018. arXiv:1803.01534.
[12] Kirillov, A., Wu, Y., He, K., and Girshick, R. "PointRend: Image Segmentation as Rendering." CVPR 2020. arXiv:1912.08193.
[13] Bolya, D., Zhou, C., Xiao, F., and Lee, Y. J. "YOLACT: Real-time Instance Segmentation." ICCV 2019. arXiv:1904.02689.
[14] Wang, X., Zhang, R., Kong, T., Li, L., and Shen, C. "SOLOv2: Dynamic and Fast Instance Segmentation." NeurIPS 2020. arXiv:2003.10152.
[15] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. "End-to-End Object Detection with Transformers." ECCV 2020. arXiv:2005.12872.
[16] Cheng, B., Schwing, A. G., and Kirillov, A. "Per-Pixel Classification is Not All You Need for Semantic Segmentation" (MaskFormer). NeurIPS 2021. arXiv:2107.06278.
[17] Cheng, B., Misra, I., Schwing, A. G., Kirillov, A., and Girdhar, R. "Masked-attention Mask Transformer for Universal Image Segmentation" (Mask2Former). CVPR 2022. arXiv:2112.01527.
[18] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Dollar, P., and Girshick, R. "Segment Anything." ICCV 2023. arXiv:2304.02643.
[19] Guler, R. A., Neverova, N., and Kokkinos, I. "DensePose: Dense Human Pose Estimation In The Wild." CVPR 2018. arXiv:1802.00434.
[20] Wu, Y., Kirillov, A., Massa, F., Lo, W.-Y., and Girshick, R. "Detectron2." Facebook AI Research, 2019. Code repository and model zoo.
[21] OpenMMLab. "MMDetection: Open MMLab Detection Toolbox and Benchmark." Chen, K. et al. arXiv:1906.07155.
[22] PyTorch. "torchvision.models.detection.maskrcnn_resnet50_fpn." API documentation.
[23] IEEE Signal Processing Society. "ICCV 2017 Best Paper Award: Mask R-CNN." January 2018. Newsletter.
