# Mask R-CNN

> Source: https://aiwiki.ai/wiki/mask_r_cnn
> Updated: 2026-06-22
> Categories: Computer Vision, Deep Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Mask R-CNN** is a deep [convolutional neural network](/wiki/convolutional_neural_network) for [instance segmentation](/wiki/instance_segmentation), introduced in 2017 by Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick at Facebook AI Research (FAIR).[1] It extends the [Faster R-CNN](/wiki/faster_r_cnn) two-stage object detector by adding a third parallel head that predicts a binary segmentation mask for each region of interest, so a single network simultaneously detects objects and produces a pixel-accurate mask for every instance.[1][3] Its two defining contributions are the parallel mask branch and **RoIAlign**, a quantization-free feature-sampling layer that aligns extracted features with the input, improving mask accuracy by a relative 10% to 50%.[1] The paper describes the method as "a conceptually simple, flexible, and general framework for object instance segmentation."[1]

Presented at the 2017 IEEE International Conference on Computer Vision (ICCV) in Venice (October 22-29, 2017), Mask R-CNN won the Best Paper Award, also known as the Marr Prize.[1][23] It adds only a small overhead to Faster R-CNN and runs at about 5 frames per second, achieving top results across all three tracks of the COCO challenge suite: instance segmentation, bounding-box object detection, and person keypoint detection.[1] The same architecture, with the mask head swapped for a heatmap head, also produced state-of-the-art human keypoint detection.[1] Mask R-CNN was released open source by FAIR first inside the Detectron framework (Caffe2) and later in Detectron2 (PyTorch), where it became the default baseline for instance segmentation research for years; its central contribution, RoIAlign, is now standard machinery in two-stage detectors.[20]

## How is Mask R-CNN related to R-CNN and Faster R-CNN?

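Mask R-CNN is the fourth model in the R-CNN lineage of region-based convolutional networks from Ross Girshick and collaborators, after R-CNN, Fast R-CNN, and Faster R-CNN; the table also lists SPPnet, whose shared-feature idea Fast R-CNN built on. Each step in the family addresses a bottleneck in the previous one.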

| Model | Year | Authors | Key idea | Remaining bottleneck |
|-------|------|---------|----------|-------------|
| R-CNN | 2014 | Girshick, Donahue, Darrell, Malik | Run a CNN on each of ~2,000 selective-search proposals; SVMs classify | Per-region forward pass |
| SPPnet | 2014 | He, Zhang, Ren, Sun | Spatial pyramid pooling shares conv features across proposals | Multi-stage training |
| Fast R-CNN | 2015 | Girshick | Single shared CNN, RoI pooling, single-stage training, end-to-end except proposals | Selective search proposals |
| Faster R-CNN | 2015 | Ren, He, Girshick, Sun | Region Proposal Network (RPN) replaces selective search; trained jointly | Coarse RoIPool quantisation |
| Mask R-CNN | 2017 | He, Gkioxari, Dollar, Girshick | Adds parallel mask branch; replaces RoIPool with RoIAlign | Two-stage, anchor-based |

Faster R-CNN provided the detection scaffolding.[3] The contribution of Mask R-CNN was to show that pixel-level instance segmentation could be added cheaply, by reusing the same RoI features the box head already consumed, and that doing so even improved bounding-box accuracy.[1] Importantly, the paper showed the architecture generalises beyond masks. Replace the mask branch with a heatmap regressor and you get a competitive human pose estimator (Keypoint R-CNN).[1] Replace it with a dense surface coordinate predictor and you get [DensePose](https://arxiv.org/abs/1802.00434).[19]

## How does the Mask R-CNN architecture work?

Mask R-CNN keeps every component of Faster R-CNN intact and bolts on a small fully convolutional mask head.[1][3]

### Backbone

The backbone extracts features from the input image. The original paper experiments with three options.[1]

| Backbone | Notes |
|----------|-------|
| [ResNet](/wiki/resnet)-50 / ResNet-101 (C4) | Faster R-CNN style; RoI features are taken from the final convolutional layer of stage 4 (C4) |
| ResNet-50-FPN / ResNet-101-FPN | Multi-scale features via Feature Pyramid Network |
| ResNeXt-101-FPN | Best accuracy in the paper, used for headline numbers |

The Feature Pyramid Network (FPN) of Lin et al. (CVPR 2017) is the recommended backbone.[6] FPN fuses high-resolution low-level features with semantically rich high-level features through a top-down pathway with lateral connections, giving the detector a feature pyramid that represents objects across a wide range of scales.[6] Backbones are pre-trained on ImageNet classification before being fine-tuned on COCO.[1][7]

### Region Proposal Network

The RPN is identical to the one in Faster R-CNN.[3] It slides a small network over the convolutional feature map, predicting at each anchor whether the location contains an object and a coarse bounding box adjustment.[3] With FPN, the RPN runs at every pyramid level, and anchors of different scales are assigned to the level whose stride matches their size.[6] Roughly 1,000 to 2,000 proposals survive non-maximum suppression and are fed to the heads.[1]
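The proposal filtering step can be illustrated with a minimal sketch of greedy non-maximum suppression in plain Python. The function names, the 0.7 IoU threshold (the value the Faster R-CNN paper uses for RPN proposals), and the cap of 1,000 kept boxes are assumptions made for illustration; production code relies on batched GPU kernels instead.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.7, max_keep=1000):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if len(keep) == max_keep:
                break
    return keep


# Hypothetical proposals: the first two describe nearly the same region.
proposals = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 50, 260, 170)]
objectness = [0.95, 0.90, 0.80]
kept = nms(proposals, objectness)
print(kept)
```

In this toy input the second box overlaps the first with an IoU of about 0.92, so only indices 0 and 2 survive.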

### Three parallel heads

For every proposal that survives the RPN, three heads run in parallel:

1. A classification head that outputs a softmax over object classes plus background.
2. A class-specific bounding-box regression head that refines the coordinates.
3. A mask head that outputs a small binary mask for each class.

This parallel structure differs from earlier instance segmentation systems (DeepMask, MNC, FCIS) which produced masks first and classified them afterwards. Decoupling the two tasks is one of the keys to the method working as well as it does.[1]

### Mask head

The mask head is a small fully convolutional network. The standard configuration on the FPN variant is four 3x3 convolutional layers with 256 channels, followed by a 2x2 transposed convolution that upsamples to 28x28, and a 1x1 convolution that produces K binary masks (one per class).[1] The output is treated independently per class. There is no inter-class competition because the loss applies a per-pixel sigmoid rather than a softmax across classes.[1]
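The layer sequence above can be checked with simple shape arithmetic. The sketch below assumes details that the paragraph does not state: a 14x14 RoIAlign input (as in the paper's FPN head figure), padding of 1 on the 3x3 convolutions, a stride of 2 for the transposed convolution, and K = 80 COCO classes. The helper names are illustrative.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a standard convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride):
    """Spatial output size of a transposed convolution without padding."""
    return (size - 1) * stride + kernel


def mask_head_shapes(roi_size=14, num_classes=80, channels=256):
    """Return (layer name, channels, spatial size) for each stage of the head."""
    shapes = [("roi_features", channels, roi_size)]
    size = roi_size
    for i in range(4):
        size = conv_out(size, kernel=3, padding=1)   # 3x3 conv keeps the size
        shapes.append((f"conv3x3_{i + 1}", channels, size))
    size = deconv_out(size, kernel=2, stride=2)      # 2x2 deconv doubles it
    shapes.append(("deconv2x2", channels, size))
    size = conv_out(size, kernel=1)                  # 1x1 conv: one map per class
    shapes.append(("conv1x1_masks", num_classes, size))
    return shapes


shapes = mask_head_shapes()
for name, ch, size in shapes:
    print(f"{name:>14}: {ch} x {size} x {size}")
```

Under these assumptions the head ends with 80 maps of 28x28, one per class.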

A mask resolution of 28x28 sounds tiny, but the mask is defined inside the RoI, then bilinearly upsampled and pasted back into the image during inference.[1] For typical COCO objects the resolution is sufficient. Fine boundary details, however, are limited by this design, and later methods like PointRend specifically address it.[12]
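The resize-and-paste step can be sketched in plain Python. This is an illustrative simplification, not the reference implementation: the box is assumed to have integer pixel coordinates with exclusive right and bottom edges, the resize uses a pixel-centre bilinear convention, and the function names are hypothetical. The 0.5 binarisation threshold follows the paper's description of inference.

```python
def bilinear_resize(mask, out_h, out_w):
    """Resize a 2-D list of floats with bilinear interpolation."""
    in_h, in_w = len(mask), len(mask[0])
    out = []
    for y in range(out_h):
        # Map the output pixel centre back into input coordinates and clamp.
        sy = min(max((y + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        wy = sy - y0
        row = []
        for x in range(out_w):
            sx = min(max((x + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            wx = sx - x0
            top = mask[y0][x0] * (1 - wx) + mask[y0][x1] * wx
            bottom = mask[y1][x0] * (1 - wx) + mask[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def paste_mask(prob_mask, box, image_h, image_w, threshold=0.5):
    """Resize an RoI-level probability mask to its box and binarise it
    on an image-sized canvas. Pixels outside the box stay 0."""
    x1, y1, x2, y2 = box
    resized = bilinear_resize(prob_mask, y2 - y1, x2 - x1)
    canvas = [[0] * image_w for _ in range(image_h)]
    for dy, row in enumerate(resized):
        for dx, p in enumerate(row):
            yy, xx = y1 + dy, x1 + dx
            if 0 <= yy < image_h and 0 <= xx < image_w:
                canvas[yy][xx] = 1 if p >= threshold else 0
    return canvas


# Toy 28x28 mask: confident foreground on the left half, background on the right.
prob_mask = [[0.9] * 14 + [0.1] * 14 for _ in range(28)]
canvas = paste_mask(prob_mask, box=(4, 2, 12, 10), image_h=16, image_w=16)
```

Because the mask lives in RoI coordinates, the same 28x28 output covers an 8x8 box here and would cover a 400x400 box equally, which is where boundary detail is lost on large objects.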

## What is RoIAlign and why does it matter?

RoIAlign is the key technical contribution of Mask R-CNN. Faster R-CNN extracts a fixed-size feature map for every proposal using **RoIPool**.[3] RoIPool quantises the floating-point RoI coordinates twice. First, the RoI itself is rounded to integer feature-map grid cells. Second, the RoI is divided into bins (typically 7x7) and the bin boundaries are again rounded to integer positions before max pooling.[1] Each rounding can displace features by up to one feature-map cell, which at a stride of 16 corresponds to as much as 16 pixels in the original image. For classification this is mostly harmless, because class predictions are robust to small translations. For pixel-level mask prediction the misalignment costs substantial accuracy.[1]
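A small arithmetic sketch makes the effect of rounding concrete. It assumes a stride of 16 and treats feature-map cell `i` as sitting at image coordinate `i * stride`, which is a simplification; the function names are illustrative.

```python
def roipool_cell(x_image, stride=16):
    """Feature-map index after RoIPool-style quantisation: divide by the
    stride, then round. (Python's round() sends exact .5 ties to the even
    integer, which does not affect the inputs used below.)"""
    return round(x_image / stride)


def drift_in_pixels(x_image, stride=16):
    """Distance in image pixels between the true coordinate and the
    position of the cell it was rounded to."""
    return abs(roipool_cell(x_image, stride) * stride - x_image)


for x in (64, 71, 90):
    print(x, x / 16, roipool_cell(x), drift_in_pixels(x))
```

An RoI edge at x = 71 maps to feature coordinate 4.4375 but is rounded to cell 4, a drift of 7 image pixels, which is large relative to a mask that must be accurate to the pixel.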

**RoIAlign** removes both quantisations. The RoI coordinates are kept as floating-point values. Each bin is sampled at four regularly spaced points, and each sample is computed by bilinear interpolation from the four nearest feature-map cells. The four samples in a bin are then aggregated by max or average pooling. The operation is fully differentiable.[1] The authors describe RoIAlign as "a simple, quantization-free layer that faithfully preserves exact spatial locations."[1]
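The sampling procedure can be sketched for a single-channel feature map in plain Python. This is a minimal illustration rather than the paper's implementation: it uses average pooling, a 2x2 output grid instead of 7x7 or 14x14, and treats cell `i` as located at continuous coordinate `i` (real implementations differ in their half-pixel offset conventions). All names are hypothetical.

```python
def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a 2-D feature map at a real-valued (y, x)."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = feat[y0][x0] * (1 - wx) + feat[y0][x1] * wx
    bottom = feat[y1][x0] * (1 - wx) + feat[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def roi_align(feat, roi, out_size=2, samples_per_side=2, stride=16):
    """Average-pooled RoIAlign. roi is (x1, y1, x2, y2) in image pixels;
    no coordinate is rounded at any point."""
    x1, y1, x2, y2 = (c / stride for c in roi)   # float feature-map coords
    bin_h, bin_w = (y2 - y1) / out_size, (x2 - x1) / out_size
    out = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            total = 0.0
            for sy in range(samples_per_side):
                for sx in range(samples_per_side):
                    # Regularly spaced sample points inside the bin.
                    yy = y1 + (by + (sy + 0.5) / samples_per_side) * bin_h
                    xx = x1 + (bx + (sx + 0.5) / samples_per_side) * bin_w
                    total += bilinear_sample(feat, yy, xx)
            row.append(total / samples_per_side ** 2)
        out.append(row)
    return out


# Toy feature map whose value equals its column index.
feature_map = [[float(x) for x in range(8)] for _ in range(8)]
pooled = roi_align(feature_map, roi=(16, 16, 80, 80))
shifted = roi_align(feature_map, roi=(24, 16, 88, 80))  # moved by half a cell
```

Shifting the RoI by 8 image pixels (half a feature cell) shifts every pooled value by 0.5, so sub-cell displacements survive the pooling; with RoIPool-style rounding both RoIs could snap to the same cells.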

The table below summarises the two operations.

| Operation | RoI coordinates | Bin boundaries | Sampling | Effect on masks |
|-----------|-----------------|----------------|----------|-----------------|
| RoIPool | Quantised to grid | Quantised to grid | Max over integer cells | Features misaligned by up to a feature-map cell |
| RoIAlign | Floating point | Floating point | Bilinear interpolation at sampled points | Aligned features |

In the ablations of the original paper, swapping RoIPool for RoIAlign improves mask AP by about 3 points, with a larger gain at the strict IoU 0.75 threshold; on stride-32 features the AP at IoU 0.75 improves by around 50% relative. Overall the paper reports RoIAlign raising mask accuracy by a relative 10% to 50%.[1] RoIAlign also helps box AP by roughly 1 point even though boxes were the existing strength of Faster R-CNN.[1] The change is conceptually small and computationally cheap, yet it has a large impact on mask accuracy.

## How is Mask R-CNN trained?

The network is trained with a multi-task loss that sums three terms:

```
L = L_cls + L_box + L_mask
```

The classification loss `L_cls` is the standard log loss over object classes. The box regression loss `L_box` is the smooth L1 loss inherited from Fast R-CNN.[4] The mask loss `L_mask` is the average binary cross-entropy applied per pixel. Crucially, `L_mask` is computed only on the predicted mask channel that corresponds to the ground-truth class. The other K-1 mask channels do not contribute to the loss for that RoI.[1]

This decoupling of mask and class prediction is one of the design decisions the paper highlights.[1] If a softmax across classes were used at the mask output, the model would be forced to choose between classes during mask prediction, even though the classification head is already deciding that. Removing the inter-class competition gives the classifier room to drive class assignment while the mask head focuses on shape. The ablation in the paper shows about a 5.5 AP gain from this choice over a per-pixel softmax equivalent.[1]
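The mask loss and the summed objective can be sketched in plain Python for a single RoI. This is an illustrative sketch under stated assumptions: nested lists stand in for tensors, the function names are hypothetical, and the binary cross-entropy is written in a numerically stable form on raw logits.

```python
import math


def mask_loss(mask_logits, gt_class, gt_mask):
    """Average per-pixel sigmoid cross-entropy on the ground-truth channel.

    mask_logits: K channels, each an m x m grid of raw scores.
    gt_class:    index of the ground-truth class for this RoI.
    gt_mask:     m x m grid of 0/1 targets.
    """
    channel = mask_logits[gt_class]   # the other K-1 channels are ignored
    total, count = 0.0, 0
    for logit_row, target_row in zip(channel, gt_mask):
        for z, t in zip(logit_row, target_row):
            # Stable form of -[t*log(s(z)) + (1-t)*log(1-s(z))], s = sigmoid.
            total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count


def total_loss(l_cls, l_box, l_mask):
    """Multi-task objective: L = L_cls + L_box + L_mask."""
    return l_cls + l_box + l_mask


# Two classes, 2x2 masks. Channel 0 is undecided, channel 1 fits the target.
logits = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[9.0, -9.0], [9.0, -9.0]],
]
target = [[1, 0], [1, 0]]
loss_if_class0 = mask_loss(logits, 0, target)
loss_if_class1 = mask_loss(logits, 1, target)
```

Because only one channel is selected, the values in the other channel never enter the loss: changing channel 1 leaves `loss_if_class0` untouched, which is the absence of inter-class competition described above.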

The COCO training recipe in the original paper is straightforward.[1]

| Setting | Value |
|---------|-------|
| Optimiser | SGD with momentum 0.9 |
| Weight decay | 1e-4 |
| Batch size | 16 (8 GPUs, 2 images per GPU) |
| Base learning rate | 0.02 |
| LR schedule | Step decay at 120k iterations |
| Total iterations | 160k on COCO trainval35k |
| Image scale | Shorter side 800 px |
| Augmentation | Horizontal flip only |
| Backbone | ImageNet pre-trained, BN frozen |
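The recipe can be written down as a small configuration sketch. The dictionary keys and the function name are hypothetical and do not correspond to any particular framework's config schema; the decay factor of 0.1 is taken from the paper's description of the step (a tenfold reduction) rather than from the table.

```python
TRAIN_CONFIG = {
    "optimiser": "sgd",
    "momentum": 0.9,
    "weight_decay": 1e-4,
    "gpus": 8,
    "images_per_gpu": 2,
    "batch_size": 8 * 2,          # 16 images per iteration
    "base_lr": 0.02,
    "lr_decay_at": 120_000,
    "lr_decay_factor": 0.1,       # assumption: tenfold step, per the paper
    "total_iterations": 160_000,
    "shorter_side_px": 800,
    "augmentation": ["horizontal_flip"],
    "freeze_backbone_bn": True,
}


def learning_rate(iteration, cfg=TRAIN_CONFIG):
    """Step schedule: base rate until the decay point, then a single drop."""
    if not 0 <= iteration < cfg["total_iterations"]:
        raise ValueError("iteration outside the schedule")
    if iteration >= cfg["lr_decay_at"]:
        return cfg["base_lr"] * cfg["lr_decay_factor"]
    return cfg["base_lr"]


for it in (0, 119_999, 120_000, 159_999):
    print(it, learning_rate(it))
```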

Batch normalisation statistics in the backbone are frozen because the per-GPU batch size is too small to estimate them reliably; later codebases like Detectron2 add Group Normalisation or Synchronised BatchNorm options to allow training from scratch.[20] Mask R-CNN runs at roughly 5 frames per second at inference on a Tesla M40 GPU with a ResNet-101-FPN backbone, which made it tractable for research experimentation but not real-time.[1]

## What accuracy does Mask R-CNN achieve on COCO?

The original ICCV 2017 paper reports the following on COCO test-dev for instance segmentation. These are the canonical numbers cited in follow-up work.[1][7]

| Backbone | Mask AP | AP_50 | AP_75 | Box AP |
|----------|---------|-------|-------|--------|
| ResNet-101-C4 | 33.1 | 54.9 | 34.8 | - |
| ResNet-101-FPN | 35.7 | 58.0 | 37.8 | 38.2 |
| ResNeXt-101-FPN | 37.1 | 60.0 | 39.4 | 39.8 |

At release these numbers were state of the art on COCO instance segmentation, beating the COCO 2016 challenge winners by clear margins.[1] The bounding-box numbers were also state of the art for object detection at the time, even though Mask R-CNN was not designed as a pure detector. The paper attributes part of this box gain to multi-task training: supervising the mask branch also improves the boxes.[1]

On the Cityscapes instance segmentation benchmark for the person and vehicle categories, Mask R-CNN reaches 26.2 AP using fine annotations only and 32.0 AP after pre-training on COCO, exceeding the previous best by about 5.5 AP.[1] For the person category specifically, the paper reports 30.5 AP with fine data only and 34.8 AP with COCO pre-training.[1]

Detectron2 has since reproduced and improved these numbers using better training schedules, learning-rate warmup, and updated backbones.[20] The Detectron2 model zoo currently lists the following baselines on COCO val2017.[20]

| Model | Schedule | Box AP | Mask AP |
|-------|----------|--------|---------|
| Mask R-CNN R50-FPN | 1x | 38.6 | 35.2 |
| Mask R-CNN R50-FPN | 3x | 41.0 | 37.2 |
| Mask R-CNN R101-FPN | 3x | 42.9 | 38.6 |
| Mask R-CNN X101-32x8d-FPN | 3x | 44.3 | 39.5 |
| Cascade Mask R-CNN R50-FPN | 3x | 44.3 | 38.5 |

These are higher than the original paper because of longer schedules, BN tweaks, and small architectural updates rather than a fundamental change.

### Keypoint R-CNN

The same paper also presents **Keypoint R-CNN**, where the mask head is replaced with a head that predicts K one-hot heatmaps, one per body keypoint.[1] Each heatmap is a 56x56 grid where the model is trained with a softmax to localise the keypoint at exactly one cell.[1] On COCO test-dev for human keypoints, the model reaches 62.7 keypoint AP using ResNet-50-FPN, which was state of the art at the time and outperformed the COCO 2016 keypoint challenge winner.[1]
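The keypoint objective can be sketched as a softmax over all cells of one heatmap. The code below is a minimal illustration with hypothetical function names, using nested lists in place of tensors and a log-sum-exp formulation for numerical stability.

```python
import math


def keypoint_loss(heatmap_logits, target_yx):
    """Cross-entropy of a softmax over every cell of one heatmap,
    against a one-hot target at target_yx = (row, column)."""
    flat = [z for row in heatmap_logits for z in row]
    width = len(heatmap_logits[0])
    target_index = target_yx[0] * width + target_yx[1]
    peak = max(flat)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in flat))
    return log_norm - flat[target_index]


def decode_keypoint(heatmap_logits):
    """Return the (row, column) of the highest-scoring cell."""
    cells = ((z, (y, x))
             for y, row in enumerate(heatmap_logits)
             for x, z in enumerate(row))
    return max(cells)[1]


size = 56
uniform = [[0.0] * size for _ in range(size)]
peaked = [row[:] for row in uniform]
peaked[20][30] = 10.0                      # confident prediction at one cell

loss_uniform = keypoint_loss(uniform, (20, 30))   # equals log(56 * 56)
loss_peaked = keypoint_loss(peaked, (20, 30))
location = decode_keypoint(peaked)
```

Unlike the per-pixel sigmoid used for masks, the softmax forces the cells of a heatmap to compete, which suits a target that occupies exactly one cell.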

## What are the main variants and follow-ups?

Mask R-CNN turned out to be a productive base architecture. Many of the most cited instance-segmentation papers from 2018 to 2021 are direct extensions.

### Cascade Mask R-CNN

Cai and Vasconcelos extended their Cascade R-CNN (CVPR 2018) to instance segmentation, producing **Cascade Mask R-CNN**.[8] The detector is replaced by a sequence of three detection heads trained with progressively higher IoU thresholds (0.5, 0.6, 0.7).[8] The output of one stage seeds the next, and at inference the predictions are averaged.[8] This typically adds 2 to 3 box AP and 1 to 2 mask AP for a moderate compute cost.
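Two ingredients of the cascade, stage-specific positive labelling and score averaging at inference, can be sketched as follows. This is a deliberate simplification with hypothetical names: in the actual method each stage also refines the boxes, so the IoU values seen by later stages change, which the sketch does not model.

```python
CASCADE_IOU_THRESHOLDS = (0.5, 0.6, 0.7)


def stage_labels(proposal_ious, thresholds=CASCADE_IOU_THRESHOLDS):
    """For each stage, mark which proposals count as positives given
    their IoU with the matched ground-truth box."""
    return [[iou >= t for iou in proposal_ious] for t in thresholds]


def average_scores(stage_scores):
    """Average per-class scores over all stages for one detection."""
    n = len(stage_scores)
    return [sum(column) / n for column in zip(*stage_scores)]


labels = stage_labels([0.55, 0.65, 0.75])
final = average_scores([[0.6, 0.4], [0.8, 0.2], [0.7, 0.3]])
```

A proposal with IoU 0.55 is a positive only for the first stage, while one with IoU 0.75 is a positive for all three, so later heads specialise in higher-quality boxes.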

### Hybrid Task Cascade

Chen et al. (CVPR 2019) proposed **Hybrid Task Cascade (HTC)**, which improves on Cascade Mask R-CNN in two ways.[9] First, the detection and segmentation cascades are interleaved rather than treated as separate parallel cascades; the mask features at stage `t` flow into stage `t+1`. Second, an auxiliary semantic segmentation branch provides global context.[9] HTC was the basis of the COCO 2018 winning entry and reaches 48.6 mask AP on COCO test-challenge with appropriate backbones.[9] It later served as a strong starting point for the LVIS Challenge 2019.

### Mask Scoring R-CNN

Huang et al. (CVPR 2019) noted that COCO AP rewards calibration, and Mask R-CNN scores its masks with the classification confidence rather than a measure of mask quality.[10] **Mask Scoring R-CNN** adds a small head that predicts the IoU between the predicted mask and the ground truth.[10] The IoU prediction is multiplied with the class score at inference, producing better-calibrated rankings and a consistent 1 to 2 AP gain.[10]
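The rescoring rule is simple enough to show directly. In this sketch the dictionary keys and the example values are hypothetical; only the multiplication of class score and predicted mask IoU comes from the description above.

```python
def rescore(detections):
    """Attach mask_score = cls_score * pred_mask_iou and rank by it."""
    rescored = [
        dict(d, mask_score=d["cls_score"] * d["pred_mask_iou"])
        for d in detections
    ]
    return sorted(rescored, key=lambda d: d["mask_score"], reverse=True)


detections = [
    {"id": "a", "cls_score": 0.90, "pred_mask_iou": 0.50},  # confident class, poor mask
    {"id": "b", "cls_score": 0.80, "pred_mask_iou": 0.90},  # slightly lower class, good mask
]
ranked = rescore(detections)
print([d["id"] for d in ranked])
```

Ranked by class score alone, detection `a` would come first; after rescoring, `b` leads because its mask is predicted to be much closer to the ground truth.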

### PointRend

Kirillov, Wu, He, and Girshick (CVPR 2020) framed segmentation as a rendering problem.[12] **PointRend** replaces the 28x28 mask head with an iterative point-based predictor that subdivides the mask, picks ambiguous boundary points, and re-evaluates them at higher resolution.[12] The result is much sharper boundaries with little extra compute, and PointRend has become a common drop-in replacement for the standard mask head in Detectron2 pipelines.[12][20]
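The point-selection idea can be illustrated with a short sketch that picks the cells whose foreground probability is closest to 0.5. It covers only the selection step, with a hypothetical function name; the subdivision schedule and the point-wise predictor that re-evaluates the chosen points are not modelled.

```python
def most_uncertain_points(prob_mask, n):
    """Return the n (row, column) cells whose probability is nearest 0.5,
    most ambiguous first."""
    scored = [
        (abs(p - 0.5), (y, x))
        for y, row in enumerate(prob_mask)
        for x, p in enumerate(row)
    ]
    scored.sort()
    return [yx for _, yx in scored[:n]]


# Toy coarse mask: the middle column straddles an object boundary.
coarse = [
    [0.90, 0.55, 0.10],
    [0.95, 0.48, 0.05],
]
points = most_uncertain_points(coarse, n=2)
print(points)
```

Confident interior and background cells are skipped, so the extra computation concentrates on the boundary.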

### Path Aggregation Network (PANet)

Liu et al. (CVPR 2018) added a bottom-up path on top of FPN to shorten the information path from low-level features to RoI features. The change improved instance segmentation by about 2 mask AP.[11] PANet won the COCO 2017 instance-segmentation challenge.[11]

### Other heads on the same body

The Mask R-CNN architecture is modular, so the community has published several head variants.

| Head | Output | Paper |
|------|--------|-------|
| Mask | 28x28 binary mask per class | He et al. 2017 |
| Keypoint | 56x56 one-hot heatmap per keypoint | He et al. 2017 |
| DensePose | Dense surface coordinate prediction | Guler et al. 2018 |
| PointRend | High-resolution iterative mask | Kirillov et al. 2020 |
| Boundary-aware | Mask plus boundary supervision | Various |

## What implementations of Mask R-CNN are available?

| Implementation | Framework | Maintained by | Notes |
|---------------|-----------|---------------|-------|
| Detectron2 | PyTorch | Meta AI / FAIR | Reference implementation, model zoo with 1x, 3x, Cascade, PointRend |
| Detectron (legacy) | Caffe2 | Facebook AI Research | Original public release; deprecated in favour of Detectron2 |
| MMDetection | PyTorch | OpenMMLab | Wide model coverage, configs for HTC, Cascade Mask R-CNN, Mask Scoring R-CNN |
| torchvision.models.detection.maskrcnn_resnet50_fpn | PyTorch | PyTorch / Meta | One-line load with COCO pretrained weights |
| Matterport Mask R-CNN | TensorFlow / Keras | Community | Popular early TF1 reimplementation, no longer maintained |
| TensorFlow Object Detection API | TensorFlow | Google | Includes Mask R-CNN configs |

The torchvision wrapper exposes the simplest API. The model takes a list of float tensors in the [0, 1] range and returns dictionaries with boxes, labels, scores, and masks per image, and it can be exported to ONNX for fixed input sizes.[22]
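A typical next step is to filter the returned dictionary by score and binarise the soft masks. The sketch below works on plain Python lists that stand in for the tensors, so it runs without PyTorch; the per-detection H x W mask layout, both 0.5 thresholds, and the function name are assumptions for illustration (torchvision returns masks as tensors with an extra channel dimension).

```python
def filter_detections(output, score_threshold=0.5, mask_threshold=0.5):
    """Keep detections above score_threshold and binarise their masks.

    output: dict with 'boxes', 'labels', 'scores', 'masks', one entry
            per detection, mirroring the per-image result described above.
    """
    kept = []
    for box, label, score, mask in zip(
        output["boxes"], output["labels"], output["scores"], output["masks"]
    ):
        if score < score_threshold:
            continue
        binary = [[1 if p >= mask_threshold else 0 for p in row] for row in mask]
        kept.append({"box": box, "label": label, "score": score, "mask": binary})
    return kept


# Hypothetical output for one tiny 2x3 image with two detections.
output = {
    "boxes": [[0.0, 0.0, 2.0, 2.0], [1.0, 0.0, 3.0, 2.0]],
    "labels": [1, 18],
    "scores": [0.92, 0.30],
    "masks": [
        [[0.9, 0.7, 0.1], [0.8, 0.4, 0.0]],
        [[0.1, 0.6, 0.6], [0.0, 0.5, 0.7]],
    ],
}
results = filter_detections(output)
```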

## What are the limitations of Mask R-CNN?

For all its impact, Mask R-CNN has known weaknesses.

1. It is a two-stage detector. The RPN proposes regions, then the heads refine them. This is slower than single-stage methods like YOLACT, SOLOv2, or [YOLO](/wiki/yolo)-based instance segmenters, and it complicates batching because the number of proposals per image varies.[13][14]
2. The RPN is anchor-based. Anchor sizes and aspect ratios must be tuned per dataset, and the matching of anchors to ground truth uses fixed IoU thresholds.[3]
3. The 28x28 mask resolution loses fine boundary detail. Hair, thin structures, and small objects suffer. PointRend largely fixed this for users who want sharper masks.[12]
4. Per-region processing at the heads is less GPU-friendly than a single dense pass. Each RoI is cropped and processed separately, so head cost grows with the number of proposals and batch efficiency suffers.
5. The box and mask heads allocate one output per class, so the architecture does not naturally handle large, long-tailed vocabularies. The [LVIS](/wiki/lvis) benchmark exposed this and motivated specialised long-tail variants.
6. Mask selection depends on the classifier. The head predicts a mask for every class, but at inference only the channel of the class chosen by the classifier is kept; if the classifier is wrong, the mask is assigned to the wrong category.[1]

## Why is Mask R-CNN still influential?

For most of 2017 to 2021, Mask R-CNN was the reference baseline for instance segmentation. Almost every new method reported improvements relative to it. RoIAlign in particular was adopted across the field, including in [object detection](/wiki/object_detection) systems that did not perform mask prediction.[1]

The set-prediction wave that followed eventually displaced two-stage detectors at the top of the COCO leaderboard. DETR (Carion et al., ECCV 2020) reframed detection as direct set prediction with a transformer.[15] MaskFormer (Cheng et al., 2021) and Mask2Former (Cheng et al., CVPR 2022) extended this to instance, panoptic, and [semantic segmentation](/wiki/semantic_segmentation) under a single mask-classification framework.[16][17] Mask2Former reaches 50.1 mask AP on COCO, well above the 37 to 39 range of standard Mask R-CNN.[17] Foundation models like Segment Anything (Kirillov et al., 2023) added prompt-conditioned segmentation across a billion masks, expanding the broader task of [image segmentation](/wiki/image_segmentation) in a different direction.[18]

Despite these advances, Mask R-CNN remains common in production and in domain-specific applications. It is well understood, robust to fine-tuning on small datasets, and its training and deployment cost is predictable. For researchers entering instance segmentation it is still the typical first model to run.

## How does Mask R-CNN compare to other instance-segmentation methods?

| Method | Year | Paradigm | Backbone (typical) | COCO mask AP | Notes |
|--------|------|----------|--------------------|--------------|-------|
| Mask R-CNN | 2017 | Two-stage, anchor-based | ResNeXt-101-FPN | 37.1 (orig.) / ~39.5 (D2) | RoIAlign, parallel mask head |
| Cascade Mask R-CNN | 2018 | Two-stage cascade | ResNet-101-FPN | ~38.5 | Multi-stage IoU thresholds |
| YOLACT | 2019 | Single-stage, prototype masks | ResNet-101 | 29.8 at 33 fps | Real-time, prototype + coefficients |
| Mask Scoring R-CNN | 2019 | Two-stage, calibrated | ResNet-101-FPN | ~38.3 | Adds MaskIoU head |
| HTC | 2019 | Two-stage cascade, interleaved | ResNeXt-101-FPN | 48.6 (test-challenge) | COCO 2018 winning baseline |
| SOLOv2 | 2020 | Single-stage, dynamic conv | ResNet-101 | 39.7 | Light version 37.1 at 31 fps |
| DETR (panoptic) | 2020 | Set prediction, transformer | ResNet-101 | ~36 mask | Removes anchors and NMS |
| MaskFormer | 2021 | Set prediction, mask classification | Swin-L | 47.2 | Unified instance + semantic |
| Mask2Former | 2022 | Masked-attention transformer | Swin-L | 50.1 | State of the art on multiple tasks |
| Segment Anything (SAM) | 2023 | Prompt-conditioned, foundation model | ViT-H | Promptable, not directly comparable | Trained on 1B masks |

## References

1. He, K., Gkioxari, G., Dollar, P., and Girshick, R. "Mask R-CNN." Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2961-2969. arXiv:1703.06870. [Open access PDF](https://openaccess.thecvf.com/content_ICCV_2017/papers/He_Mask_R-CNN_ICCV_2017_paper.pdf).
2. He, K., Gkioxari, G., Dollar, P., and Girshick, R. "Mask R-CNN." IEEE Transactions on Pattern Analysis and Machine Intelligence (extended journal version), 2020.
3. Ren, S., He, K., Girshick, R., and Sun, J. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." Advances in Neural Information Processing Systems 28 (NeurIPS 2015). arXiv:1506.01497.
4. Girshick, R. "Fast R-CNN." Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015. arXiv:1504.08083.
5. Girshick, R., Donahue, J., Darrell, T., and Malik, J. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014. arXiv:1311.2524.
6. Lin, T.-Y., Dollar, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. "Feature Pyramid Networks for Object Detection." CVPR 2017. arXiv:1612.03144.
7. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., and Zitnick, C. L. "Microsoft COCO: Common Objects in Context." ECCV 2014. arXiv:1405.0312.
8. Cai, Z., and Vasconcelos, N. "Cascade R-CNN: Delving into High Quality Object Detection." CVPR 2018. arXiv:1712.00726.
9. Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Shi, J., Ouyang, W., Loy, C. C., and Lin, D. "Hybrid Task Cascade for Instance Segmentation." CVPR 2019. arXiv:1901.07518.
10. Huang, Z., Huang, L., Gong, Y., Huang, C., and Wang, X. "Mask Scoring R-CNN." CVPR 2019. arXiv:1903.00241.
11. Liu, S., Qi, L., Qin, H., Shi, J., and Jia, J. "Path Aggregation Network for Instance Segmentation." CVPR 2018. arXiv:1803.01534.
12. Kirillov, A., Wu, Y., He, K., and Girshick, R. "PointRend: Image Segmentation as Rendering." CVPR 2020. arXiv:1912.08193.
13. Bolya, D., Zhou, C., Xiao, F., and Lee, Y. J. "YOLACT: Real-time Instance Segmentation." ICCV 2019. arXiv:1904.02689.
14. Wang, X., Zhang, R., Kong, T., Li, L., and Shen, C. "SOLOv2: Dynamic and Fast Instance Segmentation." NeurIPS 2020. arXiv:2003.10152.
15. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. "End-to-End Object Detection with Transformers." ECCV 2020. arXiv:2005.12872.
16. Cheng, B., Schwing, A. G., and Kirillov, A. "Per-Pixel Classification is Not All You Need for Semantic Segmentation" (MaskFormer). NeurIPS 2021. arXiv:2107.06278.
17. Cheng, B., Misra, I., Schwing, A. G., Kirillov, A., and Girdhar, R. "Masked-attention Mask Transformer for Universal Image Segmentation" (Mask2Former). CVPR 2022. arXiv:2112.01527.
18. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Dollar, P., and Girshick, R. "Segment Anything." ICCV 2023. arXiv:2304.02643.
19. Guler, R. A., Neverova, N., and Kokkinos, I. "DensePose: Dense Human Pose Estimation In The Wild." CVPR 2018. arXiv:1802.00434.
20. Wu, Y., Kirillov, A., Massa, F., Lo, W.-Y., and Girshick, R. "Detectron2." Facebook AI Research, 2019. [Repository](https://github.com/facebookresearch/detectron2). [Model zoo](https://github.com/facebookresearch/detectron2/blob/main/MODEL_ZOO.md).
21. OpenMMLab. "MMDetection: Open MMLab Detection Toolbox and Benchmark." Chen, K. et al. arXiv:1906.07155. [Documentation](https://mmdetection.readthedocs.io/).
22. PyTorch. "torchvision.models.detection.maskrcnn_resnet50_fpn." [API documentation](https://docs.pytorch.org/vision/stable/models/generated/torchvision.models.detection.maskrcnn_resnet50_fpn.html).
23. IEEE Signal Processing Society. "ICCV 2017 Best Paper Award: Mask R-CNN." January 2018. [Newsletter](https://signalprocessingsociety.org/newsletter/2018/01/iccv-2017-best-paper-award-mask-r-cnn).