# SSD (Single Shot MultiBox Detector)

> Source: https://aiwiki.ai/wiki/ssd_object_detection
> Updated: 2026-06-25
> Categories: Computer Vision, Deep Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**SSD (Single Shot MultiBox Detector)** is a single stage [object detection](/wiki/object_detection) model that predicts bounding boxes and per class confidence scores in one forward pass through a convolutional network, using a dense set of default boxes (anchors) tiled at multiple feature map scales [1]. Introduced by Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu and Alexander C. Berg, SSD was first posted to arXiv on December 8, 2015 (arXiv:1512.02325) and published at the European Conference on Computer Vision (ECCV) in 2016 [1]. Its headline configuration, SSD300, reaches 74.3% mean average precision (mAP) on PASCAL VOC 2007 test while running at 59 frames per second on an Nvidia Titan X, making it one of the first single stage detectors to match two stage accuracy at real time speed [1].

The paper frames the method plainly: "We present a method for detecting objects in images using a single deep neural network" that "discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location" [1]. The two stage methods it was benchmarked against, such as [Faster R-CNN](/wiki/faster_r_cnn), rely on a separate region proposal stage that SSD drops [1].

The authors were affiliated with four institutions: UNC Chapel Hill (Wei Liu, Cheng-Yang Fu and Alexander Berg, who was Wei Liu's advisor), Zoox Inc. (Dragomir Anguelov), Google Inc. (Dumitru Erhan and Christian Szegedy) and the University of Michigan, Ann Arbor (Scott Reed) [1]. The system was released as an open source fork of Caffe in the `weiliu89/caffe` repository on GitHub, which became the canonical reference implementation for years [12].

## What is SSD?

SSD is a convolutional [object detection](/wiki/object_detection) architecture that produces all of its detections in a single forward pass, with no separate region proposal step [1]. A backbone [convolutional neural network](/wiki/convolutional_neural_network) extracts features, several extra convolutional layers are stacked on top to produce feature maps at decreasing resolutions, and a small convolutional predictor is attached to each of those maps. For every default box at every cell of every chosen feature map, the predictor outputs class confidences and four box offsets simultaneously. The abstract summarizes the speed advantage directly: "The fundamental improvement in speed comes from eliminating bounding box proposals and the subsequent pixel or feature resampling stage" [1].

The authors also stress deployability: "This makes SSD easy to train and straightforward to integrate into systems that require a detection component" [1]. That property, more than raw benchmark accuracy, is why SSD and its mobile descendants remain in production a decade later.

## When was SSD released?

The SSD paper was first posted to arXiv on December 8, 2015 (arXiv:1512.02325), and the work was presented at ECCV 2016 [1]. It arrived at a moment when deep object detection had split into two camps, one accurate but slow and the other fast but inaccurate, and it was among the first methods to combine the speed of one with the accuracy of the other.

By late 2015, two stage region based methods (R-CNN, Fast R-CNN, Faster R-CNN) dominated the accuracy leaderboards on PASCAL VOC and the COCO challenge. Faster R-CNN with a VGG-16 backbone reached 73.2% mAP on PASCAL VOC 2007 test, but it ran at only 7 frames per second on a Titan X [1]. That was too slow for many production scenarios such as robotics, mobile inference or on-board car perception.

On the other side, the first version of YOLO (You Only Look Once), published by Joseph Redmon and colleagues at CVPR 2016, regressed boxes and classes directly from a single grid of cells [7]. It hit 45 frames per second but only 63.4% mAP on the same benchmark [7]. So users had to choose between accurate and slow or fast and inaccurate.

SSD removed the region proposal stage entirely, like YOLO, and added two ideas that pushed accuracy back up: predictions from multiple feature map resolutions, and a dense set of default boxes (anchor boxes) at each spatial location of those feature maps [1]. With a 300x300 input it ran at 59 FPS on a Titan X and reached 74.3% mAP on VOC 2007 test. With a 512x512 input it reached 76.8% mAP at about 22 FPS, surpassing Faster R-CNN on accuracy while still being faster [1].

The MultiBox in the name refers back to earlier work. In the 2014 CVPR paper *Scalable Object Detection using Deep Neural Networks*, Dumitru Erhan, Christian Szegedy, Alexander Toshev and Dragomir Anguelov introduced a system that learned to regress class agnostic boxes with confidence scores [2]. SSD inherits the spirit of that approach, removes the class agnostic step and folds the box and class prediction into a single dense convolutional head.

## How does SSD work?

SSD is built on top of a convolutional backbone with a set of additional convolutional feature layers added on top, plus convolutional predictors attached at multiple resolutions [1].

### What backbone does SSD use?

The original SSD uses [VGG-16](/wiki/vgg) as the backbone, truncated before the classification layers [1]. The fully connected layers fc6 and fc7 in VGG are converted to convolutional layers (with subsampling) so that the output of fc7 is a 19x19 feature map for a 300x300 input. The atrous (dilated) convolution trick is applied so the receptive field stays large. After the converted fc7, a series of extra convolutional layers progressively reduce spatial resolution, producing feature maps of 10x10, 5x5, 3x3 and 1x1 [1].

Other backbones followed in later work and library implementations: ResNet-101 in DSSD [3], MobileNet V1, V2 and V3 in MobileNet-SSD and SSDLite [5], and various lightweight backbones in community ports.

### Why does SSD predict at multiple feature map scales?

In SSD300, predictions are made at six feature map resolutions [1]:

| Feature map  | Spatial size | Default boxes per cell | Boxes contributed |
| ------------ | ------------ | ---------------------- | ----------------- |
| Conv4_3      | 38x38        | 4                      | 5,776             |
| Conv7 (FC7)  | 19x19        | 6                      | 2,166             |
| Conv8_2      | 10x10        | 6                      | 600               |
| Conv9_2      | 5x5          | 6                      | 150               |
| Conv10_2     | 3x3          | 4                      | 36                |
| Conv11_2     | 1x1          | 4                      | 4                 |
| Total        |              |                        | 8,732             |

The rationale is straightforward. Earlier feature maps have a small stride and a small receptive field, so they are good at detecting small objects. Later feature maps have a coarse stride and a large receptive field, so they are good at detecting large objects. Predicting at all of these scales naturally handles a wide range of object sizes without an explicit image pyramid [1].

SSD512 keeps the same idea but adds one more feature map and uses a 512x512 input, which yields 24,564 default boxes in total [1].
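The box totals can be re-derived from the per-layer figures. The sketch below copies the SSD300 layout from the table above. The SSD512 layout is an assumption: this article does not list it, and the layer names and sizes here are simply a configuration that reproduces the stated total.

```python
# (feature map name, side length, default boxes per cell), from the SSD300 table.
SSD300_LAYERS = [
    ("Conv4_3", 38, 4),
    ("Conv7", 19, 6),
    ("Conv8_2", 10, 6),
    ("Conv9_2", 5, 6),
    ("Conv10_2", 3, 4),
    ("Conv11_2", 1, 4),
]

# ASSUMPTION: SSD512 layout with one extra feature map; not listed in the article.
SSD512_LAYERS = [
    ("Conv4_3", 64, 4),
    ("Conv7", 32, 6),
    ("Conv8_2", 16, 6),
    ("Conv9_2", 8, 6),
    ("Conv10_2", 4, 6),
    ("Conv11_2", 2, 4),
    ("Conv12_2", 1, 4),
]


def count_default_boxes(layers):
    """Sum cells * boxes-per-cell over every prediction layer."""
    return sum(side * side * per_cell for _, side, per_cell in layers)


print(count_default_boxes(SSD300_LAYERS))  # 8732
print(count_default_boxes(SSD512_LAYERS))  # 24564
```

Almost two thirds of the SSD300 boxes come from the 38x38 map alone, which is why the highest resolution layer dominates both the memory cost of the head and the pool of negatives during training.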

### What are default boxes (anchors)?

At each cell of each chosen feature map, SSD places a small set of default boxes with predefined aspect ratios (typically 1, 2, 1/2, 3, 1/3) and two scales for the aspect ratio 1 box [1]. The aspect ratios determine how many default boxes per cell each layer uses (4 or 6 in SSD300).
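A minimal sketch of how such a set of default boxes could be tiled over one feature map follows. It assumes the parameterization from the SSD paper (width `s * sqrt(ar)`, height `s / sqrt(ar)`, and a second square box at scale `sqrt(s * s_next)`); the function name and the concrete scale values are hypothetical and chosen only for illustration.

```python
import math


def default_boxes_for_map(side, scale, next_scale, aspect_ratios):
    """Return default boxes as (cx, cy, w, h) tuples in normalized [0, 1] coordinates.

    side: feature map side length (for example 5 for a 5x5 map).
    scale, next_scale: box scale for this layer and for the next one.
    aspect_ratios: ratios for the regular boxes; the ratio 1 box gets a second scale.
    """
    boxes = []
    extra = math.sqrt(scale * next_scale)  # second aspect ratio 1 box
    for row in range(side):
        for col in range(side):
            # Box centers sit in the middle of each feature map cell.
            cx = (col + 0.5) / side
            cy = (row + 0.5) / side
            for ar in aspect_ratios:
                boxes.append((cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar)))
            boxes.append((cx, cy, extra, extra))
    return boxes


# Hypothetical scales; five ratios plus the extra square box give 6 boxes per cell.
six_per_cell = default_boxes_for_map(5, 0.6, 0.75, (1, 2, 1 / 2, 3, 1 / 3))
# Three ratios plus the extra square box give 4 boxes per cell.
four_per_cell = default_boxes_for_map(3, 0.8, 0.9, (1, 2, 1 / 2))
print(len(six_per_cell), len(four_per_cell))  # 150 36
```

The two counts match the 5x5 and 3x3 rows of the SSD300 table, which is a quick check that "4 or 6 boxes per cell" follows directly from how many aspect ratios a layer uses.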

For every default box the network predicts:

* Confidence scores for each of the C+1 classes (C foreground classes plus background).
* Four offsets relative to the default box: delta cx, delta cy, delta w, delta h. These are decoded against the default box to produce the final predicted bounding box.
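The decoding step can be sketched as the inverse of the offset encoding used in the SSD paper: center offsets are scaled by the default box size, and width and height offsets are log ratios. This is an illustrative sketch, not any library's API. The function names are hypothetical, and the sketch assumes no extra "variance" scaling of the offsets, which some implementations add.

```python
import math


def encode(gt, default):
    """Offsets that map a default box onto a ground truth box; both are (cx, cy, w, h)."""
    g_cx, g_cy, g_w, g_h = gt
    d_cx, d_cy, d_w, d_h = default
    return (
        (g_cx - d_cx) / d_w,
        (g_cy - d_cy) / d_h,
        math.log(g_w / d_w),
        math.log(g_h / d_h),
    )


def decode(offsets, default):
    """Apply predicted (delta cx, delta cy, delta w, delta h) to a default box."""
    t_cx, t_cy, t_w, t_h = offsets
    d_cx, d_cy, d_w, d_h = default
    return (
        d_cx + t_cx * d_w,
        d_cy + t_cy * d_h,
        d_w * math.exp(t_w),
        d_h * math.exp(t_h),
    )


default = (0.5, 0.5, 0.2, 0.4)
target = (0.55, 0.45, 0.3, 0.3)
recovered = decode(encode(target, default), default)  # round trip back to target
```

Zero offsets leave the default box unchanged, so the network only has to learn small corrections around boxes that are already roughly the right shape.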

Because the default boxes are tiled densely across multiple feature maps, SSD300 produces 8,732 candidate detections per image and SSD512 produces 24,564 [1]. These numbers are what motivate non maximum suppression at inference time and the careful sampling strategy at training time.
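Greedy non maximum suppression for a single class can be sketched as follows. Boxes here are `(xmin, ymin, xmax, ymax)` tuples. The default threshold of 0.45 and the cap of 200 kept detections follow the settings reported in the SSD paper; they are not stated in this article and should be treated as assumptions, as should the function names.

```python
def iou(a, b):
    """Jaccard overlap of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45, top_k=200):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a better box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if len(keep) == top_k:
                break
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the distant third box survives. In a multi class detector this routine would typically run once per class after a confidence threshold has already discarded most of the candidates.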

### How is SSD trained?

The training loss is a weighted sum of two components [1]:

1. **Confidence loss**: a softmax cross entropy loss over the C+1 class scores, computed on the matched default boxes (positives) and on a chosen subset of negatives.
2. **Localization loss**: a Smooth L1 loss between the predicted box offsets and the ground truth offsets, computed only on positive matches.
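The two components can be combined as in the sketch below. The normalization by the number of matched boxes N, the zero loss when N is 0, and the default weight `alpha = 1` follow the SSD paper rather than anything stated in this article, so treat them as assumptions; the function names are hypothetical, and the per box cross entropy values are taken as precomputed inputs.

```python
def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for large errors."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def ssd_loss(conf_losses, pred_offsets, target_offsets, alpha=1.0):
    """Weighted sum of confidence and localization loss, normalized by matched boxes.

    conf_losses: cross entropy values for the positives plus the selected negatives.
    pred_offsets, target_offsets: one 4-tuple per positive (matched) default box.
    """
    num_pos = len(pred_offsets)
    if num_pos == 0:
        return 0.0  # no matched boxes: the loss is defined as zero
    loc = sum(
        smooth_l1(p - t)
        for pred, target in zip(pred_offsets, target_offsets)
        for p, t in zip(pred, target)
    )
    conf = sum(conf_losses)
    return (conf + alpha * loc) / num_pos


loss = ssd_loss(
    conf_losses=[1.0, 2.0],
    pred_offsets=[(0.5, 0.0, 0.0, 0.0)],
    target_offsets=[(0.0, 0.0, 0.0, 0.0)],
)
print(loss)  # 3.125
```

Smooth L1 keeps the gradient bounded for badly localized boxes, which matters early in training when predicted offsets can be far from their targets.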

A default box is considered a positive match for a ground truth box if its Jaccard overlap (IoU) with that ground truth is above 0.5, or if it is the box with the highest IoU for some ground truth (so that every ground truth has at least one match) [1].
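The matching rule can be sketched in two passes: threshold matching first, then a pass that guarantees each ground truth its best default box. This is a simplified illustration with hypothetical function names. It assigns a default box that overlaps several ground truths to the one with the highest IoU, which is an assumption about tie handling rather than something the article specifies.

```python
def iou(a, b):
    """Jaccard overlap of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_default_boxes(defaults, ground_truths, threshold=0.5):
    """Return a list whose entry i is the matched ground truth index, or None."""
    if not defaults:
        return []
    matches = [None] * len(defaults)
    # Pass 1: a default box is positive if its IoU with some ground truth exceeds the threshold.
    for d, dbox in enumerate(defaults):
        best_gt, best_iou = None, threshold
        for g, gbox in enumerate(ground_truths):
            overlap = iou(dbox, gbox)
            if overlap > best_iou:
                best_gt, best_iou = g, overlap
        matches[d] = best_gt
    # Pass 2: every ground truth claims its single best default box, even below the threshold.
    for g, gbox in enumerate(ground_truths):
        best_d = max(range(len(defaults)), key=lambda d: iou(defaults[d], gbox))
        matches[best_d] = g
    return matches


defaults = [(0, 0, 10, 10), (0, 0, 4, 4), (50, 50, 60, 60)]
ground_truths = [(0, 0, 9, 9), (52, 52, 70, 70)]
print(match_default_boxes(defaults, ground_truths))  # [0, None, 1]
```

The third default box overlaps the second ground truth with an IoU of only about 0.18, yet it still becomes a positive through the second pass, so that object is not ignored during training.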

### What is hard negative mining in SSD?

Single stage detectors face a severe foreground background imbalance. Out of thousands of default boxes per image, only a handful match any ground truth, so the rest are negatives. Treating them all as training examples drowns the loss in easy backgrounds.

SSD addresses this with hard negative mining. As the paper describes it, after matching positives "we sort them using the highest confidence loss for each default box and pick the top ones so that the ratio between the negatives and positives is at most 3:1" [1]. This focuses learning on the hard examples and stabilizes training significantly compared to using all negatives.
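The selection step described in that quote can be sketched as below. The function name is hypothetical, and the per box confidence losses are assumed to have been computed already.

```python
def mine_hard_negatives(conf_losses, is_positive, neg_pos_ratio=3):
    """Return indices of the negatives to keep, hardest (highest loss) first.

    conf_losses: confidence loss for every default box.
    is_positive: True where a default box matched a ground truth.
    """
    num_pos = sum(1 for pos in is_positive if pos)
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Sort negatives by descending loss so the most confidently wrong come first.
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    # Keep at most neg_pos_ratio negatives per positive.
    return negatives[: neg_pos_ratio * num_pos]


conf_losses = [0.1, 2.0, 0.05, 1.5, 0.7, 0.3, 0.9]
is_positive = [True, False, False, False, False, False, False]
print(mine_hard_negatives(conf_losses, is_positive))  # [1, 3, 6]
```

With one positive, only the three highest loss negatives enter the confidence loss; the remaining easy backgrounds contribute nothing to the gradient for this image.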

The later RetinaNet paper would replace this heuristic with a smoother, principled alternative called focal loss [4].

### How much does data augmentation help SSD?

The original paper credits a substantial chunk of SSD's accuracy to its data augmentation pipeline. Each training image is randomly transformed using one of several strategies [1]:

* Use the entire original image.
* Sample a patch with a minimum Jaccard overlap of 0.1, 0.3, 0.5, 0.7 or 0.9 with one of the ground truth boxes.
* Sample a random patch.

Patches are resized to the network input size, photometric distortions are applied, and the image is randomly horizontally flipped. The authors report that "we can improve 8.8% mAP with this sampling strategy" on VOC 2007, one of the largest single contributions reported in the paper [1].
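The overlap constrained option can be sketched as rejection sampling over random patches. The patch size range of 0.1 to 1 of the image and the aspect ratio range of 1/2 to 2 follow the SSD paper and are not stated in this article, so treat them as assumptions; the function name, the `max_tries` cap and the fallback behaviour are hypothetical.

```python
import random


def iou(a, b):
    """Jaccard overlap of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def sample_patch(gt_boxes, min_iou, rng, max_tries=50):
    """Sample a crop in normalized coordinates that overlaps some ground truth enough.

    Returns (xmin, ymin, xmax, ymax), or None if no valid patch was found,
    in which case the caller can fall back to the whole image.
    """
    for _ in range(max_tries):
        w = rng.uniform(0.1, 1.0)
        h = rng.uniform(0.1, 1.0)
        if not 0.5 <= w / h <= 2.0:
            continue  # reject extreme aspect ratios
        x = rng.uniform(0.0, 1.0 - w)
        y = rng.uniform(0.0, 1.0 - h)
        patch = (x, y, x + w, y + h)
        if any(iou(patch, box) >= min_iou for box in gt_boxes):
            return patch
    return None


rng = random.Random(0)
min_iou = rng.choice([0.1, 0.3, 0.5, 0.7, 0.9])  # one of the listed overlap levels
patch = sample_patch([(0.2, 0.2, 0.8, 0.8)], min_iou, rng, max_tries=1000)
```

After cropping, a full pipeline would also need to clip or drop ground truth boxes that fall outside the patch before resizing; that bookkeeping is omitted here.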

## What are the SSD300 and SSD512 variants?

The original SSD paper presented two main variants distinguished by input resolution [1].

| Variant | Input size | VOC 2007 test mAP (trained on VOC 07+12) | COCO test-dev2015 mAP | FPS on Titan X |
| ------- | ---------- | -------------------------- | ---------------------- | -------------- |
| SSD300  | 300x300    | 74.3%                      | 23.2                   | ~59            |
| SSD512  | 512x512    | 76.8%                      | 26.8                   | ~22            |

The trade off is the standard one for resolution: larger inputs see more pixels per object and detect small objects better, at higher computational cost. The 59 FPS for SSD300 and 22 FPS for SSD512 are measured at batch size 8 on a Titan X; at batch size 1 the same models run at roughly 46 FPS and 19 FPS respectively [1].

## How does SSD compare to YOLO and Faster R-CNN?

To understand where SSD sat in 2016, here is a comparison with the strongest published baselines from the same paper [1].

| Model                  | VOC 2007 mAP | COCO mAP@[0.5:0.95] | FPS (Titan X) |
| ---------------------- | ------------ | -------------------- | -------------- |
| Faster R-CNN (VGG-16)  | 73.2%        | ~21.9                | 7              |
| YOLO v1                | 63.4%        | n/a                  | 45             |
| SSD300                 | 74.3%        | 23.2                 | 59             |
| SSD512                 | 76.8%        | 26.8                 | 22             |

This was the headline result. SSD300 was both faster and more accurate than Faster R-CNN VGG-16, and SSD512 widened the accuracy gap further while staying close to real time [1]. Relative to [YOLO](/wiki/yolo) v1, SSD300 was both faster (59 versus 45 FPS) and far more accurate (74.3% versus 63.4% mAP), chiefly because the dense multi scale default boxes give SSD far more, and better placed, candidate boxes than YOLO's coarse single grid.

## Benchmark numbers in context

SSD sits in a long lineage of detectors. The table below places it against other widely cited methods. Numbers are approximate and depend on backbone and training details, but they give a useful sense of the design space.

| Detector       | Year | Stage    | Anchor based | Reported FPS  | COCO mAP@[0.5:0.95] | Key innovation                                              |
| -------------- | ---- | -------- | ------------ | ------------- | -------------------- | ----------------------------------------------------------- |
| [Faster R-CNN](/wiki/faster_r_cnn)   | 2015 | Two      | Yes          | 7 (VGG-16)    | ~21.9 (VGG)          | Region Proposal Network, end to end RoI based detection     |
| SSD            | 2015 | One      | Yes          | 22 to 59      | 23.2 to 26.8         | Multi scale dense prediction with default boxes             |
| [YOLO](/wiki/yolo) v1| 2016 | One      | No (grid)    | 45            | n/a                  | Whole image regression of boxes from a coarse grid          |
| YOLO v2        | 2017 | One      | Yes          | 40 to 90      | ~21.6                | Anchor boxes, batch normalization, multi scale training     |
| [Mask R-CNN](/wiki/mask_r_cnn) | 2017 | Two      | Yes          | 5             | ~37.1                | Adds a parallel mask branch with RoIAlign                   |
| RetinaNet      | 2017 | One      | Yes          | 5 to 11       | ~39.1                | Focal loss removes the need for hard negative mining        |
| YOLO v3        | 2018 | One      | Yes          | 20 to 50      | 33.0                 | Darknet-53 backbone with multi scale predictions            |
| CenterNet      | 2019 | One      | No (anchor free) | 28 to 142  | ~42.1                | Predict object as a center keypoint plus size offsets       |
| YOLO v4 / v5   | 2020 | One      | Yes          | 30 to 140     | ~43 to 50            | CSPNet backbones, mosaic augmentation, careful tricks       |
| [DETR](/wiki/detr)   | 2020 | One (set prediction) | No   | ~28           | ~42 to 44            | Transformer encoder decoder with bipartite matching loss    |
| YOLO v8 / v10  | 2023+ | One     | Mostly anchor free | 100+ on T4 | ~45 to 54            | Decoupled heads, anchor free, end to end NMS removal in v10 |

The headline takeaway is that SSD established the design pattern (single stage, multi scale, dense anchors) that dominated single stage detection until the focal loss and anchor free waves arrived.

## What variants and successors followed SSD?

SSD inspired a long line of follow up work that kept the basic structure (single stage, multi scale features, dense anchors) and improved various pieces.

| Variant         | Year | Main idea                                                                                          | Notes                                                                                                |
| --------------- | ---- | --------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------- |
| DSSD            | 2017 | Add deconvolution layers and skip connections, swap VGG for ResNet-101                              | DSSD513 reaches 81.5% mAP on VOC 2007 and 33.2 mAP on COCO; better on small objects [3]              |
| MobileNet-SSD   | 2017 | Replace VGG backbone with MobileNet for mobile and embedded deployment                              | Used heavily in TensorFlow Object Detection API model zoo and on edge devices [13]                   |
| FSSD            | 2018 | Feature Fusion SSD; concatenates features from multiple layers before producing prediction maps     | Improves small object detection while staying fast                                                   |
| RFB Net         | 2018 | Receptive Field Block, modeled after the human visual receptive field                                | Adds dilated convolutions in a parallel branch on top of SSD                                         |
| SSDLite         | 2018 | Replace standard convolutions in SSD heads with depthwise separable convolutions; pair with [MobileNet](/wiki/mobilenet) V2 / V3 | Designed for mobile and on-device inference; the standard mobile detection baseline for years [5]        |

The MobileNet-SSD and SSDLite combinations turned out to be the most enduring deployments. They are still in routine use in the TensorFlow Object Detection API model zoo, in TensorFlow Lite, in Coral Edge TPU sample apps, and in OpenCV's DNN module [13].

## How does SSD handle class imbalance?

The class imbalance question is worth its own section because it shapes how the field evolved after SSD.

Single stage detectors evaluate every default box at every spatial location. The vast majority are background. If the loss weights all of them equally, the gradient is dominated by easy negatives and the model never learns to discriminate hard ones. The two stage detectors avoid this implicitly: their region proposal network filters out most background candidates before the classifier sees them.

SSD addressed the imbalance with the 3:1 hard negative mining trick described above [1]. It works, but it has the feel of a tuned heuristic, and the ratio is one more hyperparameter.

In 2017 Tsung-Yi Lin and colleagues at Facebook AI Research published *Focal Loss for Dense Object Detection*, which introduced focal loss and the RetinaNet detector [4]. Focal loss reshapes cross entropy with a modulating factor (1 - p_t)^gamma so that easy, well classified examples contribute much less to the loss than hard ones. This lets the network train on every default box without any sampling, and it pushed single stage accuracy past two stage methods on COCO for the first time [4]. RetinaNet effectively replaced the SSD style sampling heuristic with a principled loss design.
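The effect of the modulating factor is easy to see numerically. The sketch below implements only the `(1 - p_t)^gamma` term described above applied to cross entropy; the default `gamma = 2` is the value commonly associated with the RetinaNet paper and is an assumption here, and the additional class balancing weight used in that paper is left out.

```python
import math


def cross_entropy(p_t):
    """Standard cross entropy for the probability assigned to the true class."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0):
    """Cross entropy scaled by the modulating factor (1 - p_t) ** gamma."""
    return (1.0 - p_t) ** gamma * cross_entropy(p_t)


for p_t in (0.9, 0.5, 0.1):
    ratio = focal_loss(p_t) / cross_entropy(p_t)
    print(f"p_t={p_t}: focal loss keeps {ratio:.2f} of the cross entropy")
```

An easy example with `p_t = 0.9` keeps about 1% of its cross entropy, while a hard example with `p_t = 0.1` keeps about 81%, so thousands of easy background boxes no longer swamp the few hard ones. Setting `gamma = 0` recovers plain cross entropy.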

Later anchor free detectors such as CenterNet (2019), FCOS (2019) and corner based detectors removed default boxes entirely, and the transformer based [DETR](/wiki/detr) (2020) reframed detection as a set prediction problem with bipartite matching. Those follow up directions all owe something to the dense prediction view that SSD popularized.

## What is SSD used for?

SSD remains a practical choice in scenarios that prize predictability, low latency and easy deployment over the absolute best accuracy.

| Domain                       | Typical use                                                                                       |
| ---------------------------- | ------------------------------------------------------------------------------------------------- |
| Mobile and embedded vision   | MobileNet-SSD and SSDLite for phones, smart cameras, doorbells, drones                            |
| Robotics                     | Grasping pipelines, navigation, semantic mapping where a fixed object vocabulary is enough        |
| Autonomous driving (early)   | Vehicle, pedestrian and sign detection in research stacks before transformer detectors took over  |
| Surveillance and security    | Person and vehicle detection at the edge, sometimes paired with a tracking module                 |
| Augmented reality            | Lightweight on device detection of common objects                                                 |
| Industrial inspection        | Defect detection on conveyor lines where the class set is small and labeled data is plentiful     |
| Education and tutorials      | SSD is a teaching staple in computer vision courses; the architecture is small enough to follow  |

A common production pattern is to fine tune a MobileNet-SSD or SSDLite model on a custom dataset of a few thousand images, export it to TensorFlow Lite or ONNX, and run inference on a Raspberry Pi, Jetson Nano or mobile phone at tens of frames per second.

## Implementations

SSD has been ported to almost every deep learning framework. The most commonly used implementations include:

| Implementation                              | Notes                                                                                                       |
| ------------------------------------------- | ----------------------------------------------------------------------------------------------------------- |
| weiliu89/caffe (branch `ssd`)               | The original Caffe implementation by Wei Liu, considered the canonical reference [12]                       |
| TensorFlow Object Detection API             | Maintained by Google; ships pretrained MobileNet-SSD, ResNet-SSD and SSDLite checkpoints in its model zoo [13]   |
| `torchvision.models.detection.ssd300_vgg16` | Official PyTorch port; a faithful SSD300 with a VGG-16 backbone [14]                                             |
| `torchvision.models.detection.ssdlite320_mobilenet_v3_large` | Mobile variant in torchvision with a 320x320 input and a MobileNet V3 Large backbone [14]        |
| MMDetection (OpenMMLab)                     | Provides multiple SSD variants and benchmarks within a unified detection framework                         |
| OpenCV DNN                                  | Reads MobileNet-SSD Caffe and TensorFlow models, frequently used in C++ and Python deployment              |
| Many community PyTorch ports                | `amdegroot/ssd.pytorch`, `lufficc/SSD`, `qfgaohao/pytorch-ssd` and others                                  |

For mobile deployment, SSDLite combined with a MobileNet V2 or V3 backbone is the most common choice; it converts cleanly to TensorFlow Lite and to Core ML for iOS [5].

## Strengths

* **Single stage and end to end**: no separate region proposal step. Detection is one forward pass plus non maximum suppression [1].
* **Real time on commodity hardware**: SSD300 hit 59 FPS on a Titan X in 2016 [1], and lightweight MobileNet based variants bring real time detection to phones.
* **Multi scale by construction**: predicting at six feature map resolutions handles a wide range of object sizes without an image pyramid [1].
* **Easy to deploy**: the pure convolutional structure exports cleanly to ONNX, TensorRT, TFLite and Core ML, which is one reason SSD survives in mobile and embedded stacks long after newer detectors have surpassed it on benchmarks.
* **Reasonable training cost**: training takes a day or two on a single GPU for VOC scale data, and a week or so for COCO scale data, which is small compared to many newer detectors.
* **A teaching staple**: the design is small enough to study end to end, which is why almost every computer vision course covers it.

## Limitations

* **Small object accuracy**: the original SSD lags two stage methods on small objects. This is because the shallow, high resolution feature maps that detect small objects (Conv4_3 in SSD300) lack high level semantic context. DSSD, FSSD and FPN style designs address this with top down pathways [3].
* **Hand designed default boxes**: aspect ratios and scales are tuned by hand. Newer anchor free detectors avoid this hyperparameter set entirely.
* **Heuristic hard negative mining**: the 3:1 sampling ratio works but feels less clean than focal loss [4].
* **Sensitive to backbone and image size**: a great deal of the practical accuracy comes from the backbone and input resolution. SSD300 with VGG-16 is much weaker than SSD512 with ResNet-101.
* **Surpassed by later methods on accuracy**: RetinaNet (2017), then anchor free detectors (CenterNet, FCOS), then transformer based detectors ([DETR](/wiki/detr) and follow ups), and the current YOLO families all beat the original SSD on COCO [4].

## Influence and legacy

The original SSD paper has been cited tens of thousands of times and is one of the most cited object detection papers [1]. Its influence shows up in several distinct ways.

* **The single stage anchor based template**: YOLO v2 adopted anchor boxes [8], and many of the single stage anchor based detectors that dominated 2016 to 2020 follow the structure that SSD established.
* **Multi scale dense prediction**: predicting at multiple feature map resolutions later evolved into Feature Pyramid Networks (FPN), which became standard in both stages of two stage detectors and in single stage detectors like RetinaNet [4].
* **MobileNet-SSD and SSDLite**: these continue to be deployed on millions of devices through TensorFlow Lite and Core ML, especially in low cost cameras, doorbells, and embedded vision systems [5].
* **Pedagogical anchor**: SSD remains the standard introduction to single stage detection in textbooks and university courses.

The field has moved on in benchmark accuracy, but the SSD architecture is still common in production. When you need to detect a fixed set of object classes on a small device with a tight latency budget, an SSDLite model with a [MobileNet](/wiki/mobilenet) backbone is often still the right answer.

## ELI5: SSD explained simply

Imagine you want to find every cat, dog and car in a photo, fast. A slower method first circles thousands of "maybe something is here" regions, then looks at each circle one by one. SSD skips that. It lays a fixed grid of pre-shaped boxes (some tall, some wide, some big, some small) over the picture all at once, and in a single glance it guesses, for each box, "is there an object here, and if so, what is it, and exactly where are its edges?" Because it looks at several zoomed-in and zoomed-out versions of the picture at the same time, it can spot both a tiny faraway sign and a big nearby truck. Doing everything in one look is what makes it quick enough to run on a phone or a small camera.

## See also

* [Object detection](/wiki/object_detection)
* [YOLO (object detection)](/wiki/yolo)
* [Faster R-CNN](/wiki/faster_r_cnn)
* [Mask R-CNN](/wiki/mask_r_cnn)
* [DETR](/wiki/detr)
* [VGG](/wiki/vgg)
* [MobileNet](/wiki/mobilenet)
* [ResNet](/wiki/resnet)
* [Convolutional Neural Network](/wiki/convolutional_neural_network)

## References

1. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A. C. (2016). *SSD: Single Shot MultiBox Detector*. ECCV 2016. arXiv:1512.02325. [https://arxiv.org/abs/1512.02325](https://arxiv.org/abs/1512.02325)
2. Erhan, D., Szegedy, C., Toshev, A., Anguelov, D. (2014). *Scalable Object Detection using Deep Neural Networks*. CVPR 2014. arXiv:1312.2249. [https://arxiv.org/abs/1312.2249](https://arxiv.org/abs/1312.2249)
3. Fu, C.-Y., Liu, W., Ranga, A., Tyagi, A., Berg, A. C. (2017). *DSSD: Deconvolutional Single Shot Detector*. arXiv:1701.06659. [https://arxiv.org/abs/1701.06659](https://arxiv.org/abs/1701.06659)
4. Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollar, P. (2017). *Focal Loss for Dense Object Detection*. ICCV 2017. arXiv:1708.02002. [https://arxiv.org/abs/1708.02002](https://arxiv.org/abs/1708.02002)
5. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.-C. (2018). *MobileNetV2: Inverted Residuals and Linear Bottlenecks*. CVPR 2018. arXiv:1801.04381. [https://arxiv.org/abs/1801.04381](https://arxiv.org/abs/1801.04381)
6. Howard, A. G., Zhu, M., Chen, B., et al. (2017). *MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications*. arXiv:1704.04861. [https://arxiv.org/abs/1704.04861](https://arxiv.org/abs/1704.04861)
7. Redmon, J., Divvala, S., Girshick, R., Farhadi, A. (2016). *You Only Look Once: Unified, Real-Time Object Detection*. CVPR 2016. arXiv:1506.02640. [https://arxiv.org/abs/1506.02640](https://arxiv.org/abs/1506.02640)
8. Redmon, J., Farhadi, A. (2017). *YOLO9000: Better, Faster, Stronger*. CVPR 2017. arXiv:1612.08242. [https://arxiv.org/abs/1612.08242](https://arxiv.org/abs/1612.08242)
9. Redmon, J., Farhadi, A. (2018). *YOLOv3: An Incremental Improvement*. arXiv:1804.02767. [https://arxiv.org/abs/1804.02767](https://arxiv.org/abs/1804.02767)
10. Ren, S., He, K., Girshick, R., Sun, J. (2015). *Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks*. NeurIPS 2015. arXiv:1506.01497. [https://arxiv.org/abs/1506.01497](https://arxiv.org/abs/1506.01497)
11. He, K., Gkioxari, G., Dollar, P., Girshick, R. (2017). *Mask R-CNN*. ICCV 2017. arXiv:1703.06870. [https://arxiv.org/abs/1703.06870](https://arxiv.org/abs/1703.06870)
12. Liu, W. (2016). *SSD: Single Shot MultiBox Detector* (Caffe implementation, branch `ssd`). GitHub: [https://github.com/weiliu89/caffe/tree/ssd](https://github.com/weiliu89/caffe/tree/ssd)
13. TensorFlow Object Detection API model zoo. [https://github.com/tensorflow/models/tree/master/research/object_detection](https://github.com/tensorflow/models/tree/master/research/object_detection)
14. PyTorch torchvision detection models documentation: [https://docs.pytorch.org/vision/main/models/generated/torchvision.models.detection.ssd300_vgg16.html](https://docs.pytorch.org/vision/main/models/generated/torchvision.models.detection.ssd300_vgg16.html) and [https://docs.pytorch.org/vision/main/models/generated/torchvision.models.detection.ssdlite320_mobilenet_v3_large.html](https://docs.pytorch.org/vision/main/models/generated/torchvision.models.detection.ssdlite320_mobilenet_v3_large.html)
15. Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., Zisserman, A. (2010). *The PASCAL Visual Object Classes (VOC) Challenge*. International Journal of Computer Vision.
16. Lin, T.-Y., Maire, M., Belongie, S., et al. (2014). *Microsoft COCO: Common Objects in Context*. ECCV 2014. arXiv:1405.0312. [https://arxiv.org/abs/1405.0312](https://arxiv.org/abs/1405.0312)