SSD (Single Shot MultiBox Detector)
Last reviewed
May 1, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,623 words
SSD (Single Shot MultiBox Detector) is a single stage object detection model that predicts bounding boxes and per class confidence scores in one forward pass through a convolutional network. It was one of the first single stage detectors to reach accuracy comparable to two stage methods such as Faster R-CNN while still running at real time speeds on a single GPU. SSD was introduced by Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu and Alexander C. Berg in the paper "SSD: Single Shot MultiBox Detector", first posted on arXiv on December 8, 2015 (arXiv:1512.02325) and published at the European Conference on Computer Vision (ECCV) in 2016.
The authors were affiliated with several institutions: UNC Chapel Hill (Wei Liu, Cheng-Yang Fu and Alexander Berg, who was Wei Liu's advisor), Zoox Inc. (Dragomir Anguelov), Google Inc. (Dumitru Erhan and Christian Szegedy) and the University of Michigan, Ann Arbor (Scott Reed). The system was released as open source in a fork of Caffe, the weiliu89/caffe repository on GitHub, which served as the canonical reference implementation for years.
By late 2015, deep object detection had split into two camps. Two stage region based methods (R-CNN, Fast R-CNN, Faster R-CNN) dominated the accuracy leaderboards on PASCAL VOC and the COCO challenge. Faster R-CNN with a VGG-16 backbone reached about 73.2% mAP on PASCAL VOC 2007 test, but it ran at roughly 7 frames per second on a Titan X. That was too slow for many production scenarios such as robotics, mobile inference or on-board car perception.
On the other side, the first version of YOLO (You Only Look Once), published by Joseph Redmon and colleagues at CVPR 2016, regressed boxes and classes directly from a single grid of cells. It hit 45 frames per second but only 63.4% mAP on the same benchmark. So users had to choose between accurate and slow or fast and inaccurate.
SSD was among the first published methods to break that trade off. It removed the region proposal stage entirely, like YOLO, and added two ideas that pushed accuracy back up: predictions from multiple feature map resolutions, and a dense set of default boxes (anchor boxes) at each spatial location of those feature maps. With a 300x300 input it ran at 59 FPS on a Titan X and reached 74.3% mAP on VOC 2007 test. With a 512x512 input it reached 76.8% mAP at about 22 FPS, surpassing Faster R-CNN on accuracy while still being faster.
The MultiBox in the name refers back to earlier work. In the 2014 CVPR paper "Scalable Object Detection using Deep Neural Networks", Dumitru Erhan, Christian Szegedy, Alexander Toshev and Dragomir Anguelov introduced a system that learned to regress class agnostic boxes with confidence scores. SSD inherits the spirit of that approach, removes the class agnostic step and folds the box and class prediction into a single dense convolutional head.
SSD consists of a convolutional backbone, a set of additional convolutional feature layers appended to it, and convolutional predictors attached at multiple resolutions.
The original SSD uses VGG-16 as the backbone, truncated before the classification layers. The fully connected layers fc6 and fc7 in VGG are converted to convolutional layers whose parameters are subsampled from the original weights, so that the output of fc7 is a 19x19 feature map for a 300x300 input. Atrous (dilated) convolution is applied so the receptive field stays large. After the converted fc7, a series of extra convolutional layers progressively reduce spatial resolution, producing feature maps of 10x10, 5x5, 3x3 and 1x1.
Other backbones followed in later work and library implementations: ResNet-101 in DSSD, MobileNet V1, V2 and V3 in MobileNet-SSD and SSDLite, and various lightweight backbones in community ports.
In SSD300, predictions are made at six feature map resolutions:
| Feature map | Spatial size | Default boxes per cell | Boxes contributed |
|---|---|---|---|
| Conv4_3 | 38x38 | 4 | 5,776 |
| Conv7 (FC7) | 19x19 | 6 | 2,166 |
| Conv8_2 | 10x10 | 6 | 600 |
| Conv9_2 | 5x5 | 6 | 150 |
| Conv10_2 | 3x3 | 4 | 36 |
| Conv11_2 | 1x1 | 4 | 4 |
| Total | | | 8,732 |
The rationale is straightforward. Earlier feature maps have a small stride and a small receptive field, so they are good at detecting small objects. Later feature maps have a coarse stride and a large receptive field, so they are good at detecting large objects. Predicting at all of these scales naturally handles a wide range of object sizes without an explicit image pyramid.
SSD512 keeps the same idea but adds one more feature map and uses a 512x512 input, which yields 24,564 default boxes in total.
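The totals quoted above can be checked with a few lines of arithmetic. The sketch below sums side x side x boxes-per-cell over the prediction layers. The SSD300 layout comes from the table, while the SSD512 layout (seven maps from 64x64 down to 1x1) is an assumption that is not listed in this article.

```python
# (feature map side length, default boxes per cell) for each prediction layer.
SSD300_LAYERS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

# Assumed SSD512 layout: seven prediction layers, not taken from the table above.
SSD512_LAYERS = [(64, 4), (32, 6), (16, 6), (8, 6), (4, 6), (2, 4), (1, 4)]


def count_default_boxes(layers):
    """Sum side * side * boxes_per_cell over all prediction layers."""
    return sum(side * side * per_cell for side, per_cell in layers)


print(count_default_boxes(SSD300_LAYERS))  # 8732
print(count_default_boxes(SSD512_LAYERS))  # 24564
```

Almost two thirds of the SSD300 boxes come from the 38x38 map, which is why the finest layer dominates both the compute in the prediction heads and the number of background candidates.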
At each cell of each chosen feature map, SSD places a small set of default boxes with predefined aspect ratios (typically 1, 2, 1/2, 3, 1/3) and two scales for the aspect ratio 1 box. The aspect ratios determine how many default boxes per cell each layer uses (4 or 6 in SSD300).
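As a rough illustration of how default boxes could be laid out for one feature map, the sketch below places one box per aspect ratio at every cell center, plus a second square box at an intermediate scale. The function names, the linear scale schedule between 0.2 and 0.9, and the geometric mean rule for the extra square box are assumptions chosen for illustration. Released configurations differ in their exact scales.

```python
import math


def layer_scales(num_layers, s_min=0.2, s_max=0.9):
    """Linearly spaced scales, with one extra entry used as the 'next' scale
    of the last layer."""
    step = (s_max - s_min) / (num_layers - 1)
    return [s_min + step * k for k in range(num_layers + 1)]


def default_boxes(fmap_size, scale, next_scale, aspect_ratios=(1, 2, 0.5, 3, 1 / 3)):
    """Return (cx, cy, w, h) boxes in [0, 1] image coordinates for one layer."""
    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            # Each box is centered on its feature map cell.
            cx = (col + 0.5) / fmap_size
            cy = (row + 0.5) / fmap_size
            for ar in aspect_ratios:
                boxes.append((cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar)))
            # Second square box at the geometric mean of this scale and the next.
            extra = math.sqrt(scale * next_scale)
            boxes.append((cx, cy, extra, extra))
    return boxes


scales = layer_scales(6)
boxes = default_boxes(5, scales[3], scales[4])
print(len(boxes))  # 5 * 5 cells * 6 boxes per cell = 150
```

Passing only three aspect ratios, for example `(1, 2, 0.5)`, gives the four boxes per cell used by the layers that skip the 3 and 1/3 ratios.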
For every default box the network predicts four offsets (center x, center y, width and height) relative to that default box, together with a confidence score for each object class and for background.
Because the default boxes are tiled densely across multiple feature maps, SSD300 produces 8,732 candidate detections per image and SSD512 produces 24,564. These numbers are what motivate non maximum suppression at inference time and the careful sampling strategy at training time.
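To make the inference side concrete, here is a minimal sketch of greedy non maximum suppression for a single class, written in plain Python. The function names are hypothetical, and the threshold defaults (IoU 0.45, score 0.01, top 200) are assumptions in line with values commonly reported for SSD rather than fixed parts of the method.

```python
def iou(a, b):
    """Jaccard overlap of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45, score_threshold=0.01, top_k=200):
    """Greedy NMS for one class; returns indices of kept boxes, best score first."""
    # Drop low confidence boxes, then visit the rest from highest score down.
    order = sorted(
        (i for i, s in enumerate(scores) if s > score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if len(keep) == top_k:
                break
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

In a full detector this would run once per class on the decoded boxes. The quadratic loop is fine for illustration, but production code normally uses a vectorized or GPU implementation.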
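The training loss is a weighted sum of two components: a localization loss (Smooth L1 between the predicted offsets and the encoded ground truth box, computed over matched default boxes only) and a confidence loss (softmax cross entropy over the class scores). The sum is normalized by the number of matched default boxes.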
A default box is considered a positive match for a ground truth box if its Jaccard overlap (IoU) with that ground truth is above 0.5, or if it is the box with the highest IoU for some ground truth (so that every ground truth has at least one match).
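The matching rule just described can be sketched in a few lines. This is an illustrative version with hypothetical function names that uses corner format boxes. When a default box clears the threshold for several ground truths, it takes the one with the highest overlap, which is an assumption about tie handling.

```python
def iou(a, b):
    """Jaccard overlap of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_default_boxes(defaults, ground_truths, threshold=0.5):
    """Return, for each default box, the index of its ground truth or -1."""
    matches = []
    # Rule 1: a default box is positive if its best IoU exceeds the threshold.
    for dbox in defaults:
        best_gt, best_iou = -1, threshold
        for g, gbox in enumerate(ground_truths):
            overlap = iou(dbox, gbox)
            if overlap > best_iou:
                best_gt, best_iou = g, overlap
        matches.append(best_gt)
    # Rule 2: every ground truth claims its single best default box, even if
    # that IoU is below the threshold (two ground truths could collide here;
    # this sketch lets the later one win).
    for g, gbox in enumerate(ground_truths):
        best_d = max(range(len(defaults)), key=lambda d: iou(defaults[d], gbox))
        matches[best_d] = g
    return matches


defaults = [(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30), (40, 40, 50, 50)]
ground_truths = [(0, 0, 10, 10), (22, 22, 36, 36)]
print(match_default_boxes(defaults, ground_truths))  # [0, -1, 1, -1]
```

In the example, the second ground truth overlaps its best default box with an IoU of only about 0.28, so it is matched through the second rule rather than the threshold.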
Single stage detectors face a severe foreground background imbalance. Out of thousands of default boxes per image, only a handful match any ground truth, so the rest are negatives. Treating them all as training examples drowns the loss in easy backgrounds.
SSD addresses this with hard negative mining. After matching positives, the negatives are sorted by their confidence loss (how badly the network is currently classifying them as background), and only the negatives with the highest loss are kept, so that the ratio of negatives to positives is at most 3:1. This focuses learning on the hard examples and stabilizes training significantly compared to using all negatives.
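A minimal sketch of how hard negative mining could feed into the combined loss is shown below. It assumes the per-box confidence and localization losses have already been computed. The function name, the weight `alpha = 1.0` and the normalization by the number of positives are assumptions for illustration.

```python
def ssd_loss(conf_losses, loc_losses, is_positive, neg_pos_ratio=3, alpha=1.0):
    """Combine per-box losses using hard negative mining.

    conf_losses: per-box classification loss (for example cross entropy)
    loc_losses:  per-box localization loss, only read for positive boxes
    is_positive: per-box bool, True if the box matched a ground truth
    """
    pos = [i for i, p in enumerate(is_positive) if p]
    neg = [i for i, p in enumerate(is_positive) if not p]
    num_pos = len(pos)
    if num_pos == 0:
        return 0.0  # assumed convention when an image has no matched boxes

    # Keep only the negatives the network currently gets most wrong.
    hardest = sorted(neg, key=lambda i: conf_losses[i], reverse=True)
    kept_neg = hardest[: neg_pos_ratio * num_pos]

    conf = sum(conf_losses[i] for i in pos + kept_neg)
    loc = sum(loc_losses[i] for i in pos)
    return (conf + alpha * loc) / num_pos


conf_losses = [0.2, 2.0, 0.1, 1.5, 0.05, 0.9]
loc_losses = [0.4, 0.0, 0.0, 0.0, 0.0, 0.0]
is_positive = [True, False, False, False, False, False]
print(ssd_loss(conf_losses, loc_losses, is_positive))
```

With one positive box, only the three hardest negatives (losses 2.0, 1.5 and 0.9) contribute. The two easy negatives are ignored entirely.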
The later RetinaNet paper would replace this heuristic with a smoother, principled alternative called focal loss.
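The original paper credits a substantial chunk of SSD's accuracy to its data augmentation pipeline. Each training image is randomly transformed using one of three strategies: use the entire original image, sample a patch whose minimum Jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7 or 0.9, or sample a patch at random.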
Patches are resized to the network input size, photometric distortions are applied, and the image is randomly horizontally flipped. The authors report a gain of 8.8 mAP points on VOC 2007 test (from 65.5% to 74.3%) from this augmentation alone, which is one of the largest single contributions reported in the paper.
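The patch sampling step might look roughly like the sketch below. It is a simplified illustration, not the reference pipeline. The size range (0.1 to 1 of the image), the aspect ratio range (1/2 to 2), the retry count and the rule that one sufficiently overlapped object is enough are all assumptions, and implementations differ on the last point.

```python
import random


def iou(a, b):
    """Jaccard overlap of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def sample_patch(img_w, img_h, gt_boxes, min_iou, rng, max_tries=50):
    """Try to sample a crop (x1, y1, x2, y2) that overlaps some object by at
    least min_iou. Returns None if no valid crop is found, in which case the
    caller can fall back to the whole image."""
    for _ in range(max_tries):
        w = rng.uniform(0.1, 1.0) * img_w
        h = rng.uniform(0.1, 1.0) * img_h
        if not 0.5 <= w / h <= 2.0:
            continue  # reject extreme aspect ratios
        x1 = rng.uniform(0, img_w - w)
        y1 = rng.uniform(0, img_h - h)
        patch = (x1, y1, x1 + w, y1 + h)
        if any(iou(patch, gt) >= min_iou for gt in gt_boxes):
            return patch
    return None


rng = random.Random(0)
gt_boxes = [(100, 100, 200, 200)]
patch = sample_patch(300, 300, gt_boxes, min_iou=0.1, rng=rng)
print(patch)
```

After a patch is accepted, the ground truth boxes would still need to be clipped and shifted into patch coordinates before the resize and flip steps.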
The original SSD paper presented two main variants distinguished by input resolution.
| Variant | Input size | VOC 2007 test mAP (07+12) | COCO test-dev2015 mAP | FPS on Titan X |
|---|---|---|---|---|
| SSD300 | 300x300 | 74.3% | 23.2 | ~59 |
| SSD512 | 512x512 | 76.8% | 26.8 | ~22 |
The trade off is the standard one for resolution: larger inputs see more pixels per object and detect small objects better, at higher computational cost.
To understand where SSD sat in 2016, here is a comparison with the strongest published baselines from the same paper.
| Model | VOC 2007 mAP | COCO mAP@[0.5:0.95] | FPS (Titan X) |
|---|---|---|---|
| Faster R-CNN (VGG-16) | 73.2% | ~21.9 | 7 |
| YOLO v1 | 63.4% | n/a | 45 |
| SSD300 | 74.3% | 23.2 | 59 |
| SSD512 | 76.8% | 26.8 | 22 |
This was the headline result. SSD300 was both faster and more accurate than Faster R-CNN VGG-16, and SSD512 widened the accuracy gap further while staying close to real time.
SSD inspired a long line of follow up work that kept the basic structure (single stage, multi scale features, dense anchors) and improved various pieces.
| Variant | Year | Main idea | Notes |
|---|---|---|---|
| DSSD | 2017 | Add deconvolution layers and skip connections, swap VGG for ResNet-101 | DSSD513 reaches 81.5% mAP on VOC 2007 and 33.2 mAP on COCO; better on small objects |
| MobileNet-SSD | 2017 | Replace VGG backbone with MobileNet for mobile and embedded deployment | Used heavily in TensorFlow Object Detection API model zoo and on edge devices |
| FSSD | 2018 | Feature Fusion SSD; concatenates features from multiple layers before producing prediction maps | Improves small object detection while staying fast |
| RFB Net | 2018 | Receptive Field Block, modeled after the human visual receptive field | Adds dilated convolutions in a parallel branch on top of SSD |
| SSDLite | 2018 | Replace standard convolutions in SSD heads with depthwise separable convolutions; pair with MobileNet V2 / V3 | Designed for mobile and on-device inference; the standard mobile detection baseline for years |
The MobileNet-SSD and SSDLite combinations turned out to be the most enduring deployments. They are still in routine use in the TensorFlow Object Detection API model zoo, in TensorFlow Lite, in Coral Edge TPU sample apps, and in OpenCV's DNN module.
The class imbalance question is worth its own subsection because it shapes how the field evolved after SSD.
Single stage detectors evaluate every default box at every spatial location. The vast majority are background. If the loss weights all of them equally, the gradient is dominated by easy negatives and the model never learns to discriminate hard ones. The two stage detectors avoid this implicitly: their region proposal network filters out most background candidates before the classifier sees them.
SSD addressed the imbalance with the 3:1 hard negative mining trick described above. It works, but it has the feel of a tuned heuristic, and the ratio is one more hyperparameter.
In 2017 Tsung-Yi Lin and colleagues at Facebook AI Research published "Focal Loss for Dense Object Detection", which introduced focal loss and the RetinaNet detector. Focal loss reshapes cross entropy with a modulating factor (1 - p_t)^gamma so that easy, well classified examples contribute much less to the loss than hard ones. This lets the network train on every default box without any sampling, and it pushed single stage accuracy past two stage methods on COCO for the first time. RetinaNet effectively replaced the SSD style sampling heuristic with a principled loss design.
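The effect of the modulating factor is easy to see numerically. The sketch below implements the binary form of focal loss without the optional class balancing weight. The function names and the default `gamma = 2.0` are assumptions for illustration.

```python
import math


def cross_entropy(p, y):
    """Binary cross entropy for predicted object probability p and label y."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0):
    """Binary focal loss: cross entropy scaled by (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# An easy background box: the model already assigns it 1% object probability.
easy_ce, easy_fl = cross_entropy(0.01, 0), focal_loss(0.01, 0)
# A hard background box: the model wrongly assigns it 90% object probability.
hard_ce, hard_fl = cross_entropy(0.9, 0), focal_loss(0.9, 0)

print(easy_fl / easy_ce)  # about 1e-4: the easy example is almost silenced
print(hard_fl / hard_ce)  # about 0.81: the hard example keeps most of its weight
```

Setting `gamma` to 0 recovers plain cross entropy, which makes the loss a smooth generalization rather than a separate sampling stage.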
Later anchor free detectors such as CenterNet (2019), FCOS (2019) and corner based detectors removed default boxes entirely, and the transformer based DETR (2020) reframed detection as a set prediction problem with bipartite matching. Those follow up directions all owe something to the dense prediction view that SSD popularized.
SSD remains a practical choice in scenarios that prize predictability, low latency and easy deployment over the absolute best accuracy.
| Domain | Typical use |
|---|---|
| Mobile and embedded vision | MobileNet-SSD and SSDLite for phones, smart cameras, doorbells, drones |
| Robotics | Grasping pipelines, navigation, semantic mapping where a fixed object vocabulary is enough |
| Autonomous driving (early) | Vehicle, pedestrian and sign detection in research stacks before transformer detectors took over |
| Surveillance and security | Person and vehicle detection at the edge, sometimes paired with a tracking module |
| Augmented reality | Lightweight on device detection of common objects |
| Industrial inspection | Defect detection on conveyor lines where the class set is small and labeled data is plentiful |
| Education and tutorials | SSD is a teaching staple in computer vision courses; the architecture is small enough to follow |
A common production pattern is to fine tune a MobileNet-SSD or SSDLite model on a custom dataset of a few thousand images, export it to TensorFlow Lite or ONNX, and run inference on a Raspberry Pi, Jetson Nano or mobile phone, where the achievable frame rate depends heavily on the device and on any available accelerator.
SSD sits in a long lineage of detectors. The table below places it against other widely cited methods. Numbers are approximate and depend on backbone and training details, but they give a useful sense of the design space.
| Detector | Year | Stage | Anchor based | Reported FPS | COCO mAP@[0.5:0.95] | Key innovation |
|---|---|---|---|---|---|---|
| Faster R-CNN | 2015 | Two | Yes | 7 (VGG-16) | ~21.9 (VGG) | Region Proposal Network, end to end RoI based detection |
| SSD | 2015 | One | Yes | 22 to 59 | 23.2 to 26.8 | Multi scale dense prediction with default boxes |
| YOLO v1 | 2016 | One | No (grid) | 45 | n/a | Whole image regression of boxes from a coarse grid |
| YOLO v2 | 2017 | One | Yes | 40 to 90 | ~21.6 | Anchor boxes, batch normalization, multi scale training |
| Mask R-CNN | 2017 | Two | Yes | 5 | ~37.1 | Adds a parallel mask branch with RoIAlign |
| RetinaNet | 2017 | One | Yes | 5 to 11 | ~39.1 | Focal loss removes the need for hard negative mining |
| YOLO v3 | 2018 | One | Yes | 20 to 50 | 33.0 | Darknet-53 backbone with multi scale predictions |
| CenterNet | 2019 | One | No (anchor free) | 28 to 142 | ~42.1 | Predict object as a center keypoint plus size offsets |
| YOLO v4 / v5 | 2020 | One | Yes | 30 to 140 | ~43 to 50 | CSPNet backbones, mosaic augmentation, careful tricks |
| DETR | 2020 | One (set prediction) | No | ~28 | ~42 to 44 | Transformer encoder decoder with bipartite matching loss |
| YOLO v8 / v10 | 2023+ | One | Mostly anchor free | 100+ on T4 | ~45 to 54 | Decoupled heads, anchor free, end to end NMS removal in v10 |
The headline takeaway is that SSD established the design pattern (single stage, multi scale, dense anchors) that dominated single stage detection until the focal loss and anchor free waves arrived.
SSD has been ported to almost every deep learning framework. The most commonly used implementations include:
| Implementation | Notes |
|---|---|
| weiliu89/caffe (branch ssd) | The original Caffe implementation by Wei Liu, considered the canonical reference |
| TensorFlow Object Detection API | Maintained by Google; ships pretrained MobileNet-SSD, ResNet-SSD and SSDLite checkpoints in its model zoo |
| torchvision.models.detection.ssd300_vgg16 | Official PyTorch port; a faithful SSD300 with a VGG-16 backbone |
| torchvision.models.detection.ssdlite320_mobilenet_v3_large | Mobile variant in torchvision with a 320x320 input and a MobileNet V3 Large backbone |
| MMDetection (OpenMMLab) | Provides multiple SSD variants and benchmarks within a unified detection framework |
| OpenCV DNN | Reads MobileNet-SSD Caffe and TensorFlow models, frequently used in C++ and Python deployment |
| Many community PyTorch ports | amdegroot/ssd.pytorch, lufficc/SSD, qfgaohao/pytorch-ssd and others |
For mobile deployment, SSDLite combined with a MobileNet V2 or V3 backbone is the most common choice; it converts cleanly to TensorFlow Lite and to Core ML for iOS.
The original SSD paper has been cited tens of thousands of times and is one of the most cited object detection papers. Its influence shows up in several distinct ways.
The field has moved on in benchmark accuracy, but the SSD architecture is still common in production. When you need to detect a fixed set of object classes on a small device with a tight latency budget, an SSDLite model with a MobileNet backbone is often still the right answer.