DETR (DEtection TRansformer) is a pioneering object detection model that reformulates the detection task as a direct set prediction problem using a transformer architecture. Introduced by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko from Facebook AI Research (FAIR) in 2020, DETR was presented at the European Conference on Computer Vision (ECCV 2020). The model eliminates the need for many hand-designed components that traditional detectors rely on, such as anchor generation, non-maximum suppression (NMS), and region proposal networks (RPNs). By treating object detection as a set prediction problem and using bipartite matching with the Hungarian algorithm, DETR provides a clean, end-to-end trainable pipeline that matches ground-truth objects to predictions in a one-to-one fashion.
DETR demonstrated accuracy and run-time performance on par with the highly optimized Faster R-CNN baseline on the COCO benchmark, while using a conceptually simpler design. Its publication sparked a wave of follow-up research that has produced numerous variants, including Deformable DETR, DINO, RT-DETR, and Co-DETR, each addressing specific limitations of the original model. The DETR family of detectors has since reshaped the object detection landscape and demonstrated that transformers can serve as a universal architecture for computer vision tasks.
Before DETR, the dominant paradigm in object detection relied on multi-stage or single-stage pipelines built around convolutional neural networks. Two-stage detectors like Faster R-CNN first generated region proposals using a Region Proposal Network (RPN), then classified and refined each proposal independently. Single-stage detectors such as YOLO and SSD predicted bounding boxes directly from feature maps at multiple scales. Both approaches depended heavily on hand-crafted components: anchor boxes tiled over the image, non-maximum suppression to remove duplicate detections, and, in the two-stage case, region proposal networks.
These components required careful tuning of hyperparameters (anchor sizes, aspect ratios, NMS thresholds, IoU thresholds) and introduced implicit assumptions about object geometry. The DETR authors argued that a simpler approach was possible: directly predict a fixed-size set of detections using a transformer, then match predictions to ground truth using bipartite matching. This would remove the need for anchors, NMS, and many other hand-crafted heuristics.
The set prediction formulation was inspired by earlier work on set-based losses for multi-label classification and keypoint detection, but DETR was the first to apply it successfully at scale for general object detection.
DETR's architecture consists of three main components: a CNN backbone for feature extraction, a transformer encoder-decoder for global reasoning, and feed-forward network (FFN) prediction heads for classification and bounding box regression.
The backbone is a standard ResNet (either ResNet-50 or ResNet-101) pretrained on ImageNet. Given an input image of size 3 x H x W, the backbone produces a lower-resolution feature map of shape C x H/32 x W/32, where C is typically 2048 for ResNet. A 1x1 convolution then reduces the channel dimension from C to a smaller dimension d (256 by default), producing a feature map of shape d x H/32 x W/32. This feature map is flattened into a sequence of d-dimensional feature vectors, each corresponding to a spatial location in the downsampled image.
The authors also experimented with a DC5 (dilated C5) variant, which uses dilated convolutions in the final stage of ResNet to produce a stride-16 feature map instead of stride-32. This increases spatial resolution at the cost of higher computation, improving performance on small objects.
Since transformers are permutation-invariant, spatial positional information must be explicitly added. DETR supplements the flattened feature sequence with fixed sinusoidal positional encodings, similar to those used in the original Transformer model for natural language processing. These encodings are two-dimensional, separately encoding the x and y coordinates of each spatial position using sine and cosine functions with different frequencies. The positional encodings are added to the input at every attention layer in the encoder, ensuring that spatial information is preserved throughout the network.
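A minimal NumPy sketch of such 2D sinusoidal embeddings is shown below. It follows the scheme described above (half the channels encode y, half encode x, with sine/cosine pairs at geometrically spaced frequencies); the exact channel ordering and normalization in the official DETR codebase differ in details.

```python
import numpy as np

def sine_pos_embed_2d(h, w, d=256, temperature=10000.0):
    """Fixed 2D sinusoidal positional embeddings for an h x w feature map.

    The first d/2 channels encode the y coordinate, the rest encode x.
    """
    half = d // 2
    # one frequency per channel pair, as in the original Transformer
    dim_t = temperature ** (2 * (np.arange(half) // 2) / half)
    y = np.arange(h, dtype=float)[:, None, None] / dim_t   # (h, 1, half)
    x = np.arange(w, dtype=float)[None, :, None] / dim_t   # (1, w, half)
    # sine on even channels, cosine on odd channels
    pos_y = np.empty((h, 1, half))
    pos_y[..., 0::2] = np.sin(y[..., 0::2])
    pos_y[..., 1::2] = np.cos(y[..., 1::2])
    pos_x = np.empty((1, w, half))
    pos_x[..., 0::2] = np.sin(x[..., 0::2])
    pos_x[..., 1::2] = np.cos(x[..., 1::2])
    pos = np.concatenate([np.broadcast_to(pos_y, (h, w, half)),
                          np.broadcast_to(pos_x, (h, w, half))], axis=-1)
    return pos  # (h, w, d), added to the flattened features
```

Because the y-channels are broadcast across columns, two positions in the same row share the first half of their embedding, which is what lets attention layers recover relative spatial structure.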
The encoder is a standard transformer encoder with multi-head self-attention and feed-forward layers. It takes the sequence of positional-encoded image features as input and applies global self-attention, allowing every spatial location to attend to every other location. This global reasoning is critical for disentangling objects, especially in scenes with occlusion or overlapping instances. The default encoder configuration uses 6 layers, each with 8 attention heads, a hidden dimension of 256, and a feed-forward network (FFN) hidden dimension of 2048, with a dropout rate of 0.1.
The encoder's self-attention has O(n^2) complexity relative to the number of spatial positions n, which is the product of the downsampled height and width. For a typical 800x1066 input image, the feature map is roughly 25x34, yielding about 850 tokens.
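The token count and resulting attention-matrix size can be checked with a few lines of arithmetic (exact rounding at the boundary depends on the backbone's padding, so this is an approximation):

```python
import math

def num_encoder_tokens(height, width, stride=32):
    # flattened feature-map positions entering the encoder
    return math.ceil(height / stride) * math.ceil(width / stride)

tokens = num_encoder_tokens(800, 1066)   # 25 * 34 = 850 tokens
attn_entries = tokens ** 2               # entries in one self-attention map
```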
The decoder follows the standard transformer decoder architecture, consisting of multi-head self-attention, multi-head cross-attention (attending to encoder outputs), and feed-forward layers. The key difference from standard sequence-to-sequence transformers is that DETR decodes all N objects in parallel rather than autoregressively. The decoder takes as input a set of N learned positional embeddings called object queries. The default decoder uses 6 layers, matching the encoder configuration.
Object queries are a fixed set of N learned embedding vectors (where N = 100 by default) that serve as input slots for the decoder. Each object query is responsible for detecting at most one object. During training, these queries learn to specialize in attending to different regions and object types through the self-attention and cross-attention mechanisms. The object queries do not correspond to specific locations or regions in the image at initialization; they are randomly initialized learned parameters that develop spatial and semantic specialization through training.
The N output embeddings from the decoder are independently decoded into box coordinates and class labels by the prediction heads. If fewer than N objects are present in the image, the remaining slots are assigned a special "no object" (background) class.
Each decoder output embedding is passed through two separate feed-forward networks: a linear layer that produces class probabilities via a softmax (including the special "no object" class), and a 3-layer MLP with ReLU activations that predicts the normalized center coordinates, height, and width of the bounding box.
The bounding box predictions use normalized coordinates relative to the image dimensions, eliminating the need for anchor-based parameterization.
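The two heads can be sketched in NumPy as follows. The random matrices stand in for learned weights and the class count follows COCO; this illustrates only the shapes and activations, not trained behavior.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_queries, num_classes = 256, 100, 91   # COCO class slots; +1 for "no object"

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# random stand-ins for learned parameters (illustration only)
W_cls = rng.normal(0.0, 0.02, (d, num_classes + 1))
W1 = rng.normal(0.0, 0.02, (d, d))
W2 = rng.normal(0.0, 0.02, (d, d))
W3 = rng.normal(0.0, 0.02, (d, 4))

decoder_out = rng.normal(size=(num_queries, d))   # N decoder output embeddings

class_probs = softmax(decoder_out @ W_cls)        # linear head + softmax
h = np.maximum(decoder_out @ W1, 0.0)             # 3-layer MLP with ReLU
h = np.maximum(h @ W2, 0.0)
boxes = 1.0 / (1.0 + np.exp(-(h @ W3)))           # sigmoid -> normalized (cx, cy, w, h)
```

The sigmoid keeps every coordinate in [0, 1], i.e. relative to the image size, which is what removes the need for anchor-based parameterization.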
DETR's training procedure relies on a bipartite matching loss that establishes a one-to-one correspondence between predictions and ground-truth objects. This is the core mechanism that eliminates the need for NMS and anchors.
Given a set of N predictions and a set of ground-truth objects (padded with "no object" entries to reach N), the model finds the optimal one-to-one assignment using the Hungarian algorithm. The matching cost for pairing prediction i with ground-truth j combines the negative predicted probability of ground-truth class j, an L1 distance between the predicted and ground-truth box coordinates, and a generalized IoU (GIoU) term, each scaled by a weighting coefficient.
The Hungarian algorithm finds the permutation that minimizes the total matching cost across all pairs. This bipartite matching ensures that each ground-truth object is assigned to exactly one prediction, eliminating the need for NMS or duplicate suppression. The matching is computed once per forward pass and does not produce gradients; it simply determines which prediction is responsible for which ground-truth object.
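A small sketch of this matching using SciPy's `linear_sum_assignment` (one implementation of this assignment problem); the cost values are made up for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: 5 predictions (rows) vs. 3 ground-truth objects (columns).
# In DETR each entry combines -class probability, L1 box distance, and a
# GIoU term; the numbers here are illustrative.
cost = np.array([
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.6],
    [0.5, 0.5, 0.1],
    [0.3, 0.9, 0.9],
    [0.8, 0.2, 0.4],
])

pred_idx, gt_idx = linear_sum_assignment(cost)  # optimal one-to-one assignment
# Each ground truth gets exactly one prediction; the two unmatched
# predictions are later supervised with the "no object" class.
```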
Once the optimal matching is found, the Hungarian loss is computed as the sum over all matched pairs of a negative log-likelihood classification loss and a box regression loss combining L1 and GIoU terms. The classification term for slots matched to "no object" is down-weighted by a factor of 10 to compensate for class imbalance.
The L1 and GIoU losses are complementary: L1 loss penalizes absolute coordinate errors, while GIoU loss is scale-invariant and penalizes poor overlap regardless of box size.
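A compact sketch of the GIoU computation makes the scale-invariance concrete. GIoU equals IoU minus a penalty based on the smallest box enclosing both inputs, so it stays informative (negative, but with a gradient) even when boxes do not overlap:

```python
def giou(a, b):
    """Generalized IoU for axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # smallest box enclosing both inputs
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    return inter / union - (enclose - union) / enclose

# perfect overlap -> GIoU = 1 (loss 1 - GIoU = 0); scaling both boxes by a
# constant leaves GIoU unchanged, unlike the raw L1 coordinate error
```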
DETR also uses auxiliary decoding losses to improve training convergence. Prediction heads (with shared FFN parameters) are attached after each decoder layer, and the Hungarian matching and loss are computed independently at each layer. The final loss is the sum of all per-layer losses. This provides intermediate supervision to the decoder layers and helps gradients flow more effectively through the deep network.
DETR was evaluated on the COCO 2017 object detection benchmark (COCO val5k). The models were trained for 300 epochs (short schedule) or 500 epochs (long schedule), with the learning rate dropped by a factor of 10 after 200 or 400 epochs, respectively. Training used 16 V100 GPUs with a batch size of 64 (4 images per GPU) and took approximately 3 days for the 300-epoch schedule.
The following table summarizes the main results reported in the original paper (Table 1), comparing DETR with Faster R-CNN baselines. The "+" suffix indicates enhanced Faster R-CNN models trained with the 9x schedule and additional augmentations (GIoU loss, random crop training).
| Model | Backbone | GFLOPs | Params | FPS | AP | AP50 | AP75 | AP_S | AP_M | AP_L |
|---|---|---|---|---|---|---|---|---|---|---|
| Faster R-CNN-DC5 | R50 | 320 | 166M | 16 | 39.0 | 60.5 | 42.3 | 21.4 | 43.5 | 52.5 |
| Faster R-CNN-FPN | R50 | 180 | 42M | 26 | 40.2 | 61.0 | 43.8 | 24.2 | 43.5 | 52.0 |
| Faster R-CNN-FPN+ | R50 | 180 | 42M | 26 | 42.0 | 62.1 | 45.5 | 26.6 | 45.4 | 53.4 |
| Faster R-CNN-FPN | R101 | 246 | 60M | 20 | 42.0 | 62.5 | 45.9 | 25.2 | 45.6 | 54.6 |
| Faster R-CNN-FPN+ | R101 | 246 | 60M | 20 | 44.0 | 63.9 | 47.8 | 27.2 | 48.1 | 56.0 |
| DETR | R50 | 86 | 41M | 28 | 42.0 | 62.4 | 44.2 | 20.5 | 45.8 | 61.1 |
| DETR-DC5 | R50 | 187 | 41M | 12 | 43.3 | 63.1 | 45.9 | 22.5 | 47.3 | 61.1 |
| DETR | R101 | 152 | 60M | 20 | 43.5 | 63.8 | 46.4 | 21.9 | 48.0 | 61.8 |
| DETR-DC5 | R101 | 253 | 60M | 10 | 44.9 | 64.7 | 47.7 | 23.7 | 49.5 | 62.3 |
Several observations stand out from these results:
Comparable overall AP: DETR-R50 achieved 42.0 AP, matching Faster R-CNN-FPN+ with the same backbone, while using only 86 GFLOPs (less than half of Faster R-CNN-FPN's 180 GFLOPs) and a comparable 41M parameter count.
Superior large-object detection: DETR showed a dramatic improvement in AP_L. DETR-R50 achieved 61.1 AP_L versus 52.0 for Faster R-CNN-FPN, a gain of +9.1 points. This advantage comes from the transformer encoder's global self-attention, which captures long-range dependencies across the entire image.
Weaker small-object detection: DETR-R50 achieved only 20.5 AP_S compared to 24.2 for Faster R-CNN-FPN, a gap of -3.7 points. The coarse stride-32 feature map limits the model's ability to represent small objects. The DC5 variant partially mitigates this (22.5 AP_S) but at a significant computational cost.
Efficient computation: DETR-R50 runs at 28 FPS with only 86 GFLOPs, making it computationally efficient. The transformer component adds only about 17.5M parameters on top of the roughly 23.5M-parameter ResNet-50 backbone, for 41M in total.
DETR was also extended to panoptic segmentation by adding a mask prediction head on top of the decoder outputs. This head generates binary masks for each detected object using a multi-head attention mechanism followed by an FPN-like architecture. DETR achieved competitive Panoptic Quality (PQ) scores on COCO val5k:
| Model | Backbone | PQ | SQ | RQ | PQ_th | PQ_st | AP_mask |
|---|---|---|---|---|---|---|---|
| DETR | R50 | 43.4 | 79.3 | 53.8 | 48.2 | 36.3 | 31.1 |
| DETR-DC5 | R50 | 44.6 | 79.8 | 55.0 | 49.4 | 37.3 | 31.9 |
| DETR | R101 | 45.1 | 79.9 | 55.5 | 50.5 | 37.0 | 33.0 |
These results demonstrated the versatility of the DETR framework, showing that the same set prediction approach could be extended to segmentation tasks with minimal architectural changes.
Despite its elegant design, the original DETR had several notable limitations:
Slow training convergence: DETR required approximately 500 epochs to fully converge on COCO, compared to only 12 to 36 epochs for Faster R-CNN. This slow convergence was largely attributed to the difficulty of learning cross-attention patterns from scratch and the instability of bipartite matching in early training stages. The cross-attention mechanism relies heavily on content embeddings, while spatial embeddings contribute minimally, increasing the demand for high-quality content embeddings and thus raising the training difficulty.
Poor performance on small objects: The coarse spatial resolution of the feature map (stride 32) made it difficult to detect small objects. While the DC5 variant (using dilated convolutions at stride 16) improved small-object performance, it came at a significant computational cost (roughly doubling GFLOPs).
Quadratic complexity of self-attention: The global self-attention mechanism has O(n^2) complexity with respect to the number of spatial positions, making it expensive for high-resolution feature maps. This prevents DETR from using multi-scale features or higher-resolution inputs without prohibitive cost.
Fixed number of object queries: The fixed N = 100 queries placed an upper limit on the number of detectable objects per image, which could be problematic for dense detection scenarios.
These limitations motivated a rich body of follow-up work that has progressively improved DETR's training efficiency, detection accuracy, and computational cost.
Since its introduction, DETR has inspired a large family of improved models. The following sections cover the most influential variants.
Deformable DETR was proposed by Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Published as an oral presentation at the International Conference on Learning Representations (ICLR) 2021, it addressed DETR's slow convergence and difficulty with multi-scale features.
The core innovation is the deformable attention module, inspired by deformable convolutions. Instead of attending to all spatial locations (as in standard self-attention), deformable attention attends to only a small set of K sampling points around a learned reference point. By default, K = 4 sampling points per attention head per feature level. The sampling offsets are predicted by a linear layer applied to the query feature, making the attention pattern data-dependent and spatially adaptive.
The multi-scale deformable attention variant extends this to operate across multiple feature map resolutions simultaneously. Given L feature map levels (typically from a multi-scale feature extractor), each attention head samples K points from each of the L levels, allowing the model to aggregate information across scales without the computational overhead of full attention over all levels.
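The sampling idea can be illustrated with a toy single-head, single-scale sketch. The real implementation runs over multiple heads and feature levels with a custom CUDA kernel, and predicts the offsets and attention logits from the query via linear layers; here they are passed in directly as assumed inputs.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Sample feat (H, W, C) at a fractional (x, y) with bilinear interpolation."""
    H, W, _ = feat.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    def at(yy, xx):  # zero padding outside the feature map
        if 0 <= yy < H and 0 <= xx < W:
            return feat[yy, xx]
        return np.zeros(feat.shape[-1])
    return ((1 - wx) * (1 - wy) * at(y0, x0) + wx * (1 - wy) * at(y0, x1)
            + (1 - wx) * wy * at(y1, x0) + wx * wy * at(y1, x1))

def deformable_attn_single_head(feat, ref_xy, offsets, attn_logits):
    """One query, one head: sample K points around the reference point and
    combine them with softmax attention weights (cost O(K), not O(H*W))."""
    w = np.exp(attn_logits - attn_logits.max())
    w = w / w.sum()                      # softmax over the K sampling points
    samples = [bilinear_sample(feat, ref_xy[0] + dx, ref_xy[1] + dy)
               for dx, dy in offsets]
    return sum(wk * s for wk, s in zip(w, samples))
```

With K = 4 sampling points per head per level, the per-query cost is constant regardless of the feature-map size, which is what makes multi-scale features affordable.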
Deformable DETR also introduced iterative bounding box refinement, where each decoder layer refines the bounding box predictions from the previous layer, and a two-stage variant that generates region proposals from the encoder output before passing them to the decoder.
Key results on COCO val2017 with a ResNet-50 backbone:
| Variant | Epochs | Params | GFLOPs | AP | AP_S | AP_M | AP_L |
|---|---|---|---|---|---|---|---|
| Deformable DETR (single-scale) | 50 | 34M | 78 | 39.4 | 20.6 | 43.0 | 55.5 |
| Deformable DETR (single-scale DC5) | 50 | 34M | 128 | 41.5 | 24.1 | 45.3 | 56.0 |
| Deformable DETR (multi-scale) | 50 | 40M | 173 | 44.5 | 27.1 | 47.6 | 59.6 |
| + Iterative box refinement | 50 | 41M | 173 | 46.2 | 28.3 | 49.2 | 61.5 |
| + Two-stage | 50 | 41M | 173 | 46.9 | 29.6 | 50.1 | 61.6 |
Deformable DETR achieved better performance than the original DETR while converging in roughly 50 epochs (compared to 500 for DETR), a 10x improvement in training efficiency. The multi-scale variant also significantly improved small-object detection, raising AP_S from 20.5 (DETR-R50) to 27.1.
Conditional DETR, by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, and Jingdong Wang, addressed slow convergence by introducing a conditional cross-attention mechanism. Instead of learning cross-attention patterns entirely from scratch, the model conditions the spatial attention map on the decoder embedding, forming a conditional spatial query that provides a spatial prior for localization. This decouples content and spatial attention, making the cross-attention more effective from early training stages.
Conditional DETR converges 6.7x faster than DETR for ResNet-50 and ResNet-101 backbones, and 10x faster for DC5 variants. With a ResNet-50 backbone and 50 training epochs, Conditional DETR achieves 41.0 AP on COCO val, outperforming DETR trained for 50 epochs (34.8 AP) by a large margin and approaching DETR's 500-epoch result (42.0 AP).
DAB-DETR (Dynamic Anchor Boxes DETR), by Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang, reinterpreted object queries as dynamic anchor boxes. Instead of using learned embeddings with no explicit geometric meaning, DAB-DETR uses 4D anchor box coordinates (x, y, w, h) as positional queries, which are dynamically updated layer by layer through the decoder. This explicit geometric parameterization made the queries more interpretable and improved performance. DAB-DETR-R50 achieves 42.2 AP at 50 epochs, and the DC5 variant with 3 anchor patterns reaches 45.7 AP.
DN-DETR (Denoising DETR), by Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M. Ni, and Lei Zhang, identified the instability of bipartite matching as a primary cause of slow convergence. The solution was a denoising training approach: during training, noised versions of ground-truth bounding boxes and class labels are fed directly into the decoder alongside the regular object queries. The model learns to reconstruct the clean ground-truth from the noised inputs, stabilizing the matching process and providing a shortcut for learning. DN-DETR-R50 achieves 44.4 AP at 50 epochs, and when combined with deformable attention, reaches 48.6 AP in 12 epochs.
DINO (DETR with Improved DeNoising Anchor Boxes) was proposed by Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, and Heung-Yeung Shum. Published at ICLR 2023, DINO built upon Deformable DETR, DAB-DETR, and DN-DETR, combining and improving their core ideas into a unified framework.
DINO introduced three key improvements:
Contrastive denoising training (CDN): Unlike DN-DETR which only used positive (noised ground-truth) samples for denoising, DINO adds both positive and negative samples. Negative samples are generated by adding larger noise to ground-truth boxes so they fall outside the matching threshold, teaching the model to explicitly reject false positives.
Mixed query selection: A hybrid approach to initializing anchor boxes, where the positional part of queries is initialized from encoder output features (top-scoring proposals), while the content part remains learned. This provides better initialization than either fully learned or fully selected queries.
Look forward twice: An improved box prediction scheme where each decoder layer refines the box prediction from the previous layer, with gradients flowing through both the current and previous layer's predictions during backpropagation.
DINO achieved the following results on COCO:
| Configuration | Backbone | Epochs | AP |
|---|---|---|---|
| DINO (4 scales) | R50 | 12 | 49.0 |
| DINO (5 scales) | R50 | 12 | 49.4 |
| DINO (4 scales) | R50 | 24 | 50.4 |
| DINO (5 scales) | R50 | 24 | 51.3 |
| DINO (4 scales) | R50 | 36 | 50.9 |
| DINO (4 scales) | Swin-L | 12 | 56.8 |
| DINO (4 scales) | Swin-L | 36 | 58.0 |
| DINO (5 scales) | Swin-L | 36 | 58.5 |
| DINO (Objects365 pretrain) | Swin-L | - | 63.2 |
DINO was the first DETR-like model to achieve state-of-the-art results on the COCO leaderboard, reaching 63.2 AP on COCO val2017 and 63.3 AP on COCO test-dev with a Swin-L backbone and Objects365 pretraining.
RT-DETR (Real-Time DEtection TRansformer) was developed by Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, and Jie Chen from Baidu Inc. Published at CVPR 2024 with the paper title "DETRs Beat YOLOs on Real-time Object Detection," RT-DETR is the first real-time end-to-end object detector based on the DETR framework.
RT-DETR introduced two main innovations:
Efficient hybrid encoder: Instead of processing multi-scale features with a single heavyweight encoder, RT-DETR decouples intra-scale feature interaction and cross-scale feature fusion. Intra-scale interaction uses efficient attention within each scale level, while cross-scale fusion combines information across levels using a lighter-weight mechanism. This substantially reduces computational cost.
IoU-aware query selection: Rather than selecting the top-K encoder features by classification score alone, RT-DETR trains an IoU prediction branch and uses it to select queries with both high classification confidence and high localization quality.
RT-DETR also supports flexible inference speed adjustment by varying the number of decoder layers at test time, without retraining. This allows practitioners to trade accuracy for speed depending on deployment requirements.
Key performance on COCO val2017 (trained with a 6x / 72-epoch schedule):
| Model | Backbone | AP | FPS (T4) |
|---|---|---|---|
| RT-DETR-R18 | ResNet-18 | 46.5 | 217 |
| RT-DETR-R50 | ResNet-50 | 53.1 | 108 |
| RT-DETR-R101 | ResNet-101 | 54.3 | 74 |
| RT-DETR-R50 (Objects365) | ResNet-50 | 55.3 | 108 |
| RT-DETR-R101 (Objects365) | ResNet-101 | 56.2 | 74 |
RT-DETR-R50 outperformed DINO-R50 by 2.2 AP while running roughly 21 times faster. It also surpassed advanced YOLO detectors in both speed and accuracy, marking the first time a DETR-based model matched or beat YOLO-series detectors in the real-time regime.
Co-DETR (Collaborative Hybrid Assignments Training for DETRs) was proposed by Zhuofan Zong, Guanglu Song, and Yu Liu, and presented at ICCV 2023. The paper identified that the one-to-one matching in DETR provides sparse supervision to the encoder, limiting the encoder's ability to learn discriminative features.
Co-DETR addresses this by training multiple auxiliary detection heads (such as ATSS and Faster R-CNN heads) with one-to-many label assignments alongside the primary DETR decoder during training. These auxiliary heads provide denser supervision signals to the encoder. Additionally, positive coordinates extracted from the auxiliary heads are used to create customized positive queries that improve decoder attention learning. At inference time, the auxiliary heads are discarded, so Co-DETR adds no additional computational cost.
Co-DETR achieved notable results: 49.5 AP with a ResNet-50 backbone in only 12 training epochs, and 66.0 AP on COCO test-dev with a ViT-L backbone.
The 66.0 AP on COCO test-dev made Co-DETR with ViT-L one of the highest-performing detectors at the time.
The following table provides an overview of the major DETR variants, their key innovations, and representative performance figures on COCO val2017 with a ResNet-50 backbone.
| Model | Year | Venue | Key Innovation | AP (R50) | Training Epochs |
|---|---|---|---|---|---|
| DETR | 2020 | ECCV 2020 | Set prediction with Hungarian matching | 42.0 | 500 |
| Deformable DETR | 2021 | ICLR 2021 | Multi-scale deformable attention | 46.9* | 50 |
| Conditional DETR | 2021 | ICCV 2021 | Conditional spatial cross-attention | 41.0 | 50 |
| DAB-DETR | 2022 | ICLR 2022 | Dynamic anchor boxes as queries | 42.2 | 50 |
| DN-DETR | 2022 | CVPR 2022 | Query denoising training | 44.4 | 50 |
| DINO | 2023 | ICLR 2023 | Contrastive denoising, mixed query selection | 49.0 | 12 |
| Co-DETR | 2023 | ICCV 2023 | Collaborative hybrid assignments training | 49.5 | 12 |
| RT-DETR | 2024 | CVPR 2024 | Efficient hybrid encoder, real-time inference | 53.1 | 72 |
* Deformable DETR result of 46.9 includes both iterative box refinement and two-stage variant. The base multi-scale model achieves 44.5 AP.
The following table compares DETR-family models with traditional CNN-based detectors across key dimensions.
| Feature | Faster R-CNN | YOLO (v5/v8) | SSD | DETR Family |
|---|---|---|---|---|
| Detection paradigm | Two-stage (RPN + head) | Single-stage, grid-based | Single-stage, multi-scale anchors | Set prediction with transformer |
| Anchor boxes | Yes | Yes (v5) / Anchor-free (v8) | Yes | No |
| NMS required | Yes | Yes | Yes | No |
| Feature extraction | CNN + FPN | CNN (CSPDarknet/etc.) | VGG / MobileNet | CNN backbone + transformer encoder |
| Global context | Limited (local receptive field) | Limited (local receptive field) | Limited (local receptive field) | Full global attention |
| Small object handling | Good (with FPN) | Good (multi-scale) | Moderate | Improved in later variants |
| Training epochs (COCO) | 12-36 | 300+ | 120+ | 12-500 (variant-dependent) |
| End-to-end | No (requires NMS) | No (requires NMS) | No (requires NMS) | Yes |
DETR's introduction in 2020 represented a paradigm shift in object detection research. Its key contributions and lasting impact include several areas discussed below.
DETR was the first model to demonstrate that competitive object detection could be achieved without anchors, NMS, or region proposals. This simplification reduced the engineering burden of building detection systems and eliminated many hyperparameters that required careful tuning for each dataset.
DETR arrived shortly before the Vision Transformer (ViT) paper (Dosovitskiy et al., 2020), making it one of the earliest successful applications of transformers to core computer vision tasks. DETR demonstrated that the self-attention mechanism could capture global context in images effectively, paving the way for transformer-based architectures across segmentation, tracking, pose estimation, and other vision tasks.
By removing post-processing steps like NMS, DETR created truly end-to-end detection systems that could be trained and deployed as a single differentiable pipeline. This property is particularly valuable for deployment on accelerators and edge devices, where custom post-processing operations can be difficult to optimize.
DETR has generated an extensive family of follow-up models. The DETR paper has accumulated over 16,000 citations as of 2025, reflecting its outsized influence on the field. At least 25 significant DETR variants have been published at top venues. The DETR framework has been extended to panoptic segmentation (Panoptic SegFormer, Mask2Former), instance segmentation (Mask DINO), 3D object detection (3DETR), video object detection, open-set detection (Grounding DINO), and multi-modal understanding. The detrex library from IDEA Research provides a unified research platform for DETR-based algorithms.
While the original DETR merely matched Faster R-CNN, later variants have comprehensively surpassed traditional detectors. DINO was the first DETR-like model to top the COCO leaderboard in 2022. Co-DETR pushed performance to 66.0 AP on COCO test-dev. RT-DETR demonstrated that DETR-based models could beat YOLO detectors in the real-time regime, achieving both higher accuracy and faster inference. These milestones confirm that the DETR paradigm is not merely an academic curiosity but a practical and performant approach to detection.
DETR's set prediction framework has proven highly versatile. It has been adapted for autonomous driving perception (3D detection, tracking), medical image analysis (cell detection, lesion detection), remote sensing (satellite image analysis), and video understanding (action detection, object tracking). The Grounding DINO model, which combines DINO with grounded pre-training, achieved 52.5 AP on COCO in a zero-shot setting without any COCO training data, demonstrating the framework's strength in open-vocabulary detection. Grounding DINO was published at ECCV 2024.
The original DETR implementation was released by Facebook AI Research on GitHub (github.com/facebookresearch/detr) and is written in PyTorch. The codebase is notably compact, with the core model implementation comprising roughly 50 lines of PyTorch code for the transformer components, which the authors highlighted as a key advantage of the approach.
DETR and its major variants are also available through the Hugging Face Transformers library, which provides pretrained weights and easy-to-use inference APIs. The facebook/detr-resnet-50 and facebook/detr-resnet-101 checkpoints are among the most downloaded detection models on the Hugging Face Hub. RT-DETR has been integrated into the Ultralytics framework, making it accessible to practitioners already familiar with the YOLO ecosystem.