DETR (DEtection TRansformer) is a pioneering object detection model that reformulates the detection task as a direct set prediction problem using a transformer architecture. Introduced by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko from Facebook AI Research (FAIR) in 2020, DETR was presented at the European Conference on Computer Vision (ECCV 2020). The model eliminates the need for many hand-designed components that traditional detectors rely on, such as anchor generation, non-maximum suppression (NMS), and region proposal networks (RPNs). By treating object detection as a set prediction problem and using bipartite matching with the Hungarian algorithm, DETR provides a clean, end-to-end trainable pipeline that matches ground-truth objects to predictions in a one-to-one fashion.
DETR demonstrated accuracy and run-time performance on par with the highly optimized Faster R-CNN baseline on the COCO benchmark, while using a conceptually simpler design. Its publication sparked a wave of follow-up research that has produced numerous variants, including Deformable DETR, DINO, RT-DETR, and Co-DETR, each addressing specific limitations of the original model. The DETR family of detectors has since reshaped the object detection landscape and demonstrated that transformers can serve as a universal architecture for computer vision tasks.
Before DETR, the dominant paradigm in object detection relied on multi-stage or single-stage pipelines built around convolutional neural networks. Two-stage detectors like Faster R-CNN first generated region proposals using a Region Proposal Network (RPN), then classified and refined each proposal independently. Single-stage detectors such as YOLO and SSD predicted bounding boxes directly from feature maps at multiple scales. Both approaches depended heavily on hand-crafted components: anchor boxes tiled over the image, non-maximum suppression to remove duplicate detections, and, in the two-stage case, region proposal networks.
These components required careful tuning of hyperparameters (anchor sizes, aspect ratios, NMS thresholds, IoU thresholds) and introduced implicit assumptions about object geometry. The DETR authors argued that a simpler approach was possible: directly predict a fixed-size set of detections using a transformer, then match predictions to ground truth using bipartite matching. This would remove the need for anchors, NMS, and many other hand-crafted heuristics.
The set prediction formulation was inspired by earlier work on set-based losses for multi-label classification and keypoint detection, but DETR was the first to apply it successfully at scale for general object detection.
DETR's architecture consists of three main components: a CNN backbone for feature extraction, a transformer encoder-decoder for global reasoning, and feed-forward network (FFN) prediction heads for classification and bounding box regression.
The backbone is a standard ResNet (either ResNet-50 or ResNet-101) pretrained on ImageNet. Given an input image of size 3 x H x W, the backbone produces a lower-resolution feature map of shape C x H/32 x W/32, where C is typically 2048 for ResNet. A 1x1 convolution then reduces the channel dimension from C to a smaller dimension d (256 by default), producing a feature map of shape d x H/32 x W/32. This feature map is flattened into a sequence of d-dimensional feature vectors, each corresponding to a spatial location in the downsampled image.
The authors also experimented with a DC5 (dilated C5) variant, which uses dilated convolutions in the final stage of ResNet to produce a stride-16 feature map instead of stride-32. This increases spatial resolution at the cost of higher computation, improving performance on small objects.
Since transformers are permutation-invariant, spatial positional information must be explicitly added. DETR supplements the flattened feature sequence with fixed sinusoidal positional encodings, similar to those used in the original Transformer model for natural language processing. These encodings are two-dimensional, separately encoding the x and y coordinates of each spatial position using sine and cosine functions with different frequencies. The positional encodings are added to the input at every attention layer in the encoder, ensuring that spatial information is preserved throughout the network.
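A minimal NumPy sketch of such 2D sinusoidal embeddings is shown below. It follows the scheme described above (half the channels encode y, half encode x, with sine/cosine pairs at geometrically spaced frequencies); the exact channel ordering and normalization in the official DETR codebase differ in details.

```python
import numpy as np

def sine_pos_embed_2d(h, w, d=256, temperature=10000.0):
    """Fixed 2D sinusoidal positional embeddings for an h x w feature map.

    The first d/2 channels encode the y coordinate, the rest encode x.
    """
    half = d // 2
    # one frequency per channel pair, as in the original Transformer
    dim_t = temperature ** (2 * (np.arange(half) // 2) / half)
    y = np.arange(h, dtype=float)[:, None, None] / dim_t   # (h, 1, half)
    x = np.arange(w, dtype=float)[None, :, None] / dim_t   # (1, w, half)
    # sine on even channels, cosine on odd channels
    pos_y = np.empty((h, 1, half))
    pos_y[..., 0::2] = np.sin(y[..., 0::2])
    pos_y[..., 1::2] = np.cos(y[..., 1::2])
    pos_x = np.empty((1, w, half))
    pos_x[..., 0::2] = np.sin(x[..., 0::2])
    pos_x[..., 1::2] = np.cos(x[..., 1::2])
    pos = np.concatenate([np.broadcast_to(pos_y, (h, w, half)),
                          np.broadcast_to(pos_x, (h, w, half))], axis=-1)
    return pos  # (h, w, d), added to the flattened features
```

Because the y-channels are broadcast across columns, two positions in the same row share the first half of their embedding, which is what lets attention layers recover relative spatial structure.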
The encoder is a standard transformer encoder with multi-head self-attention and feed-forward layers. It takes the sequence of positional-encoded image features as input and applies global self-attention, allowing every spatial location to attend to every other location. This global reasoning is critical for disentangling objects, especially in scenes with occlusion or overlapping instances. The default encoder configuration uses 6 layers, each with 8 attention heads, a hidden dimension of 256, and a feed-forward network (FFN) hidden dimension of 2048, with a dropout rate of 0.1.
The encoder's self-attention has O(n^2) complexity relative to the number of spatial positions n, which is the product of the downsampled height and width. For a typical 800x1066 input image, the feature map is roughly 25x34, yielding about 850 tokens.
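The token count and resulting attention-matrix size can be checked with a few lines of arithmetic (exact rounding at the boundary depends on the backbone's padding, so this is an approximation):

```python
import math

def num_encoder_tokens(height, width, stride=32):
    # flattened feature-map positions entering the encoder
    return math.ceil(height / stride) * math.ceil(width / stride)

tokens = num_encoder_tokens(800, 1066)   # 25 * 34 = 850 tokens
attn_entries = tokens ** 2               # entries in one self-attention map
```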
The decoder follows the standard transformer decoder architecture, consisting of multi-head self-attention, multi-head cross-attention (attending to encoder outputs), and feed-forward layers. The key difference from standard sequence-to-sequence transformers is that DETR decodes all N objects in parallel rather than autoregressively. The decoder takes as input a set of N learned positional embeddings called object queries. The default decoder uses 6 layers, matching the encoder configuration.
Object queries are a fixed set of N learned embedding vectors (where N = 100 by default) that serve as input slots for the decoder. Each object query is responsible for detecting at most one object. During training, these queries learn to specialize in attending to different regions and object types through the self-attention and cross-attention mechanisms. The object queries do not correspond to specific locations or regions in the image at initialization; they are randomly initialized learned parameters that develop spatial and semantic specialization through training.
The N output embeddings from the decoder are independently decoded into box coordinates and class labels by the prediction heads. If fewer than N objects are present in the image, the remaining slots are assigned a special "no object" (background) class.
Each decoder output embedding is passed through two separate feed-forward networks: a linear layer that produces class probabilities via a softmax (including the special "no object" class), and a 3-layer MLP with ReLU activations that predicts the normalized center coordinates, height, and width of the bounding box.
The bounding box predictions use normalized coordinates relative to the image dimensions, eliminating the need for anchor-based parameterization.
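The two heads can be sketched in NumPy as follows. The random matrices stand in for learned weights and the class count follows COCO; this illustrates only the shapes and activations, not trained behavior.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_queries, num_classes = 256, 100, 91   # COCO class slots; +1 for "no object"

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# random stand-ins for learned parameters (illustration only)
W_cls = rng.normal(0.0, 0.02, (d, num_classes + 1))
W1 = rng.normal(0.0, 0.02, (d, d))
W2 = rng.normal(0.0, 0.02, (d, d))
W3 = rng.normal(0.0, 0.02, (d, 4))

decoder_out = rng.normal(size=(num_queries, d))   # N decoder output embeddings

class_probs = softmax(decoder_out @ W_cls)        # linear head + softmax
h = np.maximum(decoder_out @ W1, 0.0)             # 3-layer MLP with ReLU
h = np.maximum(h @ W2, 0.0)
boxes = 1.0 / (1.0 + np.exp(-(h @ W3)))           # sigmoid -> normalized (cx, cy, w, h)
```

The sigmoid keeps every coordinate in [0, 1], i.e. relative to the image size, which is what removes the need for anchor-based parameterization.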
DETR's training procedure relies on a bipartite matching loss that establishes a one-to-one correspondence between predictions and ground-truth objects. This is the core mechanism that eliminates the need for NMS and anchors.
Given a set of N predictions and a set of ground-truth objects (padded with "no object" entries to reach N), the model finds the optimal one-to-one assignment using the Hungarian algorithm. The matching cost for pairing prediction i with ground-truth j combines the negative predicted probability of ground-truth class j, an L1 distance between the predicted and ground-truth box coordinates, and a generalized IoU (GIoU) term, each scaled by a weighting coefficient.
The Hungarian algorithm finds the permutation that minimizes the total matching cost across all pairs. This bipartite matching ensures that each ground-truth object is assigned to exactly one prediction, eliminating the need for NMS or duplicate suppression. The matching is computed once per forward pass and does not produce gradients; it simply determines which prediction is responsible for which ground-truth object.
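A small sketch of this matching using SciPy's `linear_sum_assignment` (one implementation of this assignment problem); the cost values are made up for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: 5 predictions (rows) vs. 3 ground-truth objects (columns).
# In DETR each entry combines -class probability, L1 box distance, and a
# GIoU term; the numbers here are illustrative.
cost = np.array([
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.6],
    [0.5, 0.5, 0.1],
    [0.3, 0.9, 0.9],
    [0.8, 0.2, 0.4],
])

pred_idx, gt_idx = linear_sum_assignment(cost)  # optimal one-to-one assignment
# Each ground truth gets exactly one prediction; the two unmatched
# predictions are later supervised with the "no object" class.
```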
Once the optimal matching is found, the Hungarian loss is computed as the sum over all matched pairs of a negative log-likelihood classification loss and a box regression loss combining L1 and GIoU terms. The classification term for slots matched to "no object" is down-weighted by a factor of 10 to compensate for class imbalance.
The L1 and GIoU losses are complementary: L1 loss penalizes absolute coordinate errors, while GIoU loss is scale-invariant and penalizes poor overlap regardless of box size.
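A compact sketch of the GIoU computation makes the scale-invariance concrete. GIoU equals IoU minus a penalty based on the smallest box enclosing both inputs, so it stays informative (negative, but with a gradient) even when boxes do not overlap:

```python
def giou(a, b):
    """Generalized IoU for axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # smallest box enclosing both inputs
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    return inter / union - (enclose - union) / enclose

# perfect overlap -> GIoU = 1 (loss 1 - GIoU = 0); scaling both boxes by a
# constant leaves GIoU unchanged, unlike the raw L1 coordinate error
```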
DETR also uses auxiliary decoding losses to improve training convergence. Prediction heads (with shared FFN parameters) are attached after each decoder layer, and the Hungarian matching and loss are computed independently at each layer. The final loss is the sum of all per-layer losses. This provides intermediate supervision to the decoder layers and helps gradients flow more effectively through the deep network.
DETR was evaluated on the COCO 2017 object detection benchmark (COCO val5k). The models were trained for 300 epochs (short schedule) or 500 epochs (long schedule), with the learning rate dropped by a factor of 10 after 200 or 400 epochs, respectively. Training used 16 V100 GPUs with a batch size of 64 (4 images per GPU) and took approximately 3 days for the 300-epoch schedule.
The following table summarizes the main results reported in the original paper (Table 1), comparing DETR with Faster R-CNN baselines. The "+" suffix indicates enhanced Faster R-CNN models trained with the 9x schedule and additional augmentations (GIoU loss, random crop training).
| Model | Backbone | GFLOPs | Params | FPS | AP | AP50 | AP75 | AP_S | AP_M | AP_L |
|---|---|---|---|---|---|---|---|---|---|---|
| Faster R-CNN-DC5 | R50 | 320 | 166M | 16 | 39.0 | 60.5 | 42.3 | 21.4 | 43.5 | 52.5 |
| Faster R-CNN-FPN | R50 | 180 | 42M | 26 | 40.2 | 61.0 | 43.8 | 24.2 | 43.5 | 52.0 |
| Faster R-CNN-FPN+ | R50 | 180 | 42M | 26 | 42.0 | 62.1 | 45.5 | 26.6 | 45.4 | 53.4 |
| Faster R-CNN-FPN | R101 | 246 | 60M | 20 | 42.0 | 62.5 | 45.9 | 25.2 | 45.6 | 54.6 |
| Faster R-CNN-FPN+ | R101 | 246 | 60M | 20 | 44.0 | 63.9 | 47.8 | 27.2 | 48.1 | 56.0 |
| DETR | R50 | 86 | 41M | 28 | 42.0 | 62.4 | 44.2 | 20.5 | 45.8 | 61.1 |
| DETR-DC5 | R50 | 187 | 41M | 12 | 43.3 | 63.1 | 45.9 | 22.5 | 47.3 | 61.1 |
| DETR | R101 | 152 | 60M | 20 | 43.5 | 63.8 | 46.4 | 21.9 | 48.0 | 61.8 |
| DETR-DC5 | R101 | 253 | 60M | 10 | 44.9 | 64.7 | 47.7 | 23.7 | 49.5 | 62.3 |
Several observations stand out from these results:
Comparable overall AP: DETR-R50 achieved 42.0 AP, matching Faster R-CNN-FPN+ with the same backbone, while using only 86 GFLOPs (less than half of Faster R-CNN-FPN's 180 GFLOPs) and a comparable 41M parameter count.
Superior large-object detection: DETR showed a dramatic improvement in AP_L. DETR-R50 achieved 61.1 AP_L versus 52.0 for Faster R-CNN-FPN, a gain of +9.1 points. This advantage comes from the transformer encoder's global self-attention, which captures long-range dependencies across the entire image.
Weaker small-object detection: DETR-R50 achieved only 20.5 AP_S compared to 24.2 for Faster R-CNN-FPN, a gap of -3.7 points. The coarse stride-32 feature map limits the model's ability to represent small objects. The DC5 variant partially mitigates this (22.5 AP_S) but at a significant computational cost.
Efficient computation: DETR-R50 runs at 28 FPS with only 86 GFLOPs, making it computationally efficient. The transformer component adds only about 17.5M parameters on top of the roughly 23.5M-parameter ResNet-50 backbone, for 41M in total.
DETR was also extended to panoptic segmentation by adding a mask prediction head on top of the decoder outputs. This head generates binary masks for each detected object using a multi-head attention mechanism followed by an FPN-like architecture. DETR achieved competitive Panoptic Quality (PQ) scores on COCO val5k:
| Model | Backbone | PQ | SQ | RQ | PQ_th | PQ_st | AP_mask |
|---|---|---|---|---|---|---|---|
| DETR | R50 | 43.4 | 79.3 | 53.8 | 48.2 | 36.3 | 31.1 |
| DETR-DC5 | R50 | 44.6 | 79.8 | 55.0 | 49.4 | 37.3 | 31.9 |
| DETR | R101 | 45.1 | 79.9 | 55.5 | 50.5 | 37.0 | 33.0 |
These results demonstrated the versatility of the DETR framework, showing that the same set prediction approach could be extended to segmentation tasks with minimal architectural changes.
Despite its elegant design, the original DETR had several notable limitations:
Slow training convergence: DETR required approximately 500 epochs to fully converge on COCO, compared to only 12 to 36 epochs for Faster R-CNN. This slow convergence was largely attributed to the difficulty of learning cross-attention patterns from scratch and the instability of bipartite matching in early training stages. The cross-attention mechanism relies heavily on content embeddings, while spatial embeddings contribute minimally, increasing the demand for high-quality content embeddings and thus raising the training difficulty.
Poor performance on small objects: The coarse spatial resolution of the feature map (stride 32) made it difficult to detect small objects. While the DC5 variant (using dilated convolutions at stride 16) improved small-object performance, it came at a significant computational cost (roughly doubling GFLOPs).
Quadratic complexity of self-attention: The global self-attention mechanism has O(n^2) complexity with respect to the number of spatial positions, making it expensive for high-resolution feature maps. This prevents DETR from using multi-scale features or higher-resolution inputs without prohibitive cost.
Fixed number of object queries: The fixed N = 100 queries placed an upper limit on the number of detectable objects per image, which could be problematic for dense detection scenarios.
These limitations motivated a rich body of follow-up work that has progressively improved DETR's training efficiency, detection accuracy, and computational cost.
Since its introduction, DETR has inspired a large family of improved models. The following sections cover the most influential variants.
Deformable DETR was proposed by Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Published as an oral presentation at the International Conference on Learning Representations (ICLR) 2021, it addressed DETR's slow convergence and difficulty with multi-scale features.
The core innovation is the deformable attention module, inspired by deformable convolutions. Instead of attending to all spatial locations (as in standard self-attention), deformable attention attends to only a small set of K sampling points around a learned reference point. By default, K = 4 sampling points per attention head per feature level. The sampling offsets are predicted by a linear layer applied to the query feature, making the attention pattern data-dependent and spatially adaptive.
The multi-scale deformable attention variant extends this to operate across multiple feature map resolutions simultaneously. Given L feature map levels (typically from a multi-scale feature extractor), each attention head samples K points from each of the L levels, allowing the model to aggregate information across scales without the computational overhead of full attention over all levels.
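The sampling idea can be illustrated with a toy single-head, single-scale sketch. The real implementation runs over multiple heads and feature levels with a custom CUDA kernel, and predicts the offsets and attention logits from the query via linear layers; here they are passed in directly as assumed inputs.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Sample feat (H, W, C) at a fractional (x, y) with bilinear interpolation."""
    H, W, _ = feat.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    def at(yy, xx):  # zero padding outside the feature map
        if 0 <= yy < H and 0 <= xx < W:
            return feat[yy, xx]
        return np.zeros(feat.shape[-1])
    return ((1 - wx) * (1 - wy) * at(y0, x0) + wx * (1 - wy) * at(y0, x1)
            + (1 - wx) * wy * at(y1, x0) + wx * wy * at(y1, x1))

def deformable_attn_single_head(feat, ref_xy, offsets, attn_logits):
    """One query, one head: sample K points around the reference point and
    combine them with softmax attention weights (cost O(K), not O(H*W))."""
    w = np.exp(attn_logits - attn_logits.max())
    w = w / w.sum()                      # softmax over the K sampling points
    samples = [bilinear_sample(feat, ref_xy[0] + dx, ref_xy[1] + dy)
               for dx, dy in offsets]
    return sum(wk * s for wk, s in zip(w, samples))
```

With K = 4 sampling points per head per level, the per-query cost is constant regardless of the feature-map size, which is what makes multi-scale features affordable.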
Deformable DETR also introduced iterative bounding box refinement, where each decoder layer refines the bounding box predictions from the previous layer, and a two-stage variant that generates region proposals from the encoder output before passing them to the decoder.
Key results on COCO val2017 with a ResNet-50 backbone:
| Variant | Epochs | Params | GFLOPs | AP | AP_S | AP_M | AP_L |
|---|---|---|---|---|---|---|---|
| Deformable DETR (single-scale) | 50 | 34M | 78 | 39.4 | 20.6 | 43.0 | 55.5 |
| Deformable DETR (single-scale DC5) | 50 | 34M | 128 | 41.5 | 24.1 | 45.3 | 56.0 |
| Deformable DETR (multi-scale) | 50 | 40M | 173 | 44.5 | 27.1 | 47.6 | 59.6 |
| + Iterative box refinement | 50 | 41M | 173 | 46.2 | 28.3 | 49.2 | 61.5 |
| + Two-stage | 50 | 41M | 173 | 46.9 | 29.6 | 50.1 | 61.6 |
Deformable DETR achieved better performance than the original DETR while converging in roughly 50 epochs (compared to 500 for DETR), a 10x improvement in training efficiency. The multi-scale variant also significantly improved small-object detection, raising AP_S from 20.5 (DETR-R50) to 27.1.
Conditional DETR, by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, and Jingdong Wang, addressed slow convergence by introducing a conditional cross-attention mechanism. Instead of learning cross-attention patterns entirely from scratch, the model conditions the spatial attention map on the decoder embedding, forming a conditional spatial query that provides a spatial prior for localization. This decouples content and spatial attention, making the cross-attention more effective from early training stages.
Conditional DETR converges 6.7x faster than DETR for ResNet-50 and ResNet-101 backbones, and 10x faster for DC5 variants. With a ResNet-50 backbone and 50 training epochs, Conditional DETR achieves 41.0 AP on COCO val, outperforming DETR trained for 50 epochs (34.8 AP) by a large margin and approaching DETR's 500-epoch result (42.0 AP).
DAB-DETR (Dynamic Anchor Boxes DETR), by Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang, reinterpreted object queries as dynamic anchor boxes. Instead of using learned embeddings with no explicit geometric meaning, DAB-DETR uses 4D anchor box coordinates (x, y, w, h) as positional queries, which are dynamically updated layer by layer through the decoder. This explicit geometric parameterization made the queries more interpretable and improved performance. DAB-DETR-R50 achieves 42.2 AP at 50 epochs, and the DC5 variant with 3 anchor patterns reaches 45.7 AP.
DN-DETR (Denoising DETR), by Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M. Ni, and Lei Zhang, identified the instability of bipartite matching as a primary cause of slow convergence. The solution was a denoising training approach: during training, noised versions of ground-truth bounding boxes and class labels are fed directly into the decoder alongside the regular object queries. The model learns to reconstruct the clean ground-truth from the noised inputs, stabilizing the matching process and providing a shortcut for learning. DN-DETR-R50 achieves 44.4 AP at 50 epochs, and when combined with deformable attention, reaches 48.6 AP in 12 epochs.
DINO (DETR with Improved DeNoising Anchor Boxes) was proposed by Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, and Heung-Yeung Shum. Published at ICLR 2023, DINO built upon Deformable DETR, DAB-DETR, and DN-DETR, combining and improving their core ideas into a unified framework.
DINO introduced three key improvements:
Contrastive denoising training (CDN): Unlike DN-DETR which only used positive (noised ground-truth) samples for denoising, DINO adds both positive and negative samples. Negative samples are generated by adding larger noise to ground-truth boxes so they fall outside the matching threshold, teaching the model to explicitly reject false positives.
Mixed query selection: A hybrid approach to initializing anchor boxes, where the positional part of queries is initialized from encoder output features (top-scoring proposals), while the content part remains learned. This provides better initialization than either fully learned or fully selected queries.
Look forward twice: An improved box prediction scheme where each decoder layer refines the box prediction from the previous layer, with gradients flowing through both the current and previous layer's predictions during backpropagation.
DINO achieved the following results on COCO:
| Configuration | Backbone | Epochs | AP |
|---|---|---|---|
| DINO (4 scales) | R50 | 12 | 49.0 |
| DINO (5 scales) | R50 | 12 | 49.4 |
| DINO (4 scales) | R50 | 24 | 50.4 |
| DINO (5 scales) | R50 | 24 | 51.3 |
| DINO (4 scales) | R50 | 36 | 50.9 |
| DINO (4 scales) | Swin-L | 12 | 56.8 |
| DINO (4 scales) | Swin-L | 36 | 58.0 |
| DINO (5 scales) | Swin-L | 36 | 58.5 |
| DINO (Objects365 pretrain) | Swin-L | - | 63.2 |
DINO was the first DETR-like model to achieve state-of-the-art results on the COCO leaderboard, reaching 63.2 AP on COCO val2017 and 63.3 AP on COCO test-dev with a Swin-L backbone and Objects365 pretraining.
RT-DETR (Real-Time DEtection TRansformer) was developed by Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, and Jie Chen from Baidu Inc. Published at CVPR 2024 with the paper title "DETRs Beat YOLOs on Real-time Object Detection," RT-DETR is the first real-time end-to-end object detector based on the DETR framework.
RT-DETR introduced two main innovations:
Efficient hybrid encoder: Instead of processing multi-scale features with a single heavyweight encoder, RT-DETR decouples intra-scale feature interaction and cross-scale feature fusion. Intra-scale interaction uses efficient attention within each scale level, while cross-scale fusion combines information across levels using a lighter-weight mechanism. This substantially reduces computational cost.
IoU-aware query selection: Rather than selecting the top-K encoder features by classification score alone, RT-DETR trains an IoU prediction branch and uses it to select queries with both high classification confidence and high localization quality.
RT-DETR also supports flexible inference speed adjustment by varying the number of decoder layers at test time, without retraining. This allows practitioners to trade accuracy for speed depending on deployment requirements.
Key performance on COCO val2017 (trained with a 6x / 72-epoch schedule):
| Model | Backbone | AP | FPS (T4) |
|---|---|---|---|
| RT-DETR-R18 | ResNet-18 | 46.5 | 217 |
| RT-DETR-R50 | ResNet-50 | 53.1 | 108 |
| RT-DETR-R101 | ResNet-101 | 54.3 | 74 |
| RT-DETR-R50 (Objects365) | ResNet-50 | 55.3 | 108 |
| RT-DETR-R101 (Objects365) | ResNet-101 | 56.2 | 74 |
RT-DETR-R50 outperformed DINO-R50 by 2.2 AP while running roughly 21 times faster. It also surpassed advanced YOLO detectors in both speed and accuracy, marking the first time a DETR-based model matched or beat YOLO-series detectors in the real-time regime.
Co-DETR (Collaborative Hybrid Assignments Training for DETRs) was proposed by Zhuofan Zong, Guanglu Song, and Yu Liu, and presented at ICCV 2023. The paper identified that the one-to-one matching in DETR provides sparse supervision to the encoder, limiting the encoder's ability to learn discriminative features.
Co-DETR addresses this by training multiple auxiliary detection heads (such as ATSS and Faster R-CNN heads) with one-to-many label assignments alongside the primary DETR decoder during training. These auxiliary heads provide denser supervision signals to the encoder. Additionally, positive coordinates extracted from the auxiliary heads are used to create customized positive queries that improve decoder attention learning. At inference time, the auxiliary heads are discarded, so Co-DETR adds no additional computational cost.
Co-DETR achieved notable results: 49.5 AP with a ResNet-50 backbone in only 12 training epochs, and 66.0 AP on COCO test-dev with a ViT-L backbone.
The 66.0 AP on COCO test-dev made Co-DETR with ViT-L one of the highest-performing detectors at the time.
The following table provides an overview of the major DETR variants, their key innovations, and representative performance figures on COCO val2017 with a ResNet-50 backbone.
| Model | Year | Venue | Key Innovation | AP (R50) | Training Epochs |
|---|---|---|---|---|---|
| DETR | 2020 | ECCV 2020 | Set prediction with Hungarian matching | 42.0 | 500 |
| Deformable DETR | 2021 | ICLR 2021 | Multi-scale deformable attention | 46.9* | 50 |
| Conditional DETR | 2021 | ICCV 2021 | Conditional spatial cross-attention | 41.0 | 50 |
| DAB-DETR | 2022 | ICLR 2022 | Dynamic anchor boxes as queries | 42.2 | 50 |
| DN-DETR | 2022 | CVPR 2022 | Query denoising training | 44.4 | 50 |
| DINO | 2023 | ICLR 2023 | Contrastive denoising, mixed query selection | 49.0 | 12 |
| Co-DETR | 2023 | ICCV 2023 | Collaborative hybrid assignments training | 49.5 | 12 |
| RT-DETR | 2024 | CVPR 2024 | Efficient hybrid encoder, real-time inference | 53.1 | 72 |
* Deformable DETR result of 46.9 includes both iterative box refinement and two-stage variant. The base multi-scale model achieves 44.5 AP.
The following table compares DETR-family models with traditional CNN-based detectors across key dimensions.
| Feature | Faster R-CNN | YOLO (v5/v8) | SSD | DETR Family |
|---|---|---|---|---|
| Detection paradigm | Two-stage (RPN + head) | Single-stage, grid-based | Single-stage, multi-scale anchors | Set prediction with transformer |
| Anchor boxes | Yes | Yes (v5) / Anchor-free (v8) | Yes | No |
| NMS required | Yes | Yes | Yes | No |
| Feature extraction | CNN + FPN | CNN (CSPDarknet/etc.) | VGG / MobileNet | CNN backbone + transformer encoder |
| Global context | Limited (local receptive field) | Limited (local receptive field) | Limited (local receptive field) | Full global attention |
| Small object handling | Good (with FPN) | Good (multi-scale) | Moderate | Improved in later variants |
| Training epochs (COCO) | 12-36 | 300+ | 120+ | 12-500 (variant-dependent) |
| End-to-end | No (requires NMS) | No (requires NMS) | No (requires NMS) | Yes |
DETR's introduction in 2020 represented a paradigm shift in object detection research. Its key contributions and lasting impact include several areas discussed below.
DETR was the first model to demonstrate that competitive object detection could be achieved without anchors, NMS, or region proposals. This simplification reduced the engineering burden of building detection systems and eliminated many hyperparameters that required careful tuning for each dataset.
DETR arrived shortly before the Vision Transformer (ViT) paper (Dosovitskiy et al., 2020), making it one of the earliest successful applications of transformers to core computer vision tasks. DETR demonstrated that the self-attention mechanism could capture global context in images effectively, paving the way for transformer-based architectures across segmentation, tracking, pose estimation, and other vision tasks.
By removing post-processing steps like NMS, DETR created truly end-to-end detection systems that could be trained and deployed as a single differentiable pipeline. This property is particularly valuable for deployment on accelerators and edge devices, where custom post-processing operations can be difficult to optimize.
DETR has generated an extensive family of follow-up models. The DETR paper has accumulated over 16,000 citations as of 2025, reflecting its outsized influence on the field. At least 25 significant DETR variants have been published at top venues. The DETR framework has been extended to panoptic segmentation (Panoptic SegFormer, Mask2Former), instance segmentation (Mask DINO), 3D object detection (3DETR), video object detection, open-set detection (Grounding DINO), and multi-modal understanding. The detrex library from IDEA Research provides a unified research platform for DETR-based algorithms.
While the original DETR merely matched Faster R-CNN, later variants have comprehensively surpassed traditional detectors. DINO was the first DETR-like model to top the COCO leaderboard in 2022. Co-DETR pushed performance to 66.0 AP on COCO test-dev. RT-DETR demonstrated that DETR-based models could beat YOLO detectors in the real-time regime, achieving both higher accuracy and faster inference. These milestones confirm that the DETR paradigm is not merely an academic curiosity but a practical and performant approach to detection.
DETR's set prediction framework has proven highly versatile. It has been adapted for autonomous driving perception (3D detection, tracking), medical image analysis (cell detection, lesion detection), remote sensing (satellite image analysis), and video understanding (action detection, object tracking). The Grounding DINO model, which combines DINO with grounded pre-training, achieved 52.5 AP on COCO in a zero-shot setting without any COCO training data, demonstrating the framework's strength in open-vocabulary detection. Grounding DINO was published at ECCV 2024.
The original DETR implementation was released by Facebook AI Research on GitHub (github.com/facebookresearch/detr) and is written in PyTorch. The codebase is notably compact, with the core model implementation comprising roughly 50 lines of PyTorch code for the transformer components, which the authors highlighted as a key advantage of the approach.
DETR and its major variants are also available through the Hugging Face Transformers library, which provides pretrained weights and easy-to-use inference APIs. The facebook/detr-resnet-50 and facebook/detr-resnet-101 checkpoints are among the most downloaded detection models on the Hugging Face Hub. RT-DETR has been integrated into the Ultralytics framework, making it accessible to practitioners already familiar with the YOLO ecosystem.