Road to ML Engineer #52 - DETR Cont'd

Last Edited: 4/11/2025

This blog post introduces the improvements made to DETR models for object detection in computer vision.


In the last article, we covered DETR, which removed NMS by using bipartite matching, and the improvements to the transformer encoder and decoder made by Deformable DETR and DAB DETR. Despite these improvements, DETR was still not competitive with YOLO models in either performance or speed until recent additional improvements. In this article, we will discuss the improvements that led to DETR's competitive performance.

DN DETR

While bipartite matching eliminates NMS, it has an inherent difficulty: there is no guarantee that the same predictions will be matched to the same ground truth boxes across training iterations. Especially when a prediction is unexpectedly matched to a distant box to avoid duplicates, it creates confusing signals for the transformer and slows convergence. Li, F. et al. (2022) highlight this with visualizations in their paper (cited at the bottom of the article).

Although we cannot simply remove bipartite matching, we can generate more stable signals to aid learning. DN DETR addresses this by introducing an auxiliary denoising task. A small set of ground truth boxes with added noise, along with their corresponding class embeddings (20% of which are randomly replaced with other classes), is passed to the decoder, which is prompted simply to denoise them. The order of the noisy ground truth boxes remains fixed, which provides a stable signal for the decoder to iteratively improve the anchors. The denoising loss can be computed using the bounding box loss described in the previous article.
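To make this concrete, here is a minimal sketch of how such noisy queries might be generated. The function and parameter names are my own, and the noise scheme is simplified from the paper:

```python
import torch

def make_denoising_queries(gt_boxes, gt_labels, label_embed,
                           box_noise_scale=0.4, label_flip_rate=0.2):
    # gt_boxes:  (N, 4) ground-truth boxes as (cx, cy, w, h), normalized to [0, 1]
    # gt_labels: (N,) ground-truth class indices
    # label_embed: nn.Embedding mapping class indices to content queries

    # Shift each center and size by a random fraction of the box size.
    noise = (torch.rand_like(gt_boxes) * 2 - 1) * box_noise_scale
    noisy_boxes = (gt_boxes + noise * gt_boxes[:, 2:].repeat(1, 2)).clamp(0, 1)

    # Replace a fraction of the labels so the content queries are noised too.
    flip = torch.rand(gt_labels.shape, device=gt_labels.device) < label_flip_rate
    random_labels = torch.randint_like(gt_labels, label_embed.num_embeddings)
    noisy_labels = torch.where(flip, random_labels, gt_labels)

    # Noisy query i always corresponds to ground truth i, so the denoising
    # loss needs no bipartite matching: just box loss + classification loss.
    return noisy_boxes, label_embed(noisy_labels)
```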

When introducing the auxiliary denoising task, we need to apply an attention mask to prevent the decoder from using the noisy ground truth boxes for the original object detection task. DN DETR applies this attention mask to its self-attention in a way that isolates the normal queries while allowing the noisy queries to attend to themselves and the normal queries. Introducing the denoising task does not increase the inference cost and yet can increase the number of training samples with better signals, improving convergence and performance to some extent.
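A sketch of what that mask might look like, assuming the noisy queries are placed before the normal object queries and ignoring DN DETR's use of multiple denoising groups (which additionally mask each other):

```python
import torch

def denoising_attention_mask(num_noisy, num_queries):
    # True marks pairs that may NOT attend to each other.
    total = num_noisy + num_queries
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Normal queries must never see the noisy queries, which leak ground truth;
    # noisy queries may still attend to themselves and the normal queries.
    mask[num_noisy:, :num_noisy] = True
    return mask
```

The returned mask can then be passed to the decoder's self-attention, for example as the attn_mask argument of PyTorch's nn.MultiheadAttention, where True marks disallowed positions.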

DINO

Although the denoising task is similar to the decoder's original task of refining decoder embeddings and anchor boxes, it differs in that it does not include rejecting nearby anchors as background. To make it more similar and improve the signals, DINO (DETR with improved denoising anchor boxes) uses a contrastive denoising approach. In addition to the original denoising samples, this approach adds heavily noised ground truth boxes that are matched to the no-object class, instead of their original object classes, when computing the loss.
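Here is a minimal sketch of how positive and negative denoising samples might be generated, assuming two noise thresholds as in the paper (the exact values and helper names here are illustrative):

```python
import torch

def contrastive_denoising_boxes(gt_boxes, lambda1=0.5, lambda2=1.0):
    def jitter(boxes, lo, hi):
        # Noise magnitude in [lo, hi) with a random sign per coordinate,
        # scaled by the box size as in the plain denoising task.
        scale = lo + torch.rand_like(boxes) * (hi - lo)
        sign = torch.randint_like(boxes, 2) * 2 - 1
        return (boxes + sign * scale * boxes[:, 2:].repeat(1, 2)).clamp(0, 1)

    positives = jitter(gt_boxes, 0.0, lambda1)      # matched to original classes
    negatives = jitter(gt_boxes, lambda1, lambda2)  # matched to "no object"
    return positives, negatives
```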

[Figure: DINO architecture]

DINO implements two additional architectural modifications on top of DAB Deformable DETR: mixed query selection and a look-forward-twice approach. In mixed query selection, the model selects the K anchors with the highest class scores produced by the encoder and the heads attached to it. The idea of query selection emerged in the Deformable DETR paper, which discussed a two-stage approach using encoder predictions for both the decoder embeddings and the object queries. This is sensible, since visualizations of vanilla DETR's encoder self-attention already demonstrated its capability to attend to individual objects. However, DINO suspected that the class embeddings from the encoder might be ambiguous and not as refined. Therefore, DINO discards the class embeddings and uses static decoder embeddings, while still taking the anchors from the encoder, similar to Faster R-CNN.
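A rough sketch of mixed query selection under those assumptions (helper names like class_head and box_head are hypothetical):

```python
import torch

def mixed_query_selection(encoder_tokens, class_head, box_head,
                          content_queries, k=900):
    # encoder_tokens:  (num_tokens, d) encoder output features
    # content_queries: (k, d) static, learnable decoder embeddings
    scores = class_head(encoder_tokens).max(dim=-1).values  # best class score per token
    topk = scores.topk(k).indices

    # Only the positional part (anchors) comes from the encoder; the content
    # part stays static, unlike a fully two-stage approach.
    anchors = box_head(encoder_tokens[topk]).sigmoid()       # (k, 4) initial anchors
    return anchors, content_queries
```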

Deformable DETR used a look-forward-once approach, where the predicted anchor boxes from the previous layer are detached at each decoder layer. This limits the gradient of the bounding box prediction to backpropagate only within each layer, aiming to stabilize learning by stabilizing the gradient and focusing each decoder layer on improving the quality of its own adjustment. However, DINO suspected that the quality of the anchor box prediction could be improved with a look-forward-twice approach, which adds gradient flow to the previous layer's anchor box, as shown in the above visualization, along with mixed query selection.
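The difference can be sketched as two variants of the per-layer box refinement, assuming anchors are kept in inverse-sigmoid (logit) space as in Deformable DETR. Note that DINO's actual scheme still detaches the anchor passed on to the next layer, which this simplification glosses over:

```python
def refine_once(anchor_logits, delta):
    # Look forward once (Deformable DETR): the incoming anchor is detached, so
    # this layer's box loss cannot reach earlier layers through the anchor.
    return anchor_logits.detach() + delta

def refine_twice(anchor_logits, delta):
    # Look forward twice (DINO): the anchor stays attached, so this layer's
    # box loss also improves the previous layer's prediction.
    return anchor_logits + delta
```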

Combining these minor modifications and training the model on the Objects365 dataset, which contains more images with more boxes of more classes than COCO, DINO became the first DETR-like model to achieve state-of-the-art object detection performance at the time of publication, with relatively fewer parameters. Recent state-of-the-art models mostly use DINO Deformable DETR as a foundation. For example, CoDETR by Zong, Z. et al. (2023), the current state-of-the-art, replaces contrastive denoising with refinements of the anchor box predictions from learnable auxiliary heads (R-CNN heads and other variants) based on the features from the encoder, providing more realistic signals for the decoder and better gradient flow for the encoder.

RT-DETR

Although the performance of DETR models has sufficiently improved through the various techniques introduced so far, they were not primarily designed for, and lack the speed required for, real-time object detection, which is what we often perform in practice (real-time surveillance for security, self-driving cars, automated manufacturing, etc.). Here, DETR was essentially not taking full advantage of bipartite matching, which removes the need for NMS and thus allows faster inference. To improve speed to a level suitable for real-time operation while maintaining accuracy, RT-DETR modified DINO Deformable DETR by introducing a new hybrid encoder that applies attention only to the feature map with the lowest resolution (attention-based intra-scale feature interaction [AIFI]) and fuses it with the other feature maps (cross-scale feature fusion [CCFF]).
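A minimal sketch of the idea, assuming three backbone feature maps already projected to a common channel width, and reducing CCFF to a simple upsample-and-add fusion (the paper uses convolutional fusion blocks):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridEncoderSketch(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.aifi = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, feats):
        # feats: [s3, s4, s5] from the backbone, where s5 has the lowest resolution.
        s3, s4, s5 = feats
        b, c, h, w = s5.shape
        # AIFI: self-attention over the s5 tokens only, keeping the cost low.
        tokens = s5.flatten(2).transpose(1, 2)                   # (b, h*w, c)
        s5 = self.aifi(tokens).transpose(1, 2).reshape(b, c, h, w)
        # CCFF, reduced here to upsample-and-add across scales, top-down.
        s4 = s4 + F.interpolate(s5, size=s4.shape[-2:])
        s3 = s3 + F.interpolate(s4, size=s3.shape[-2:])
        return [s3, s4, s5]
```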

[Figure: RT-DETR architecture]

RT-DETR also identified that vanilla query selection can produce suboptimal object queries that only reflect high confidence in classification and not necessarily the quality of localization. Even though the encoder is trained to locate the selected objects, it might not generalize during inference, leaving uncertainty regarding the localization quality. To reflect this uncertainty in the score, RT-DETR uses uncertainty-minimal query selection, where the encoder is trained to produce the confidence score by adding the uncertainty U(X) = ||P(X) − C(X)|| into the classification loss.

Here, X represents the encoded features, P(X) is the localization distribution for X (potentially the distribution of IoU scores), and C(X) is the classification distribution for X. The introduction of uncertainty increased the number of object queries with high IoU scores and led to higher performance without hindering speed. Combining all these techniques, which improve almost all aspects of DETR (encoder block, decoder block, and losses), RT-DETR became the first DETR-like model to demonstrate generally higher speed and performance than YOLOv8.
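One plausible way to fold this into training, with loc_conf standing in for P(X) (e.g. a predicted IoU) and class_conf for C(X); the paper's exact integration into the classification loss differs in detail:

```python
import torch

def uncertainty_minimal_loss(class_conf, loc_conf, targets, base_cls_loss):
    # U(X) = ||P(X) - C(X)||: disagreement between localization and classification.
    uncertainty = (loc_conf - class_conf).abs()
    # Penalizing U(X) trains the confidence score to also reflect localization
    # quality, so top-K query selection favors well-localized queries.
    return base_cls_loss(class_conf, targets) + uncertainty.mean()
```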

However, RT-DETR struggles with small objects, just like vanilla DETR, possibly because attention is applied only to the lowest-resolution feature map. Its performance is also being caught up by the second-latest YOLO model, YOLOv11 (which we discussed previously in this series), which uses a similar structure that applies attention to the lowest-resolution feature map but utilizes a neck instead of CCFF, and NMS with multiple detection heads instead of bipartite matching with a decoder. YOLOv11 also outperforms RT-DETR in GFLOPs, though we cannot directly compare their FPS and latency due to the presence or absence of NMS and the lack of published measurements.

Hence, my previous claim that YOLO was no longer SOTA in object detection is slightly misleading, as YOLO might still be the best model for real-time object detection. Regardless of whether YOLO or RT-DETR is best, it is an interesting observation, at least in my opinion, that the two different approaches ended up with fairly similar overall structures, though the differences should still not be understated.

Conclusion

In this article, we covered DN DETR, which introduced the denoising task; DINO (and CoDETR), which achieved SOTA at the time with contrastive denoising, mixed query selection, and look forward twice (plus Objects365 training); and RT-DETR, which improved speed for real-time detection (AIFI and CCFF) while maintaining accuracy (uncertainty-minimal query selection). For more details on their implementations, I recommend checking out the original papers cited below.

Resources