Road to ML Engineer #51 - Detection Transformers

Last Edited: 4/7/2025

This blog post introduces detection transformers (DETRs) in computer vision.

ML

In the last article, we discussed YOLOv11, which introduced attention modules to process the high-level feature maps produced by the backbone for improved performance. However, YOLO models are not truly end-to-end and rely on NMS, which is not ideal for speed and performance. In this article, we will introduce alternative approaches that aim to address these issues and improve the performance and speed of object detection.

Detection Transformers

To eliminate NMS from the model, we need to create a new model where duplicate predictions on the same objects are discouraged. YOLOv11 used multiple CNN-based heads, which can only attend to local features, potentially leading to duplicate predictions. Therefore, we can consider replacing these with transformers, which model pairwise interactions between features to produce a set of predictions rather than individual predictions, making them more suited for removing duplicates.

DETR

Although transformers are not computationally efficient, we can mitigate this inefficiency by using them on feature map(s) from the backbone with small resolution(s), just as in YOLOv11. Furthermore, with the increasing availability of data, the advancements in hardware capabilities, and the inherent parallelizability of transformers, we can expect to be able to efficiently train and scale them with less inductive bias to achieve high performance, similar to what's been seen in other fields. However, because transformers output a set of predictions, we need to match individual predictions in the set to the ground truths in a one-to-one manner (bipartite matching) to properly compute the loss and discourage duplication during training.

L_{\text{match}}(y_i, \hat{y}_{\sigma(i)}) = -1_{\{c_i \neq \varnothing\}}\hat{p}_{\sigma(i)}(c_i) + 1_{\{c_i \neq \varnothing\}}L_{\text{box}}(b_i, \hat{b}_{\sigma(i)}) \\
L_{\text{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N}\left[-\log\hat{p}_{\hat{\sigma}(i)}(c_i) + 1_{\{c_i \neq \varnothing\}}L_{\text{box}}(b_i, \hat{b}_{\hat{\sigma}(i)})\right] \\
L_{\text{box}}(b_i, \hat{b}_{\sigma(i)}) = \lambda_{\text{iou}}L_{\text{iou}}(b_i, \hat{b}_{\sigma(i)}) + \lambda_{\text{L1}}\|b_i - \hat{b}_{\sigma(i)}\|_1

Here, \varnothing represents the absence of an object (the "no object" class), \sigma(i) is the index corresponding to the ith ground truth in permutation \sigma, and \hat{\sigma} is the optimal permutation found. We can match the predictions and ground truths in a way that minimizes a matching loss, L_{\text{match}}, consisting of class and bounding box losses. The optimal matching can be found in O(n^3) time using the Hungarian algorithm (a detailed explanation is given in the video The Munkres Assignment Algorithm (Hungarian Algorithm) by CompSci (2016)).
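As a concrete illustration, here is a minimal sketch of the matching step for a single image, using scipy.optimize.linear_sum_assignment as the Hungarian solver. The function name, the cost weights, and the omission of the IoU term in the matching cost are simplifications for brevity, not the exact DETR implementation.

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_classes, gt_boxes,
                    lambda_cls=1.0, lambda_l1=5.0):
    """Match predictions to ground truths by minimizing the pairwise matching cost.

    pred_logits: (N, num_classes) raw class scores for one image
    pred_boxes:  (N, 4) predicted boxes as normalized (cx, cy, w, h)
    gt_classes:  (M,)  ground-truth class indices
    gt_boxes:    (M, 4) ground-truth boxes in the same format
    """
    probs = pred_logits.softmax(-1)                      # (N, num_classes)
    cost_cls = -probs[:, gt_classes]                     # (N, M): -p_hat_sigma(i)(c_i)
    cost_l1 = torch.cdist(pred_boxes, gt_boxes, p=1)     # (N, M): L1 box distance
    cost = lambda_cls * cost_cls + lambda_l1 * cost_l1   # (IoU term omitted here)

    # Hungarian algorithm over the N x M cost matrix (O(n^3)).
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)
```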

After matching the predictions and ground truths, we can compute the loss to backpropagate, called the Hungarian loss, L_{\text{Hungarian}}, which combines class and bounding box losses. For the bounding box loss, we can use a linear combination of IoU loss (which is scale-invariant) and MAE (L1) to avoid the scaling issue of directly predicting the box, where larger boxes automatically result in larger MAE values.
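Given that matching, the Hungarian loss for one image could be assembled roughly as follows. This is a sketch assuming torchvision's generalized_box_iou; the box_cxcywh_to_xyxy helper and the loss weights are illustrative rather than the official implementation.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou

def box_cxcywh_to_xyxy(b):
    # Convert (cx, cy, w, h) to (x1, y1, x2, y2) for the IoU term.
    cx, cy, w, h = b.unbind(-1)
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)

def hungarian_loss(pred_logits, pred_boxes, gt_classes, gt_boxes,
                   pred_idx, gt_idx, no_object_class,
                   lambda_iou=2.0, lambda_l1=5.0):
    """Hungarian loss for one image, given the optimal matching (pred_idx, gt_idx)."""
    # Classification: matched queries take their ground-truth class,
    # all remaining queries are assigned the "no object" class.
    target = torch.full((pred_logits.shape[0],), no_object_class,
                        dtype=torch.long, device=pred_logits.device)
    target[pred_idx] = gt_classes[gt_idx]
    loss_cls = F.cross_entropy(pred_logits, target)

    # Box losses are only computed on matched (non-empty) pairs: L1 + IoU-based term.
    pb = box_cxcywh_to_xyxy(pred_boxes[pred_idx])
    gb = box_cxcywh_to_xyxy(gt_boxes[gt_idx])
    loss_l1 = F.l1_loss(pb, gb, reduction="mean")
    loss_iou = (1 - torch.diag(generalized_box_iou(pb, gb))).mean()
    return loss_cls + lambda_l1 * loss_l1 + lambda_iou * loss_iou
```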

The above visualization shows the architecture of a detection transformer (DETR) that uses a traditional encoder-decoder transformer as the head on top of the CNN backbone, along with the matching and losses described above. Unlike transformers used for natural language tasks, however, the DETR decoder processes learnable, data-agnostic object queries, added to decoder embeddings (initialized to zeros), in parallel. Each embedding in the output sequence is then decoded independently into a bounding box and a class prediction by the corresponding feedforward networks.
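For intuition, a stripped-down DETR-style model can be sketched in plain PyTorch as below, along the lines of the minimal demo in the original paper. The fixed-size learnable positional encoding and the use of object queries directly as decoder input are simplifications; the real architecture adds sinusoidal encodings and object queries at every attention layer on top of zero-initialized decoder embeddings.

```python
import torch
from torch import nn
from torchvision.models import resnet50

class MinimalDETR(nn.Module):
    """A stripped-down DETR-style detector (a sketch, not the full implementation)."""
    def __init__(self, num_classes, num_queries=100, d_model=256, nheads=8):
        super().__init__()
        backbone = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # keep conv stages only
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)       # reduce channels to d_model
        self.transformer = nn.Transformer(d_model, nheads,
                                          num_encoder_layers=6, num_decoder_layers=6)
        self.query_embed = nn.Embedding(num_queries, d_model)           # learnable object queries
        self.pos_embed = nn.Parameter(torch.rand(2500, 1, d_model))     # toy learnable positional enc.
        self.class_head = nn.Linear(d_model, num_classes + 1)           # +1 for the "no object" class
        self.bbox_head = nn.Linear(d_model, 4)                          # (cx, cy, w, h) in [0, 1]

    def forward(self, images):
        feat = self.input_proj(self.backbone(images))                   # (B, d_model, H, W)
        B, d, H, W = feat.shape
        src = feat.flatten(2).permute(2, 0, 1) + self.pos_embed[: H * W]    # (HW, B, d_model)
        tgt = self.query_embed.weight.unsqueeze(1).repeat(1, B, 1)          # (num_queries, B, d_model)
        hs = self.transformer(src, tgt)                                  # decoded query embeddings
        # Each decoded embedding is turned independently into a class and a box.
        return self.class_head(hs), self.bbox_head(hs).sigmoid()
```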

Results & Implications

In the original paper (cited at the bottom of the article), we can observe that DETR achieves results competitive with a well-optimized Faster R-CNN on the COCO dataset, despite having no custom components. The visualization of encoder self-attention in the original paper reveals the encoder's ability to attend to individual objects, including overlapping ones, across separate heads. The original paper also discusses the natural extension of end-to-end DETR to panoptic segmentation, where we perform pixel-wise classification while discriminating between different object instances of the same class.

Just as we added mask heads to Faster R-CNN to create Mask R-CNN, we can add a mask head that processes the decoded sequence, produces mask logits, takes a pixel-wise argmax, and projects the result back to the original image size, as sketched below. (More on panoptic segmentation might be covered in the future.) However, DETR is observed to perform poorly on small objects compared to larger ones, resulting in inferior overall performance compared to YOLO models. This can be attributed to the simplicity of DETR, which, unlike YOLO, only uses the feature map with the smallest resolution to alleviate the computational cost of self-attention. Additionally, DETR exhibits slower convergence, likely due to its low inductive bias on visual data.
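To make the argmax-and-project step concrete, a minimal sketch might look like the following. The function name is hypothetical, and the actual DETR panoptic head additionally filters low-confidence queries and merges "stuff" masks of the same class.

```python
import torch
import torch.nn.functional as F

def panoptic_from_mask_logits(mask_logits, image_size):
    """Turn per-query mask logits into a panoptic map (a minimal sketch).

    mask_logits: (num_queries, h, w) low-resolution mask logits, one map per query
    image_size:  (H, W) of the original image
    """
    # Upsample the logits to the original resolution, then assign every pixel
    # to the query with the highest logit (pixel-wise argmax).
    logits = F.interpolate(mask_logits[None], size=image_size,
                           mode="bilinear", align_corners=False)[0]   # (num_queries, H, W)
    return logits.argmax(dim=0)                                        # (H, W) of query ids
```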

Deformable DETR

When analyzing self-attention, we often find that the model needs to attend to only a small number of features. However, attention is initialized to weight all features equally, and the model continues to attend to every feature (just with different weights), which contributes to the slow convergence and high computational cost that DETR suffers from. To address this, we can introduce an efficient attention mechanism that flexibly attends to a limited number of important features: deformable attention (inspired by deformable convolution).

Deformable Attention

The above conceptually illustrates the computations of multihead deformable attention. A query is passed through linear layers to predict offsets from predefined reference points and the corresponding attention weights. The number of offsets and weights is predetermined to be a small number like 3, and they are produced for each head. Simultaneously, the values are produced from the feature maps by another linear layer. The offsets are then applied to the reference points to sample values at those locations on the feature maps. Finally, we take the weighted sum of the sampled values using the attention weights and adjust the dimension with another linear layer to produce the output.
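A simplified, single-scale version of this module might look as follows in PyTorch, where F.grid_sample performs the bilinear sampling at the deformed locations. This is a sketch under those assumptions; the real Deformable DETR module additionally handles multiple feature levels, padding masks, and a custom CUDA kernel.

```python
import torch
import torch.nn.functional as F
from torch import nn

class DeformableAttention(nn.Module):
    """Single-scale multihead deformable attention (a simplified sketch)."""
    def __init__(self, d_model=256, n_heads=8, n_points=4):
        super().__init__()
        self.n_heads, self.n_points = n_heads, n_points
        self.head_dim = d_model // n_heads
        self.offset_proj = nn.Linear(d_model, n_heads * n_points * 2)  # per-head (dx, dy) offsets
        self.weight_proj = nn.Linear(d_model, n_heads * n_points)      # per-head attention weights
        self.value_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, queries, ref_points, feat):
        """
        queries:    (B, Q, d_model)    query embeddings
        ref_points: (B, Q, 2)          reference points, normalized to [0, 1]
        feat:       (B, d_model, H, W) a single feature map
        """
        B, Q, _ = queries.shape
        H, W = feat.shape[-2:]

        # Project the feature map into values and split the channels across heads.
        value = self.value_proj(feat.flatten(2).transpose(1, 2))           # (B, HW, d_model)
        value = value.transpose(1, 2).reshape(B * self.n_heads, self.head_dim, H, W)

        # Predict sampling offsets (in pixels) and attention weights from the query.
        offsets = self.offset_proj(queries).view(B, Q, self.n_heads, self.n_points, 2)
        weights = self.weight_proj(queries).view(B, Q, self.n_heads, self.n_points).softmax(-1)

        # Sampling locations = reference points + normalized offsets, mapped to [-1, 1].
        norm = torch.tensor([W, H], dtype=feat.dtype, device=feat.device)
        loc = ref_points[:, :, None, None, :] + offsets / norm             # (B, Q, heads, points, 2)
        grid = (2 * loc - 1).permute(0, 2, 1, 3, 4).reshape(B * self.n_heads, Q, self.n_points, 2)

        # Bilinear sampling of the values at each deformed location.
        sampled = F.grid_sample(value, grid, mode="bilinear", align_corners=False)
        # sampled: (B*heads, head_dim, Q, points) -> weighted sum over the sampled points.
        weights = weights.permute(0, 2, 1, 3).reshape(B * self.n_heads, 1, Q, self.n_points)
        out = (sampled * weights).sum(-1)                                   # (B*heads, head_dim, Q)
        out = out.reshape(B, self.n_heads * self.head_dim, Q).transpose(1, 2)
        return self.out_proj(out)                                           # (B, Q, d_model)
```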

Bilinear Interpolation

For sampling values at offsets with floating-point coordinates, we use bilinear interpolation, which takes a weighted sum of the 4 nearest samples based on their distances to the sampling location. Equivalently, each neighbor is weighted by the area of the rectangle formed between the sampling location and the diagonally opposite neighbor. The above illustrates the bilinear interpolation operation used in this example. We typically use bilinear interpolation as a differentiable way of upscaling images (PyTorch has an interpolate function that supports upscaling with bilinear interpolation and other methods). The official source code uses CUDA C++ (which I might cover in the future) for fast, parallel bilinear interpolation on the embeddings.
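Written out by hand, the operation looks roughly like this (a didactic sketch that assumes the coordinates lie inside the map; in practice torch.nn.functional.grid_sample does the same sampling in a batched, differentiable way, as in the deformable attention sketch above):

```python
import math

def bilinear_sample(feature_map, x, y):
    """Sample a (C, H, W) feature map at fractional coordinates (x, y): a minimal sketch.

    Assumes 0 <= x <= W - 1 and 0 <= y <= H - 1 (no padding or boundary handling).
    """
    H, W = feature_map.shape[-2:]
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)

    # Fractional distances to the top-left neighbor.
    dx, dy = x - x0, y - y0

    # Each of the 4 neighbors is weighted by the area of the rectangle opposite to it.
    return (feature_map[:, y0, x0] * (1 - dx) * (1 - dy)
            + feature_map[:, y0, x1] * dx * (1 - dy)
            + feature_map[:, y1, x0] * (1 - dx) * dy
            + feature_map[:, y1, x1] * dx * dy)
```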

Deformable DETR

Although the concept is relatively easy to comprehend, the implementation of multihead deformable attention is quite complicated. This is because we need to work with multiple feature maps with different resolutions, transformed into a sequence of embeddings with positional and level encodings. This means we need to sample the deformed reference points from all the feature maps with different scales, keeping track of which part of the sequence of embeddings represents each feature map and the ratios of paddings in each feature map to properly pick the right samples.

The above shows the details of the multihead deformable attention module to a limited extent. (The numbers of heads and offsets are variables, and the embedding dimension depends on the feature map size and the linear layer.) Although the complexity is now linear, because only a constant number of values is sampled from each feature map (far fewer than HW), Deformable DETR did not deliver an extreme improvement in speed, likely due to the use of multiple feature maps and the complicated computations in its implementation. Surprisingly, it also did not lead to a dramatic improvement in performance, though it improved both speed and performance enough that deformable attention modules still appear in state-of-the-art DETR models.
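To make the bookkeeping described above concrete, flattening the multi-scale feature maps into a single sequence while tracking where each level starts might look like the sketch below. The names are illustrative, and positional encodings and the padding valid ratios are omitted.

```python
import torch

def flatten_multiscale(features, level_embed):
    """Flatten multi-scale feature maps into a single sequence of embeddings (a sketch).

    features:    list of L tensors, each (B, d_model, H_l, W_l)
    level_embed: (L, d_model) learnable level encodings, one per scale
    Returns the concatenated sequence plus the bookkeeping needed to map sequence
    positions back to (level, y, x) locations when sampling deformed reference points.
    """
    flat, shapes = [], []
    for lvl, feat in enumerate(features):
        H, W = feat.shape[-2:]
        shapes.append((H, W))
        seq = feat.flatten(2).transpose(1, 2)                # (B, H_l*W_l, d_model)
        flat.append(seq + level_embed[lvl].view(1, 1, -1))   # tag each token with its level
    src = torch.cat(flat, dim=1)                             # (B, sum_l H_l*W_l, d_model)

    shapes = torch.as_tensor(shapes)                         # (L, 2), spatial shape per level
    # Index in the concatenated sequence where each level's tokens begin.
    level_start = torch.cat([shapes.new_zeros(1), shapes.prod(dim=1).cumsum(0)[:-1]])
    return src, shapes, level_start
```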

DAB DETR

Deformable DETR primarily focused on improving the transformer encoder, while we've been using decoder embeddings and learnable object queries without much concern. The implicit assumption has been that decoder embeddings should contain generic information about the contents of the image, and that object queries should encode the locations of the bounding boxes, which are then refined by signals from the encoder during the decoder's cross-attention. However, it was empirically observed that an encoder-only structure could speed up convergence and that cross-attention was the factor contributing to slow convergence, even though it differs from self-attention only in its use of decoder embeddings and object queries. Hence, we can analyze these components to find areas for improvement.

DETR Positional Attention Maps

Here, we can observe how positional embeddings applied to the keys from the encoder output interfere with the object queries to produce positional attention. Researchers found that the positional attention maps of vanilla DETR are foggy, as we can see above. To produce better positional attention maps, they proposed using sinusoidal positional encodings of x and y from learnable anchors (x, y, w, h) instead of the vanilla embeddings used as object queries, allowing for better interference (the approach used in Conditional DETR).

DAB DETR

This positional modulation, through interference, led to positional attention maps shaped like Gaussian kernels, slightly faster convergence, and improved performance. However, the size of these kernels was fixed. To address this, learnable weights (\frac{w_o}{w} and \frac{h_o}{h}) can be multiplied with the positional encodings of x and y, respectively, allowing the model to control the positional attention and manipulate the kernel size. The output of each decoder layer can be used to iteratively refine the anchor boxes until the final set of predictions is made, while skip connections pass the gradient directly to the initial learnable anchor boxes. The introduction of this size-modulated cross-attention by DAB DETR (Dynamic Anchor Boxes) improved the convergence and performance of both vanilla DETR and Deformable DETR.
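A sketch of how such a size-modulated positional query could be built from the anchors is shown below. The sine embedding here is a simplified, non-interleaved variant, and w_o, h_o stand for the learnable reference sizes described above; the function names are illustrative.

```python
import math
import torch

def sine_embed(coord, d_model=256, temperature=10000.0):
    """Sinusoidal encoding of a normalized coordinate: (B, Q) -> (B, Q, d_model). A sketch."""
    i = torch.arange(d_model // 2, dtype=torch.float32, device=coord.device)
    freq = temperature ** (2 * i / d_model)
    angles = coord[..., None] * 2 * math.pi / freq           # (B, Q, d_model // 2)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)   # (B, Q, d_model)

def positional_query(anchors, w_o, h_o):
    """Build a size-modulated positional query from anchor boxes (a sketch).

    anchors: (B, Q, 4) learnable (x, y, w, h), normalized to [0, 1]
    w_o, h_o: (B, Q) reference sizes used to modulate the attention width and height
    """
    x, y, w, h = anchors.unbind(-1)
    pos_x = sine_embed(x) * (w_o / w)[..., None]   # widen or narrow attention along x
    pos_y = sine_embed(y) * (h_o / h)[..., None]   # widen or narrow attention along y
    return torch.cat([pos_x, pos_y], dim=-1)       # (B, Q, 2 * d_model) positional query
```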

Conclusion

In this article, we discussed the detection transformer (DETR), which introduced bipartite matching to eliminate NMS; Deformable DETR, which improved the encoder attention; and DAB DETR, which improved the object queries for better positional attention in the decoder. These improvements in the encoder and decoder improved DETR's convergence, inference speed, and performance, though DAB Deformable DETR still does not outperform YOLOv8. In the next article, we will introduce further improvements that led to competitive performance by DETR.

Resources