This blog post introduces improvements made to ViT (Vision Transformer) in computer vision.

In previous articles on DETR, we discussed several improvements to the encoder, decoder, query selection, and loss functions that significantly improved performance, assuming a convolutional backbone like ResNet50. However, it is also possible to use a transformer-based backbone, as demonstrated by Roboflow's RF-DETR, which is essentially Deformable DETR using DINO (a gigantic ViT pretrained by Meta via self-supervision, not to be confused with the DINO DETR model) as its backbone. Some papers on previously discussed DETR models also experimented with ViT backbones and reported strong performance. Since a backbone can be reused in virtually any image task beyond object detection, improving the backbone is of paramount importance. Hence, in this article, we will discuss improvements made to ViT.
Swin Transformers
Although ViT applies self-attention to image patches (rather than pixels) to make computation tractable, it still scales quadratically because every patch attends to every other patch. ViT also lacks a hierarchical architecture and applies attention globally, which gives it a large receptive field from the very first layer and reduces inductive bias, allowing potential performance gains. However, this comes at the expense of computational cost, a tendency towards slower convergence, higher data requirements, and incompatibility with object detectors that rely on hierarchical feature maps.
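To make the scaling concrete, the Swin Transformer paper compares the complexity of global multihead self-attention (MSA) with that of the window-based attention (W-MSA) introduced in the next paragraph, for an $h \times w$ patch grid with channel dimension $C$ and window side length $M$:

$$\Omega(\text{MSA}) = 4hwC^2 + 2(hw)^2C, \qquad \Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC.$$

The first term covers the linear projections in both cases; only the attention term differs, quadratic in the number of patches $hw$ for MSA but linear once the window size $M$ is fixed.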

Hence, the Swin (shifted window) Transformer introduces window-based multihead self-attention (W-MSA), which confines attention to patches within the same window, effectively reintroducing the inductive bias of locality and reducing complexity to linear in the number of patches. To allow neighboring patches in different windows to still attend to each other, the windows are shifted between consecutive blocks (SW-MSA), loosely analogous to how overlapping convolutional kernels connect neighboring regions in CNNs. Furthermore, by gradually merging patches (concatenating each group of four neighboring patches and passing the result through a dense layer) while keeping the window size fixed, it effectively reduces the number of windows and lets each window cover a larger portion of the image, analogous to how CNNs downsample. This produces hierarchical feature maps with features ranging from local to global attention.
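As a rough illustration, here is a minimal PyTorch sketch of the two operations just described: partitioning a feature map into non-overlapping $M \times M$ windows for W-MSA, and merging each $2 \times 2$ group of neighboring patches through a dense layer to halve the resolution. The tensor layout and function names are my own simplifications rather than the official implementation.

```python
import torch
import torch.nn as nn

def window_partition(x, M):
    """Split a (B, H, W, C) feature map into non-overlapping M x M windows.
    Returns (num_windows * B, M * M, C) so attention runs within each window.
    Assumes H and W are divisible by M."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

class PatchMerging(nn.Module):
    """Concatenate each 2x2 group of neighboring patches and project 4C -> 2C,
    halving the spatial resolution like a CNN downsampling stage."""
    def __init__(self, C):
        super().__init__()
        self.norm = nn.LayerNorm(4 * C)
        self.reduction = nn.Linear(4 * C, 2 * C, bias=False)

    def forward(self, x):            # x: (B, H, W, C), H and W even
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))
```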

Since Swin Transformer merges patches, W-MSA and SW-MSA layers must rely on the relative positions of the patches instead of absolute positions. Therefore, they add a learnable relative position bias to the attention weights, $\text{Attention}(Q, K, V) = \text{SoftMax}(QK^\top / \sqrt{d} + B)V$, where $B \in \mathbb{R}^{M^2 \times M^2}$ covers all patch pairs within a window ($M$ is the window's side length) and its values are taken from a smaller learnable table $\hat{B} \in \mathbb{R}^{(2M-1) \times (2M-1)}$, since relative positions along each axis lie in $[-M+1, M-1]$. Swin Transformer also uses a roll function (torch.roll in PyTorch, tf.roll in TensorFlow) to perform cyclic shifts and applies an attention mask that prevents patches from attending to patches they were not adjacent to before the shift, enabling a correct and efficient SW-MSA.
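Below is a minimal PyTorch sketch of the cyclic shift and attention mask, reusing `window_partition` from the sketch above. Details such as using $-\infty$ instead of a large negative constant are simplifying assumptions rather than the reference implementation, but the mechanism is the same: roll the feature map, partition it into windows, and mask out patch pairs that only became neighbors because of the wrap-around.

```python
import torch

def cyclic_shift_and_mask(x, M, shift):
    """Cyclically shift a (B, H, W, C) feature map with torch.roll, then build a
    per-window additive mask that blocks attention between patches that were not
    adjacent before the shift (they only meet because of the wrap-around)."""
    B, H, W, C = x.shape
    x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

    # Label each position by the pre-shift region it came from (9 regions in total).
    region = torch.zeros(1, H, W, 1)
    cnt = 0
    for hs in (slice(0, -M), slice(-M, -shift), slice(-shift, None)):
        for ws in (slice(0, -M), slice(-M, -shift), slice(-shift, None)):
            region[:, hs, ws, :] = cnt
            cnt += 1

    win = window_partition(region, M).squeeze(-1)      # (num_windows, M*M)
    mask = win.unsqueeze(1) - win.unsqueeze(2)         # (num_windows, M*M, M*M)
    mask = mask.masked_fill(mask != 0, float('-inf'))  # block cross-region pairs
    return x, mask  # add `mask` to the attention logits before the softmax
```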
Utilizing these techniques, along with patch embedding via a convolutional layer, data augmentations, and stochastic depth (randomly dropping residual branches so that some samples propagate only through the skip connections), Swin Transformer achieved state-of-the-art results in image classification, semantic segmentation, and object detection. In fact, DINO DETR achieved its best results with a Swin-L (large Swin Transformer) backbone instead of ResNet50 (although CoDETR surprisingly performed better with a ViT-L backbone).
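For reference, stochastic depth is commonly implemented as "drop path": during training, the residual branch is zeroed out for a random subset of samples so that those samples propagate through the skip connection alone, and the surviving branches are rescaled. The sketch below is a generic illustration of that idea, not code from the Swin repository.

```python
import torch

def drop_path(x, drop_prob: float = 0.1, training: bool = True):
    """Randomly zero the residual branch per sample and rescale the survivors,
    so the expected output matches evaluation-time behavior."""
    if drop_prob == 0.0 or not training:
        return x
    keep_prob = 1.0 - drop_prob
    mask_shape = (x.shape[0],) + (1,) * (x.dim() - 1)   # broadcast over all but batch
    mask = x.new_empty(mask_shape).bernoulli_(keep_prob)
    return x * mask / keep_prob

# Inside a transformer block: x = x + drop_path(branch(norm(x)), p, self.training)
```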
DAT
Swin Transformer applies data-agnostic attention, meaning the set of patches each patch can attend to is fixed in advance by the window layout rather than chosen based on the data. Hence, even with the shifted window mechanism, it can fail to attend to important patches that lie slightly outside the window while still attending to uninformative patches inside it. This is not a problem for the initial blocks, which are intended to capture local attention, but it can become problematic for the later blocks. Therefore, the Deformable Attention Transformer (DAT) substitutes the data-agnostic SW-MSA layers in the last few blocks with data-dependent deformable attention, which deforms pre-defined reference points using offsets computed from the data.

The figure above illustrates the deformable attention used in DAT. Unlike the deformable attention in Deformable DETR, which computes a few offsets (typically 4) for every query and computes attention weights without deriving keys (making it closer to a convolutional operation than to attention), DAT computes many query-agnostic offsets and uses the deformed points to obtain both keys and values for self-attention. This is based on the observation that global attention patterns do not vary much across queries, which allows DAT to avoid computing offsets for every query, compute more offsets efficiently, and retain more features, as befits a backbone.
For efficient offset generation, DAT uses an offset network consisting of a depthwise convolution (a $k \times k$ kernel applied to each channel separately) with stride $r$, followed by a pointwise ($1 \times 1$) convolution with output channel dimension 2. The reference points are linearly spaced 2D coordinates $\{(0, 0), \ldots, (H_G - 1, W_G - 1)\}$, normalized to the range $[-1, +1]$ (and unnormalized back when sampling). Similarly to the Swin Transformer, DAT applies a relative position bias from a learnable bias table, though it must use a parameterized bias table that takes continuous relative displacements between queries and keys (which, with coordinates normalized to $[-1, +1]$, fall within $[-2, +2]$) to account for all possible offset values, since we do not know where the keys will end up in advance due to the offsets.
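To make the above concrete, here is a minimal PyTorch sketch of query-agnostic offset generation and deformed sampling. The kernel size `k`, stride `r`, and the way offsets are added and clamped are illustrative assumptions (the paper additionally scales the offsets and places reference points at pixel centers); the key idea is that offsets are predicted once from the feature map and the deformed points are sampled bilinearly with `grid_sample` to produce the features that feed the key and value projections.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformedSampling(nn.Module):
    """Sketch of DAT-style deformable sampling: predict query-agnostic offsets
    with a depthwise + pointwise conv, add them to a uniform reference grid,
    and bilinearly sample deformed key/value features with grid_sample."""
    def __init__(self, C, k=5, r=2):
        super().__init__()
        self.offset_net = nn.Sequential(
            nn.Conv2d(C, C, k, stride=r, padding=k // 2, groups=C),  # depthwise k x k, stride r
            nn.GELU(),
            nn.Conv2d(C, 2, 1),                                      # pointwise 1x1 -> 2D offsets
        )

    def forward(self, x):                       # x: (B, C, H, W)
        # (In DAT, the input here is the query-projected feature map.)
        offsets = self.offset_net(x)            # (B, 2, Hg, Wg), one offset per reference point
        B, _, Hg, Wg = offsets.shape

        # Uniform reference grid normalized to [-1, +1] (grid_sample's convention).
        ys = torch.linspace(-1.0, 1.0, Hg, device=x.device)
        xs = torch.linspace(-1.0, 1.0, Wg, device=x.device)
        ref_y, ref_x = torch.meshgrid(ys, xs, indexing='ij')
        ref = torch.stack((ref_x, ref_y), dim=-1)              # (Hg, Wg, 2) in (x, y) order

        # Deform the grid and sample features; the result feeds the key/value projections.
        grid = ref.unsqueeze(0) + offsets.permute(0, 2, 3, 1)  # (B, Hg, Wg, 2)
        sampled = F.grid_sample(x, grid.clamp(-1, 1), mode='bilinear', align_corners=True)
        return sampled                                          # (B, C, Hg, Wg)
```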
By introducing the deformable attention module in the last few stages, the original paper demonstrated that DAT consistently, if slightly, outperforms the Swin Transformer in image classification (ImageNet), semantic segmentation (ADE20K), object detection, and instance segmentation (COCO). Since it still produces hierarchical feature maps, it is compatible with the object detectors we have discussed so far (though, as far as I know, this combination has not been explored as extensively as Swin has, potentially because DAT is less accessible and its reported performance gains are relatively minor).
Conclusion
In this article, we discussed architectural improvements to the ViT backbone made by Swin and DAT, which brought higher performance, lower computational complexity, and better compatibility with object detectors and other vision models that utilize hierarchical feature maps. However, it is important to reiterate that these models have stronger inductive biases than vanilla ViT, which may make them less performant as more compute and data become available. (This might be why RF-DETR and CoDETR use, and perform better with, a vanilla ViT backbone.)
Resources
- Gaiduk, M. 2024. Reading SWIN transformer source code - Image Recognition with Transformers. YouTube.
- Hayashi, M. 2022. Depthwise Separable Convolution (深さ単位分離可能畳み込み). CVML.
- Liu, Z. et al. 2021. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ArXiv.
- Robicheaux, P. et al. 2025. RF-DETR: A SOTA Real-Time Object Detection Model. Roboflow.
- Xia, Z. et al. 2022. Vision Transformer with Deformable Attention. ArXiv.