Road to ML Engineer #64 - Monocular Depth Estimation

Last Edited: 6/28/2025

The blog post introduces monocular depth estimation in computer vision.


So far, we have looked into single-view metrology, epipolar geometry, SfM, and stereo systems to understand the capabilities and limitations of mathematically analyzing 3D scenes and objects from images. We have also briefly discussed how humans might overcome those limitations with assumptions and inferences, and how deep learning models could potentially do something similar. Hence, from this article onwards, we will discuss how deep learning techniques, both ones we have covered and ones we have not, can be leveraged to make an algorithm learn to understand the 3D world with appropriate assumptions and inferences, like humans (and perhaps even better than humans), starting with monocular depth estimation.

MiDaS & DPT

From single-view metrology, we found that a single image only lets us deduce the orientations of 3D planes using horizon (vanishing) lines, and we discussed how humans might infer relative depths from experience with the real world. For a deep learning model to do the same, data quantity and quality are crucial for high-quality inference and generalization, as we have learned throughout the deep learning series, yet both are hard to obtain. MiDaS (most likely standing for mixed datasets) aimed to overcome the inaccessibility of objects' absolute scales and the difficulty of obtaining sufficient data quantity and quality by using a scale- and shift-invariant loss and Pareto-optimal dataset mixing.

$$
t(d) = \text{median}(d), \quad s(d) = \frac{1}{HW} \sum_{i=1}^{HW} |d_i - t(d)| \\
\hat{d}_i = \frac{d_i - t(d)}{s(d)} \\
L = \sum_{i=1}^{HW} |\hat{d}^*_i - \hat{d}_i|
$$

The scale- and shift-invariant loss is implemented by shifting both the predictions $d^*$ and ground truths $d$ by their median and scaling them by their mean absolute deviation from the median, allowing the model to focus on learning the orientations of objects and their relative depths and reducing the distribution shifts between the datasets. Pareto-optimal dataset mixing interprets the datasets as different tasks and seeks a solution where the loss cannot be decreased on any dataset without increasing it on another, and it was empirically observed to perform better than naive mixing. MiDaS also uses semantic segmentation as an auxiliary task to train the encoder and decoder, as semantic understanding is expected to help depth estimation.
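
To make the loss concrete, here is a minimal PyTorch sketch of the scale- and shift-invariant loss defined above. The function names, tensor shapes, and the final averaging over the batch are my own illustrative choices, not taken from the MiDaS codebase.

```python
import torch

def ssi_normalize(d: torch.Tensor) -> torch.Tensor:
    """Align a depth map to zero median and unit mean absolute deviation.

    d: (B, H*W) depth/disparity values per image.
    """
    t = d.median(dim=1, keepdim=True).values        # shift t(d)
    s = (d - t).abs().mean(dim=1, keepdim=True)      # scale s(d)
    return (d - t) / s.clamp(min=1e-6)

def ssi_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Scale- and shift-invariant L1 loss: sum over pixels, mean over the batch."""
    return (ssi_normalize(target) - ssi_normalize(pred)).abs().sum(dim=1).mean()

# Usage: flatten (B, H, W) maps to (B, H*W) before calling ssi_loss.
pred = torch.rand(2, 128 * 128)
gt = torch.rand(2, 128 * 128)
loss = ssi_loss(pred, gt)
```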

DPT

A few months after MiDaS achieved fairly high-quality depth maps as a deep convolutional network with a ResNet backbone, the vision transformer paper came out, demonstrating transformers' global (far less limited) receptive fields and higher performance on image classification. Suspecting that the larger receptive fields and the constant feature resolution across layers could prevent feature loss and be well suited for dense prediction, DPT (dense prediction transformer) replaced MiDaS's convolutional backbone with a ViT (or a hybrid ResNet + ViT) and adapted the decoder: token embeddings taken from several transformer stages are read (handling the extra readout token), concatenated back into image-like feature maps, resampled to different resolutions, and progressively fused by convolutional blocks, from which the head produces the final output. With this change of backbone and decoder, DPT achieved better metrics than MiDaS (and, on segmentation benchmarks, than dedicated semantic segmentation models) and demonstrated better global coherence and finer details.
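
To illustrate the decoder side, below is a heavily simplified PyTorch sketch of a DPT-style reassemble step. It assumes the simplest readout handling (dropping the readout token) and only the upsampling case; the class name, dimensions, and the omission of the fusion stage are my own simplifications, not the official DPT implementation.

```python
import torch
import torch.nn as nn

class ReassembleIgnoreReadout(nn.Module):
    """Sketch of a reassemble block: transformer tokens -> 2D feature map at a chosen scale."""

    def __init__(self, embed_dim: int, out_channels: int, scale: int):
        super().__init__()
        self.project = nn.Conv2d(embed_dim, out_channels, kernel_size=1)
        # Only the upsampling case is sketched here, via a transposed convolution.
        self.resample = nn.ConvTranspose2d(out_channels, out_channels,
                                           kernel_size=scale, stride=scale)

    def forward(self, tokens: torch.Tensor, grid_hw: tuple) -> torch.Tensor:
        # tokens: (B, 1 + H/p * W/p, D), with a leading readout (class) token.
        patch_tokens = tokens[:, 1:, :]                           # "read": drop the readout token
        b, n, d = patch_tokens.shape
        h, w = grid_hw
        feat = patch_tokens.transpose(1, 2).reshape(b, d, h, w)   # "concatenate" into an image-like grid
        return self.resample(self.project(feat))                  # "resample" to the target resolution

# Usage with hypothetical ViT-Base-like dimensions (patch size 16, 224x224 input):
tokens = torch.rand(1, 1 + 14 * 14, 768)
block = ReassembleIgnoreReadout(embed_dim=768, out_channels=256, scale=4)
fmap = block(tokens, grid_hw=(14, 14))   # -> (1, 256, 56, 56)
```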

Depth Anything

Although MiDaS and DPT achieved some level of success in monocular depth estimation, they still lacked enough high-quality data to fully train the models. This was because training required stereo cameras or manual annotations, limiting the scenes to environments where those vision systems could be employed. Image segmentation tasks faced similar data collection challenges, but the Segment Anything Model (SAM) achieved remarkable results by training a zero-shot, promptable segmentation model. Specifically, SAM created a data engine with a model in the loop and utilized pretrained image and prompt encoders (if you are unfamiliar with SAM, I recommend checking out the article Road to ML Engineer #58 - Segment Anything).

Inspired by SAM, Depth Anything created pseudo-labeled datasets by training a large teacher model on labeled datasets and using it to annotate vast unlabeled datasets (like SA-1B). This pseudo-labeled data was then fed to a smaller student model, along with the labeled data, under data perturbations (strong color distortions and CutMix). The perturbations in this knowledge distillation were intended to prevent the student from simply imitating the teacher and reproducing its errors. The teacher's encoder was initialized with pretrained DINOv2 weights for semantic understanding, and a cosine similarity (feature alignment) loss between the student's encoded features and those of a frozen pretrained DINOv2 encoder was introduced to provide semantic assistance to the student (while excluding pixels whose similarity already exceeds a threshold, to avoid forcing perfect imitation).
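
As a rough illustration of this semantic assistance, here is a small PyTorch sketch of a cosine-similarity feature alignment loss with a tolerance cutoff, in the spirit described above. The function name, tensor shapes, and the default margin value are illustrative assumptions, not the paper's exact formulation or setting.

```python
import torch
import torch.nn.functional as F

def feature_alignment_loss(student_feat: torch.Tensor,
                           frozen_feat: torch.Tensor,
                           margin: float = 0.85) -> torch.Tensor:
    """Cosine-similarity feature alignment with a tolerance cutoff.

    student_feat, frozen_feat: (B, N, D) patch features from the student
    encoder and a frozen pretrained (e.g. DINOv2) encoder.
    Patches whose similarity already exceeds `margin` (illustrative value)
    are dropped from the loss so the student is not forced to imitate perfectly.
    """
    sim = F.cosine_similarity(student_feat, frozen_feat, dim=-1)  # (B, N)
    mask = sim < margin                                           # only penalize poorly aligned patches
    if mask.any():
        return (1.0 - sim[mask]).mean()
    return sim.new_zeros(())
```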

Both the teacher and student models used a DPT decoder with a ViT backbone and the scale- and shift-invariant loss from MiDaS. Depth Anything achieved remarkable results in zero-shot monocular depth estimation and demonstrated moderate capability in metric depth estimation (which aims to predict absolute depths instead of relative depths) after fine-tuning. The higher performance can be attributed to the unlabeled (pseudo-labeled) datasets that expanded the effective training set and to the semantic assistance from DINOv2, consistent with the observations made with MiDaS, DPT, and SAM.

Depth Anything V2

While real labeled images helped Depth Anything V1 (V1) achieve some generalizability to unseen real images, real labels inevitably suffer from measurement errors and stereo matching (correspondence) issues and tend to be coarse, which is not ideal. Synthetic labeled images, on the other hand, do not inherently have these problems and even avoid ethical and privacy concerns, making them seemingly ideal training data. However, they have their own unique problems, such as being "too perfect," described in the Depth Anything V2 paper as "too clean in color and ordered in layout," and a tendency toward limited scene variety (living rooms and streets). This distribution shift, or sim-to-real gap, can cause generalization problems when models trained on synthetic images are applied to real ones.

To address this gap while enjoying the precision and quantity of synthetic images, Depth Anything V2 (V2) leverages DINOv2-G, the largest DINOv2 with the highest generalization capability, as a teacher: it is trained on synthetic images and then used to pseudo-label unlabeled real images for teacher-student learning. This created a highly precise and diverse dataset with minimal distribution shift. V2 also created DA-2K, a benchmark that incorporates SAM and a manual voting process to annotate the relative depths of objects, for measuring performance on fine-grained predictions, robustness in complex scenes, transparent objects, and other relevant aspects not typically captured by conventional metrics. V2 followed the same model architecture as V1 and achieved similar results on conventional benchmarks, but outperformed V1 on DA-2K.
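
To summarize the V2 training recipe, here is a toy, end-to-end runnable sketch of the three-stage pipeline (teacher trained on synthetic labels, pseudo-labeling of unlabeled real images, student trained on pseudo-labels). The models, data, loss, and optimizer settings are all stand-ins for illustration only; the actual pipeline uses DINOv2-G/DPT models and the losses discussed earlier.

```python
import torch
import torch.nn as nn

# Stand-ins; in the real pipeline these would be DINOv2-G / DPT models
# and actual synthetic + unlabeled real image datasets.
teacher = nn.Conv2d(3, 1, kernel_size=3, padding=1)   # stand-in for the large teacher
student = nn.Conv2d(3, 1, kernel_size=3, padding=1)   # stand-in for a smaller student

synthetic_images = torch.rand(8, 3, 64, 64)
synthetic_depths = torch.rand(8, 1, 64, 64)
unlabeled_real_images = torch.rand(32, 3, 64, 64)

l1 = nn.L1Loss()

# Stage 1: train the teacher on precise synthetic labels.
opt_t = torch.optim.Adam(teacher.parameters(), lr=1e-3)
for _ in range(10):
    opt_t.zero_grad()
    l1(teacher(synthetic_images), synthetic_depths).backward()
    opt_t.step()

# Stage 2: pseudo-label the large unlabeled real set with the frozen teacher.
with torch.no_grad():
    pseudo_depths = teacher(unlabeled_real_images)

# Stage 3: train the student on pseudo-labeled real images only.
opt_s = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(10):
    opt_s.zero_grad()
    l1(student(unlabeled_real_images), pseudo_depths).backward()
    opt_s.step()
```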

Conclusion

In this article, we covered MiDaS, DPT, and the Depth Anything models, which utilize various deep learning techniques we have discussed so far in the series (and a few that we hadn't) to perform inferences and overcome the limitations of single-view metrology, much like humans do. Monocular depth estimation is helpful in situations with strict hardware limitations and even for other complex computer vision tasks, which we will discuss in future articles. For more information on the techniques and their qualitative evaluations, I recommend checking out the papers cited below.

Resources