Road to ML Engineer #65 - Stereo Matching

Last Edited: 7/4/2025

The blog post introduces deep stereo matching in computer vision.


In the last article, we discussed how deep learning models achieved some level of success in monocular depth estimation, despite the limitations of single-view metrology, by utilizing a scale- and shift-invariant loss, DPT, semantic assistance, and a data engine for collecting a large volume of high-quality data. As a natural progression, this article discusses the deep learning approach to stereo matching, which builds on epipolar geometry: we match corresponding projections across two images and estimate depth from their disparity.
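As a quick refresher (a standard relation for a rectified stereo pair, not specific to any model discussed here), the disparity $d$ between a matched pixel pair is inversely proportional to its depth $Z$, given the focal length $f$ and the camera baseline $B$:

d = x_l - x_r = \frac{f B}{Z} \quad \Longleftrightarrow \quad Z = \frac{f B}{d}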

3D CNNs & ConvGRU

One prominent approach to deep stereo matching is to construct a 4D cost volume ($C \times D \times H \times W$), where $D$ represents disparity, the horizontal displacement between corresponding pixels, and $C$ represents channels, the latent features that we aggregate to arrive at the matching cost (how costly it is to match a pixel pair). After aggregating and collapsing the channel dimension ($D \times H \times W$), we can take the argmin (or a softmax-weighted average) over disparities to determine the disparity with the minimal matching cost for every pixel ($1 \times H \times W$) and obtain the disparity map, which is inversely proportional to depth.
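As an illustrative sketch (not from any particular paper; the tensor names and shapes are assumptions), the softmax-based "soft argmin" step that turns an aggregated $D \times H \times W$ cost volume into a disparity map might look like this in PyTorch:

```python
import torch
import torch.nn.functional as F

def soft_argmin_disparity(cost: torch.Tensor) -> torch.Tensor:
    """Regress a disparity map from an aggregated cost volume.

    cost: (B, D, H, W) matching costs, lower = better match.
    returns: (B, 1, H, W) expected disparity per pixel.
    """
    B, D, H, W = cost.shape
    # Convert costs to a probability distribution over disparities
    # (negate so that low cost -> high probability).
    prob = F.softmax(-cost, dim=1)                      # (B, D, H, W)
    disparities = torch.arange(D, device=cost.device)   # candidate disparities 0..D-1
    disparities = disparities.view(1, D, 1, 1).float()
    # Expected disparity = sum_d d * p(d); a differentiable "soft argmin".
    return (prob * disparities).sum(dim=1, keepdim=True)

# Example: a random cost volume for a 4-sample batch with 64 disparity levels.
disp = soft_argmin_disparity(torch.randn(4, 64, 32, 32))
print(disp.shape)  # torch.Size([4, 1, 32, 32])
```

The soft version is preferred during training because, unlike a hard argmin, it is differentiable and can produce sub-pixel disparities.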

To arrive at the optimal latent features for matching cost estimation, we can process the 4D volume with 3D CNNs, where we slide a 3D kernel instead of a 2D kernel across the $x$, $y$, and $z$ directions (corresponding to $H$, $W$, and $D$). It makes sense to use 3D CNNs here since the assumption of locality also applies to disparity (pixel displacements are inversely proportional to depth), and they can reduce the number of parameters while manipulating the disparity dimension. 3D CNNs are also utilized for volumetric medical images (the depth dimension) and videos (the time dimension), and are implemented in both PyTorch and TensorFlow.
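As a minimal sketch (layer sizes are assumptions, not the architecture of any specific model), a small 3D CNN that filters a $C \times D \times H \times W$ cost volume down to a single matching cost per disparity could be built with PyTorch's `nn.Conv3d`:

```python
import torch
import torch.nn as nn

class CostFilter3D(nn.Module):
    """Filter a 4D cost volume (B, C, D, H, W) into per-disparity costs (B, D, H, W)."""

    def __init__(self, channels: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            # 3x3x3 kernels slide over disparity, height, and width simultaneously.
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
            # Collapse the channel dimension to one matching cost per (d, h, w).
            nn.Conv3d(channels, 1, kernel_size=3, padding=1),
        )

    def forward(self, volume: torch.Tensor) -> torch.Tensor:
        return self.net(volume).squeeze(1)  # (B, D, H, W)

cost = CostFilter3D()(torch.randn(2, 32, 48, 64, 64))
print(cost.shape)  # torch.Size([2, 48, 64, 64])
```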

Recent advancements in stereo matching are primarily fueled by deep learning approaches that apply 3D CNNs to 4D cost volumes and use ConvGRU (a GRU with convolutions instead of dense layers) or other recurrent units to iteratively refine the disparity map using the cost volume and global context. There has also been some research incorporating ViTs into the workflow. However, these deep learning approaches have required fine-tuning for real-world applications and faced the problem of data collection, just like segmentation models and monocular depth estimation models.
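To make the ConvGRU idea concrete, here is a minimal sketch (assumed channel counts and naming, not the exact module from any paper) of a GRU cell whose dense layers are replaced by 2D convolutions, so the hidden state stays a spatial map that can drive iterative disparity refinement:

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """GRU cell with convolutions in place of dense layers; hidden state is (B, C, H, W)."""

    def __init__(self, hidden_ch: int, input_ch: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        in_ch = hidden_ch + input_ch
        self.conv_z = nn.Conv2d(in_ch, hidden_ch, kernel_size, padding=pad)  # update gate
        self.conv_r = nn.Conv2d(in_ch, hidden_ch, kernel_size, padding=pad)  # reset gate
        self.conv_q = nn.Conv2d(in_ch, hidden_ch, kernel_size, padding=pad)  # candidate state

    def forward(self, h: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.conv_z(hx))
        r = torch.sigmoid(self.conv_r(hx))
        q = torch.tanh(self.conv_q(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * q  # new hidden state

# Example: 8 refinement iterations over a 128-channel hidden state driven by 64-channel inputs
# (e.g., cost volume lookups and context features).
cell = ConvGRUCell(hidden_ch=128, input_ch=64)
h, x = torch.zeros(1, 128, 32, 32), torch.randn(1, 64, 32, 32)
for _ in range(8):
    h = cell(h, x)
print(h.shape)  # torch.Size([1, 128, 32, 32])
```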

FoundationStereo

FoundationStereo became the first stereo matching model capable of zero-shot stereo matching at a level comparable to, or even exceeding, previously fine-tuned approaches. This was accomplished by leveraging DepthAnythingV2 as a strong monocular prior, employing hybrid cost volume filtering using convolutions and transformers, iteratively refining results with ConvGRU, and utilizing model-in-the-loop data collection to generate high-fidelity, high-quality synthetic data. The following visualization illustrates the model architecture of FoundationStereo.

[Figure: FoundationStereo model architecture]

The side-tuning adapters (STAs) adapt monocular priors from the frozen DepthAnythingV2 to the stereo matching task by combining them with fine-grained, high-frequency features from multi-level CNNs. Specifically, an STA concatenates downsampled features from DepthAnythingV2 and feature maps of the same level to produce latent features $f_l$ and $f_r$ ($C \times \frac{H}{4} \times \frac{W}{4}$) for the left and right images. A CNN context network is also used to obtain context features $f_c$, which are fed into the ConvGRU. Once the left and right feature embeddings are obtained, 4D cost volumes ($C \times \frac{D}{4} \times \frac{H}{4} \times \frac{W}{4}$) are constructed for subsequent processing.
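A rough sketch of this fusion idea (channel sizes, resolutions, and the concat-plus-convolution design are illustrative assumptions, not FoundationStereo's exact STA):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SideTuningAdapter(nn.Module):
    """Fuse frozen monocular-prior features with fine-grained CNN features (sketch only)."""

    def __init__(self, mono_ch: int = 384, cnn_ch: int = 64, out_ch: int = 128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(mono_ch + cnn_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, mono_feat: torch.Tensor, cnn_feat: torch.Tensor) -> torch.Tensor:
        # Resize the frozen monocular features to the CNN feature resolution (H/4, W/4),
        # then concatenate along channels and fuse with a convolution.
        mono_feat = F.interpolate(mono_feat, size=cnn_feat.shape[-2:],
                                  mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([mono_feat, cnn_feat], dim=1))

# Example: ViT-style features at coarse resolution fused with CNN features at 1/4 resolution.
f_l = SideTuningAdapter()(torch.randn(1, 384, 32, 32), torch.randn(1, 64, 112, 112))
print(f_l.shape)  # torch.Size([1, 128, 112, 112])
```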

V_{gwc}(g, d, h, w) = \hat{f}_{l g}(h, w) \cdot \hat{f}_{r g}(h, w-d) \\
V_{cat}(d, h, w) = [\text{Conv}(f_l)(h, w), \text{Conv}(f_r)(h, w-d)] \\
V_c(d, h, w) = [V_{gwc}(d, h, w), V_{cat}(d, h, w)]

To introduce the new disparity dimension, $f_r$ is shifted along the width by every $d \in \{0, \dots, D\}$, and operations are performed on all $f_r(h, w-d)$. These operations include group-wise correlation, where features are divided into 8 groups and a dot product is applied to the normalized features $\hat{f}_l(h, w)$ and $\hat{f}_r(h, w-d)$ for each group and each disparity. Concatenation is also performed, where both $f_l(h, w)$ and $f_r(h, w-d)$ undergo convolution and are concatenated. The resulting cost volumes $V_{gwc}$ and $V_{cat}$ serve for conventional similarity measurement (dot product for matching cost computation) and preservation of monocular features, respectively. $V_{gwc}$ and $V_{cat}$ are then concatenated into $V_c$, which is processed by an attentive hybrid cost filtering (AHCF) module.
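A rough sketch of the group-wise correlation construction (tensor names, the group count, and shapes follow the description above, but the code itself is illustrative rather than the official implementation):

```python
import torch
import torch.nn.functional as F

def build_gwc_volume(f_l: torch.Tensor, f_r: torch.Tensor,
                     max_disp: int, groups: int = 8) -> torch.Tensor:
    """Group-wise correlation cost volume from left/right features (B, C, H, W).

    Returns a volume of shape (B, groups, max_disp, H, W).
    """
    B, C, H, W = f_l.shape
    # L2-normalize features so the dot product behaves like a cosine similarity.
    f_l = F.normalize(f_l, dim=1).view(B, groups, C // groups, H, W)
    f_r = F.normalize(f_r, dim=1).view(B, groups, C // groups, H, W)
    volume = f_l.new_zeros(B, groups, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[:, :, d] = (f_l * f_r).mean(dim=2)
        else:
            # Shift the right features by d along the width and correlate per group.
            volume[:, :, d, :, d:] = (f_l[..., d:] * f_r[..., :-d]).mean(dim=2)
    return volume

V_gwc = build_gwc_volume(torch.randn(1, 64, 56, 56), torch.randn(1, 64, 56, 56), max_disp=24)
print(V_gwc.shape)  # torch.Size([1, 8, 24, 56, 56])
```

The concatenation volume would be built analogously, stacking convolved left and shifted right features along the channel dimension instead of correlating them.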

As the name suggests, AHCF employs a hybrid approach, utilizing autoencoder-shaped 3D CNNs for multi-level cost filtering and transformer attention for long-range context cost filtering. Specifically, AHCF utilizes up- and downsampling separable 3D convolutions (depthwise separable convolutions were briefly discussed in relation to DAT) to decouple $K \times K \times K$ convolutions into $K \times K \times 1$ and $1 \times 1 \times K$ convolutions for analyzing space and disparity across multiple levels. It also uses FlashAttention (a hardware-aware and highly performant attention mechanism) as a disparity transformer (DT) by downsampling the cost volume into disparity tokens and upsampling the output back to the resolution of $V_c$ using trilinear interpolation.
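As a minimal sketch of the separable idea (assumed channel counts; not the exact AHCF block), a $K \times K \times K$ 3D convolution can be decoupled into a spatial convolution over height and width followed by a convolution along the disparity axis:

```python
import torch
import torch.nn as nn

class SeparableConv3d(nn.Module):
    """Decouple a KxKxK 3D conv into a spatial conv (over H, W) and a disparity conv (over D).

    The cost volume is laid out as (B, C, D, H, W), so "space" maps to the last two axes
    and "disparity" to the third; channel sizes here are illustrative.
    """

    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2))
        self.disparity = nn.Conv3d(out_ch, out_ch, kernel_size=(k, 1, 1),
                                   padding=(k // 2, 0, 0))

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        return self.disparity(self.spatial(v))

v = SeparableConv3d(32, 32)(torch.randn(1, 32, 48, 56, 56))
print(v.shape)  # torch.Size([1, 32, 48, 56, 56])
```

For a 3×3×3 kernel, this replaces 27 weights per filter with 9 + 3, which is where the parameter savings come from.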

A softmax function is applied to the filtered cost volume $V'_c$, and the result is multiplied by the corresponding $d$ to obtain an initial expected disparity $d_0$. This disparity map is then iteratively refined with the ConvGRU, using $V'_c$, $V_{corr}$, and $f_c$, until $d_K$. The model is trained on self-curated synthetic data (generated using a model-in-the-loop approach) with the loss function $L = |d_0 - \bar{d}|_{smooth} + \sum_k \gamma^{K-k}||d_k - \bar{d}||$, where $\bar{d}$ is the ground-truth disparity, $|\cdot|_{smooth}$ denotes the smooth L1 loss, and $\gamma$ is 0.9. This loss function supervises the iterative disparity map refinement with exponentially increasing weights for later iterations. By deliberately combining multiple techniques, most of which have appeared in other models, FoundationStereo established itself as the first highly performant zero-shot stereo matching algorithm. It can be deployed on arbitrary image pairs captured by stereo cameras (or two cameras rectified using OpenCV functions), outperforming conventional approaches, RGBD cameras, and DepthAnythingV2 fine-tuned for metric depth estimation.
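A hedged sketch of this training loss (variable names follow the formula above; the batching, shapes, and use of `smooth_l1_loss`/`l1_loss` are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def disparity_loss(d_init: torch.Tensor, d_iters: list[torch.Tensor],
                   d_gt: torch.Tensor, gamma: float = 0.9) -> torch.Tensor:
    """L = |d_0 - d_gt|_smooth + sum_k gamma^(K-k) * ||d_k - d_gt||.

    d_init:  initial disparity d_0 from the softmax over the filtered cost volume.
    d_iters: disparities d_1..d_K produced by the ConvGRU refinement iterations.
    """
    loss = F.smooth_l1_loss(d_init, d_gt)
    K = len(d_iters)
    for k, d_k in enumerate(d_iters, start=1):
        # Later iterations receive exponentially larger weights
        # (gamma^(K-k) grows toward 1 as k approaches K, since gamma < 1).
        loss = loss + (gamma ** (K - k)) * F.l1_loss(d_k, d_gt)
    return loss

d_gt = torch.rand(2, 1, 64, 64) * 100
loss = disparity_loss(torch.rand(2, 1, 64, 64),
                      [torch.rand(2, 1, 64, 64) for _ in range(4)], d_gt)
print(loss.item())
```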

Conclusion

This article has covered 3D CNNs and ConvGRU as conventional deep learning approaches to stereo matching, as well as FoundationStereo, which achieved highly performant zero-shot stereo matching. Owing to this accomplishment and its contribution to the field, the FoundationStereo paper was nominated as a CVPR 2025 best paper candidate (though it did not win the award). For more details on the techniques and the quantitative and qualitative results, we recommend checking out the original paper and GitHub project page cited below.

Resources