Road to ML Engineer #65 - Stereo Matching

Last Edited: 7/4/2025

The blog post introduces deep stereo matching in computer vision.


In the last article, we discussed how deep learning models achieved some level of success in monocular depth estimation, despite the limitations of single-view metrology, by utilizing a scale- and shift-invariant loss, DPT, semantic assistance, and a data engine for collecting a large volume of high-quality data. As a natural progression, this article discusses the deep learning approach to stereo matching, which builds on epipolar geometry: we match corresponding projections across two images and estimate depth from their disparity.
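As a quick refresher (a standard relation for a rectified stereo pair, not specific to any model discussed here), the disparity $d$ between a matched pixel pair is inversely proportional to its depth $Z$, given the focal length $f$ and the camera baseline $B$:

d = x_l - x_r = \frac{f B}{Z} \quad \Longleftrightarrow \quad Z = \frac{f B}{d}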

3D CNNs & ConvGRU

One prominent approach to deep stereo matching is to construct a 4D cost volume ($C \times D \times H \times W$), where $D$ represents disparity, the horizontal displacement between corresponding pixels, and $C$ represents channels, the latent features that we aggregate to arrive at the matching cost (how costly it is to match a pixel pair). After aggregating and collapsing the channel dimension ($D \times H \times W$), we can take the argmin (or a softmax-weighted average) over disparities to determine the disparity with the minimal matching cost for every pixel ($1 \times H \times W$) and obtain the disparity map, which is inversely proportional to depth.
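As an illustrative sketch (not from any particular paper; the tensor names and shapes are assumptions), the softmax-based "soft argmin" step that turns an aggregated $D \times H \times W$ cost volume into a disparity map might look like this in PyTorch:

```python
import torch
import torch.nn.functional as F

def soft_argmin_disparity(cost: torch.Tensor) -> torch.Tensor:
    """Regress a disparity map from an aggregated cost volume.

    cost: (B, D, H, W) matching costs, lower = better match.
    returns: (B, 1, H, W) expected disparity per pixel.
    """
    B, D, H, W = cost.shape
    # Convert costs to a probability distribution over disparities
    # (negate so that low cost -> high probability).
    prob = F.softmax(-cost, dim=1)                      # (B, D, H, W)
    disparities = torch.arange(D, device=cost.device)   # candidate disparities 0..D-1
    disparities = disparities.view(1, D, 1, 1).float()
    # Expected disparity = sum_d d * p(d); a differentiable "soft argmin".
    return (prob * disparities).sum(dim=1, keepdim=True)

# Example: a random cost volume for a 4-sample batch with 64 disparity levels.
disp = soft_argmin_disparity(torch.randn(4, 64, 32, 32))
print(disp.shape)  # torch.Size([4, 1, 32, 32])
```

The soft version is preferred during training because, unlike a hard argmin, it is differentiable and can produce sub-pixel disparities.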

To arrive at the optimal latent features for matching cost estimation, we can process the 4D volume with 3D CNNs, where we slide a 3D kernel instead of a 2D kernel across the $x$, $y$, and $z$ directions (corresponding to $H$, $W$, and $D$). It makes sense to use 3D CNNs here since the assumption of locality also applies to disparity (pixel displacements are inversely proportional to depth), and they can reduce the number of parameters while manipulating the disparity dimension. 3D CNNs are also utilized for volumetric medical images (the depth dimension) and videos (the time dimension), and are implemented in both PyTorch and TensorFlow.
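As a minimal sketch (layer sizes are assumptions, not the architecture of any specific model), a small 3D CNN that filters a $C \times D \times H \times W$ cost volume down to a single matching cost per disparity could be built with PyTorch's `nn.Conv3d`:

```python
import torch
import torch.nn as nn

class CostFilter3D(nn.Module):
    """Filter a 4D cost volume (B, C, D, H, W) into per-disparity costs (B, D, H, W)."""

    def __init__(self, channels: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            # 3x3x3 kernels slide over disparity, height, and width simultaneously.
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
            # Collapse the channel dimension to one matching cost per (d, h, w).
            nn.Conv3d(channels, 1, kernel_size=3, padding=1),
        )

    def forward(self, volume: torch.Tensor) -> torch.Tensor:
        return self.net(volume).squeeze(1)  # (B, D, H, W)

cost = CostFilter3D()(torch.randn(2, 32, 48, 64, 64))
print(cost.shape)  # torch.Size([2, 48, 64, 64])
```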

Recent advancements in stereo matching are primarily fueled by deep learning approaches that apply 3D CNNs to 4D cost volumes and use ConvGRU (a GRU with convolutions instead of dense layers) or other recurrent units to iteratively refine the disparity map using the cost volume and global context. There has also been some research incorporating ViTs into the workflow. However, these deep learning approaches have required fine-tuning for real-world applications and faced the problem of data collection, just like segmentation models and monocular depth estimation models.
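To make the ConvGRU idea concrete, here is a minimal sketch (assumed channel counts and naming, not the exact module from any paper) of a GRU cell whose dense layers are replaced by 2D convolutions, so the hidden state stays a spatial map that can drive iterative disparity refinement:

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """GRU cell with convolutions in place of dense layers; hidden state is (B, C, H, W)."""

    def __init__(self, hidden_ch: int, input_ch: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        in_ch = hidden_ch + input_ch
        self.conv_z = nn.Conv2d(in_ch, hidden_ch, kernel_size, padding=pad)  # update gate
        self.conv_r = nn.Conv2d(in_ch, hidden_ch, kernel_size, padding=pad)  # reset gate
        self.conv_q = nn.Conv2d(in_ch, hidden_ch, kernel_size, padding=pad)  # candidate state

    def forward(self, h: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.conv_z(hx))
        r = torch.sigmoid(self.conv_r(hx))
        q = torch.tanh(self.conv_q(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * q  # new hidden state

# Example: 8 refinement iterations over a 128-channel hidden state driven by 64-channel inputs
# (e.g., cost volume lookups and context features).
cell = ConvGRUCell(hidden_ch=128, input_ch=64)
h, x = torch.zeros(1, 128, 32, 32), torch.randn(1, 64, 32, 32)
for _ in range(8):
    h = cell(h, x)
print(h.shape)  # torch.Size([1, 128, 32, 32])
```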

FoundationStereo

FoundationStereo became the first stereo matching model capable of zero-shot stereo matching at a level comparable to, or even exceeding, previously fine-tuned approaches. This was accomplished by leveraging DepthAnythingV2 as a strong monocular prior, employing hybrid cost volume filtering using convolutions and transformers, iteratively refining results with ConvGRU, and utilizing model-in-the-loop data collection to generate high-fidelity, high-quality synthetic data. The following visualization illustrates the model architecture of FoundationStereo.

[Figure: FoundationStereo model architecture]

The side-tuning adapters (STAs) adapt monocular priors from the frozen DepthAnythingV2 to the stereo matching task by combining them with fine-grained, high-frequency features from multi-level CNNs. Specifically, an STA concatenates downsampled features from DepthAnythingV2 and feature maps of the same level to produce latent features $f_l$ and $f_r$ ($C \times \frac{H}{4} \times \frac{W}{4}$) for the left and right images. A CNN context network is also used to obtain context features $f_c$, which are fed into the ConvGRU. Once the left and right feature embeddings are obtained, 4D cost volumes ($C \times \frac{D}{4} \times \frac{H}{4} \times \frac{W}{4}$) are constructed for subsequent processing.
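A rough sketch of this fusion idea (channel sizes, resolutions, and the concat-plus-convolution design are illustrative assumptions, not FoundationStereo's exact STA):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SideTuningAdapter(nn.Module):
    """Fuse frozen monocular-prior features with fine-grained CNN features (sketch only)."""

    def __init__(self, mono_ch: int = 384, cnn_ch: int = 64, out_ch: int = 128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(mono_ch + cnn_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, mono_feat: torch.Tensor, cnn_feat: torch.Tensor) -> torch.Tensor:
        # Resize the frozen monocular features to the CNN feature resolution (H/4, W/4),
        # then concatenate along channels and fuse with a convolution.
        mono_feat = F.interpolate(mono_feat, size=cnn_feat.shape[-2:],
                                  mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([mono_feat, cnn_feat], dim=1))

# Example: ViT-style features at coarse resolution fused with CNN features at 1/4 resolution.
f_l = SideTuningAdapter()(torch.randn(1, 384, 32, 32), torch.randn(1, 64, 112, 112))
print(f_l.shape)  # torch.Size([1, 128, 112, 112])
```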

V_{gwc}(g, d, h, w) = \hat{f}_{l g}(h, w) \cdot \hat{f}_{r g}(h, w-d) \\
V_{cat}(d, h, w) = [\text{Conv}(f_l)(h, w), \text{Conv}(f_r)(h, w-d)] \\
V_c(d, h, w) = [V_{gwc}(d, h, w), V_{cat}(d, h, w)]

To introduce the new disparity dimension, $f_r$ is shifted along the width by every $d \in \{0, \dots, D\}$, and operations are performed on all $f_r(h, w-d)$. These operations include group-wise correlation, where features are divided into 8 groups and a dot product is applied to the normalized features $\hat{f}_l(h, w)$ and $\hat{f}_r(h, w-d)$ for each group and each disparity. Concatenation is also performed, where both $f_l(h, w)$ and $f_r(h, w-d)$ undergo convolution and are concatenated. The resulting cost volumes $V_{gwc}$ and $V_{cat}$ serve for conventional similarity measurement (dot product for matching cost computation) and preservation of monocular features, respectively. $V_{gwc}$ and $V_{cat}$ are then concatenated into $V_c$, which is processed by an attentive hybrid cost filtering (AHCF) module.
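A rough sketch of the group-wise correlation construction (tensor names, the group count, and shapes follow the description above, but the code itself is illustrative rather than the official implementation):

```python
import torch
import torch.nn.functional as F

def build_gwc_volume(f_l: torch.Tensor, f_r: torch.Tensor,
                     max_disp: int, groups: int = 8) -> torch.Tensor:
    """Group-wise correlation cost volume from left/right features (B, C, H, W).

    Returns a volume of shape (B, groups, max_disp, H, W).
    """
    B, C, H, W = f_l.shape
    # L2-normalize features so the dot product behaves like a cosine similarity.
    f_l = F.normalize(f_l, dim=1).view(B, groups, C // groups, H, W)
    f_r = F.normalize(f_r, dim=1).view(B, groups, C // groups, H, W)
    volume = f_l.new_zeros(B, groups, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[:, :, d] = (f_l * f_r).mean(dim=2)
        else:
            # Shift the right features by d along the width and correlate per group.
            volume[:, :, d, :, d:] = (f_l[..., d:] * f_r[..., :-d]).mean(dim=2)
    return volume

V_gwc = build_gwc_volume(torch.randn(1, 64, 56, 56), torch.randn(1, 64, 56, 56), max_disp=24)
print(V_gwc.shape)  # torch.Size([1, 8, 24, 56, 56])
```

The concatenation volume would be built analogously, stacking convolved left and shifted right features along the channel dimension instead of correlating them.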

As the name suggests, AHCF employs a hybrid approach, utilizing autoencoder-shaped 3D CNNs for multi-level cost filtering and transformer attention for long-range context cost filtering. Specifically, AHCF utilizes up- and downsampling separable 3D convolutions (depthwise separable convolutions were briefly discussed in relation to DAT) to decouple $K \times K \times K$ convolutions into $K \times K \times 1$ and $1 \times 1 \times K$ convolutions for analyzing space and disparity across multiple levels. It also uses FlashAttention (a hardware-aware and highly performant attention mechanism) as a disparity transformer (DT) by downsampling the cost volume into disparity tokens and upsampling the output back to the resolution of $V_c$ using trilinear interpolation.
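As a minimal sketch of the separable idea (assumed channel counts; not the exact AHCF block), a $K \times K \times K$ 3D convolution can be decoupled into a spatial convolution over height and width followed by a convolution along the disparity axis:

```python
import torch
import torch.nn as nn

class SeparableConv3d(nn.Module):
    """Decouple a KxKxK 3D conv into a spatial conv (over H, W) and a disparity conv (over D).

    The cost volume is laid out as (B, C, D, H, W), so "space" maps to the last two axes
    and "disparity" to the third; channel sizes here are illustrative.
    """

    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2))
        self.disparity = nn.Conv3d(out_ch, out_ch, kernel_size=(k, 1, 1),
                                   padding=(k // 2, 0, 0))

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        return self.disparity(self.spatial(v))

v = SeparableConv3d(32, 32)(torch.randn(1, 32, 48, 56, 56))
print(v.shape)  # torch.Size([1, 32, 48, 56, 56])
```

For a 3×3×3 kernel, this replaces 27 weights per filter with 9 + 3, which is where the parameter savings come from.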

A softmax function is applied to the filtered cost volume $V'_c$, and the result is multiplied by the corresponding $d$ to obtain an initial expected disparity $d_0$. This disparity map is then iteratively refined with the ConvGRU, using $V'_c$, $V_{corr}$, and $f_c$, until $d_K$. The model is trained on self-curated synthetic data (generated using a model-in-the-loop approach) with the loss function $L = |d_0 - \bar{d}|_{smooth} + \sum_k \gamma^{K-k}||d_k - \bar{d}||$, where $\bar{d}$ is the ground-truth disparity, $|\cdot|_{smooth}$ denotes the smooth L1 loss, and $\gamma$ is 0.9. This loss function supervises the iterative disparity map refinement with exponentially increasing weights for later iterations. By deliberately combining multiple techniques, most of which have appeared in other models, FoundationStereo established itself as the first highly performant zero-shot stereo matching algorithm. It can be deployed on arbitrary image pairs captured by stereo cameras (or two cameras rectified using OpenCV functions), outperforming conventional approaches, RGBD cameras, and DepthAnythingV2 fine-tuned for metric depth estimation.
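A hedged sketch of this training loss (variable names follow the formula above; the batching, shapes, and use of `smooth_l1_loss`/`l1_loss` are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def disparity_loss(d_init: torch.Tensor, d_iters: list[torch.Tensor],
                   d_gt: torch.Tensor, gamma: float = 0.9) -> torch.Tensor:
    """L = |d_0 - d_gt|_smooth + sum_k gamma^(K-k) * ||d_k - d_gt||.

    d_init:  initial disparity d_0 from the softmax over the filtered cost volume.
    d_iters: disparities d_1..d_K produced by the ConvGRU refinement iterations.
    """
    loss = F.smooth_l1_loss(d_init, d_gt)
    K = len(d_iters)
    for k, d_k in enumerate(d_iters, start=1):
        # Later iterations receive exponentially larger weights
        # (gamma^(K-k) grows toward 1 as k approaches K, since gamma < 1).
        loss = loss + (gamma ** (K - k)) * F.l1_loss(d_k, d_gt)
    return loss

d_gt = torch.rand(2, 1, 64, 64) * 100
loss = disparity_loss(torch.rand(2, 1, 64, 64),
                      [torch.rand(2, 1, 64, 64) for _ in range(4)], d_gt)
print(loss.item())
```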

Conclusion

This article has covered 3D CNNs and ConvGRU as conventional deep learning approaches to stereo matching, as well as FoundationStereo, which achieved highly performant zero-shot stereo matching. Owing to this accomplishment and its contribution to the field, the FoundationStereo paper was nominated as a CVPR 2025 best paper candidate (though it did not win the award). For more details on the techniques and the quantitative and qualitative results, we recommend checking out the original paper and GitHub project page cited below.

Resources