Road to ML Engineer #63 - Active & Volumetric Stereo

So far, we've covered epipolar geometry and SfM, and discovered that almost all processes for them face difficulties and limitations related to the correspondence problem. In this article, we will discuss alternative vision systems that aim to overcome these issues and further improve our understanding of the 3D world.

Active Stereo

Active stereo aims to simplify the correspondence problem by replacing one of the cameras in the stereo system with a projector. With this setup, the correspondence for a point on the projector's virtual image is easy to identify: we only need to locate where the projected light is reflected by the scene and lands in the image captured by the remaining camera. We can simplify the problem further by calibrating or rectifying the projector and the camera so that their image planes are parallel. In this parallel setup, we can project vertical lines across the 3D scene and find the corresponding points in the camera image by taking the intersection of the projected line and the epipolar line.
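
As a rough illustration of the parallel setup, the sketch below triangulates depth from the horizontal offset between the projector column where a line was emitted and the camera column where it was observed. It treats the projector as an inverse camera and assumes a shared focal length in pixels and a known baseline, neither of which is specified in this article. The function name and its parameters are hypothetical.

```python
import numpy as np

def depth_from_disparity(x_cam, x_proj, focal_px, baseline_m):
    """Depth for a rectified projector-camera pair (illustrative sketch).

    x_cam:      column(s) where the projected line is seen in the camera image
    x_proj:     column(s) of the same line in the projector's virtual image
    focal_px:   assumed shared focal length, in pixels
    baseline_m: assumed distance between projector and camera, in meters
    """
    disparity = np.abs(np.asarray(x_cam, dtype=float)
                       - np.asarray(x_proj, dtype=float))
    # Zero disparity corresponds to a point at infinity, so let it become inf.
    with np.errstate(divide="ignore"):
        return focal_px * baseline_m / disparity

# Two points on the same projected line, observed at different camera columns.
depths = depth_from_disparity([420.0, 410.0], [400.0, 400.0],
                              focal_px=800.0, baseline_m=0.1)
```

A larger offset between the two columns means the point is closer, which is the same relationship as in ordinary rectified stereo.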

This setup can produce very accurate results when properly calibrated, but it is expensive and slow, since sweeping a vertical line across the entire object takes time. Because it is slow, it cannot capture deformations of an object's shape in real time. Instead, we can project multiple vertical lines with a known color pattern all at once, allowing us to find all the correspondences and capture the 3D scene in real time. Many modern depth sensors, including the original version of Microsoft Kinect, build on this idea, using infrared laser projectors and sensors to project a known pattern that is invisible to the eye and largely unaffected by visible ambient light, although strong infrared sources such as direct sunlight can still degrade it.
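
To see why a known color pattern resolves the correspondences in a single shot, consider the minimal sketch below. It assumes a hypothetical 12-stripe pattern over three colors in which every run of three neighboring stripes occurs only once, so observing any three adjacent stripes in the camera image identifies which projector stripes they are. The pattern, the window size, and the function names are illustrative assumptions, not the scheme used by any particular sensor.

```python
# Hypothetical stripe pattern: R, G, B stand for red, green, and blue stripes.
PATTERN = "RGBRBGRGRBRG"

def build_window_index(pattern, window=3):
    """Map each run of `window` neighboring stripes to its starting stripe."""
    index = {}
    for start in range(len(pattern) - window + 1):
        key = pattern[start:start + window]
        if key in index:
            # A repeated window would make the correspondence ambiguous.
            raise ValueError(f"window {key!r} repeats; pattern is ambiguous")
        index[key] = start
    return index

def locate_stripe(observed, index, window=3):
    """Return the projector index of the first observed stripe, or None."""
    return index.get(observed[:window])

index = build_window_index(PATTERN)
first_stripe = locate_stripe("BGR", index)  # stripes 4, 5, 6 of the pattern
```

Real patterns are much longer and must tolerate noise and occluded stripes, so this only conveys the lookup idea.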

Space & Shadow Carving

Another way of mitigating the correspondence problem is to avoid using corresponding projections for 3D reconstruction. If our aim is simply to capture the shape of a 3D object, we can assume that the object has a limited and known volume and apply a volumetric stereo approach. One volumetric stereo approach is space carving, where we estimate the shape of a 3D object in such a way that it is consistent with the object's silhouette in the image plane. Specifically, we can divide the limited volume into voxels in a voxel grid and eliminate those voxels that are inconsistent with the object's silhouette to carve the space.
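
The carving step can be sketched as follows, assuming calibrated cameras with known 3x4 projection matrices and one binary silhouette mask per view. These inputs, the function name, and the choice to discard voxels that project outside an image are assumptions made for illustration.

```python
import numpy as np

def carve_voxels(voxel_centers, projections, silhouettes):
    """Keep only voxels that project inside every silhouette (sketch).

    voxel_centers: (N, 3) array of voxel center coordinates
    projections:   list of 3x4 camera projection matrices
    silhouettes:   list of 2D boolean masks, True where the object appears
    Returns a boolean array of length N, True for voxels that survive.
    """
    n = len(voxel_centers)
    homogeneous = np.hstack([voxel_centers, np.ones((n, 1))])
    keep = np.ones(n, dtype=bool)
    for P, mask in zip(projections, silhouettes):
        uvw = homogeneous @ P.T
        in_front = uvw[:, 2] > 0
        # Avoid dividing by zero or negative depth for points behind the camera.
        w = np.where(in_front, uvw[:, 2], 1.0)
        u = np.round(uvw[:, 0] / w).astype(int)
        v = np.round(uvw[:, 1] / w).astype(int)
        height, width = mask.shape
        inside = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
        hit = np.zeros(n, dtype=bool)
        hit[inside] = mask[v[inside], u[inside]]
        # A voxel outside any one silhouette is inconsistent and gets carved.
        keep &= hit
    return keep
```

Each additional view can only remove voxels, so the result is an upper bound on the object's shape that tightens as more cameras are added.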

Since space carving utilizes the object's silhouette, which can be easily obtained using a green screen, it appears more straightforward. However, it faces its own limitations. For example, reducing the size of the voxels for higher quality increases the number of voxels cubically, leading to a significant increase in processing time. Its quality also depends on the number of cameras, the consistency of the silhouette, and the shape of the object. For example, using only two cameras on a slightly moving object with concavities would result in poor 3D reconstruction.
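
The cubic growth is easy to check with a little arithmetic. The snippet below assumes a cubic bounding volume and measures lengths in whole millimeters to keep the division exact; the function name and the example sizes are hypothetical.

```python
def voxel_count(side_mm, voxel_mm):
    """Number of voxels needed to fill a cube of the given side length."""
    per_axis = -(-side_mm // voxel_mm)  # integer ceiling division
    return per_axis ** 3

coarse = voxel_count(1000, 10)  # 100 voxels per axis
fine = voxel_count(1000, 5)     # 200 voxels per axis
ratio = fine // coarse          # halving the voxel size gives 8x the voxels
```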

To address the limitation around concavities, which is arguably the most critical issue, we can use shadow carving. Here we introduce several light sources around the camera so that the object casts self-shadows, and we eliminate the voxels that fall within the visual cone of those shadows, since they likely belong to concavities. This method can refine the estimate provided by space carving, though it doesn't work well on highly reflective or very dark (low-albedo) surfaces, where self-shadows are hard to detect reliably.

Voxel Coloring

Another volumetric stereo approach is voxel coloring, where we assume a Lambertian object (the perceived luminance, and therefore color, of any part of the object doesn't change with the viewpoint's location or pose) and color the voxels from multiple views, keeping a voxel only if the views that see it agree on its color. This approach is beneficial because it can capture both shape and color (texture) simultaneously. However, it has two drawbacks: it breaks down on non-Lambertian objects, where color consistency cannot be checked reliably, and its solution is ambiguous, since more than one coloring of the voxels can be consistent with the same set of images.

To reduce the ambiguity, we can process the voxel grid incrementally, layer by layer, starting from the outer layer closest to the cameras, and perform a color consistency check at every layer, which allows us to introduce visibility constraints. If a voxel isn't visible to at least two cameras, we can assume that it is occluded and not part of the object; if it is visible, we keep it only when those cameras see the same color. However, voxel coloring still faces limitations with non-Lambertian objects and the cubic scaling of the number of voxels.
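
The per-voxel test can be sketched as below. It assumes we have already gathered the RGB values that the unoccluded cameras observe for one voxel, and it uses the per-channel standard deviation with an arbitrary threshold as the consistency measure. The measure, the threshold, and the function name are illustrative choices rather than values taken from the course notes.

```python
import numpy as np

def check_voxel(observed_colors, threshold=10.0, min_views=2):
    """Decide whether a voxel is kept and, if so, which color it receives.

    observed_colors: RGB triples seen by the cameras with a clear view
    threshold:       assumed maximum per-channel standard deviation
    min_views:       minimum number of cameras that must see the voxel
    Returns (keep, color); color is None when the voxel is rejected.
    """
    colors = np.asarray(observed_colors, dtype=float).reshape(-1, 3)
    if len(colors) < min_views:
        # Too few views: treat the voxel as occluded.
        return False, None
    spread = colors.std(axis=0).max()
    if spread > threshold:
        # Views disagree, so under the Lambertian assumption it is not a surface.
        return False, None
    return True, colors.mean(axis=0)

kept, color = check_voxel([[200, 10, 10], [204, 12, 8]])
rejected, _ = check_voxel([[200, 10, 10], [10, 200, 10]])
```

In a full implementation, the voxels kept in one layer would also mark the pixels they cover as used, so that deeper layers are not compared against pixels that an earlier voxel already explains.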

Conclusions

In this article, we've covered active stereo, which simplifies the correspondence problem with a projector and enables accurate 3D reconstruction and depth sensing, and volumetric stereo, which avoids corresponding points altogether and focuses on obtaining high-quality 3D reconstruction by carving a voxel grid with silhouettes and shadows or by coloring its voxels. Now that we've covered the basics of monocular and stereo vision systems, along with some alternative approaches for deducing information about 3D scenes and objects, the next article will start discussing how we can learn the best representations and perform computer vision tasks on them.

Resources

Hata, K. & Savarese, S. 2025. CS231A Course Notes 5: Active and Volumetric Stereo. Stanford.
Savarese, S. & Bohg, J. 2025. Lecture 8 Active stereo & Volumetric stereo. Stanford.