The blog post introduces structure from motion (SfM) in computer vision.

In the previous article, we discussed how having two perspectives helps us understand the 3D world via epipolar geometry, and how that setup still faces difficulties and limitations. As a way of alleviating those issues, we can increase the number of perspectives (by introducing more cameras or taking a video with a moving camera) to understand the 3D scene and estimate the camera parameters (much like humans do), through a technique called structure from motion (SfM). In this article, we will discuss the basics of SfM and how having multiple perspectives enhances understanding of the 3D world.
Affine Structure from Motion
To get started, we can formally define the structure from motion problem. In the general setup shown below, we have $m$ cameras with camera projection matrices $M_1, \dots, M_m$ and $n$ 3D points $X_1, \dots, X_n$ in the scene, assumed to be visible to all the cameras. The aim is to recover both the structure of the 3D scene (the points $X_j$) and the camera motion (the camera projections $M_i$) from all the corresponding projections $x_{ij} = M_i X_j$. To understand the approach to the SfM problem, we can start by tackling the simpler version of the problem, where cameras perform affine transformations instead of projective transformations (affine cameras).

The simplification allows the affine camera transformation from $X_j$ to $x_{ij}$ to be expressed simply using Euclidean coordinates as $x_{ij} = A_i X_j + b_i$, where $A_i$ is a $2 \times 3$ matrix and $b_i$ is a 2D vector. Hence, the numbers of unknowns for $m$ cameras and $n$ points are $8m$ and $3n$, respectively, and we need to estimate them with $2mn$ equations from $mn$ observations. We can use this relationship between the number of unknowns and equations ($2mn \geq 8m + 3n$) to determine whether we have enough observations or corresponding projections.
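To make the affine model and the counting argument concrete, here is a minimal sketch (assuming NumPy; the function names and example values are made up for illustration) that projects points with a hypothetical affine camera and checks whether $2mn \geq 8m + 3n$ holds.

```python
import numpy as np

def affine_project(A, b, X):
    """Project 3D points X (n x 3) with an affine camera: x = A X + b."""
    return X @ A.T + b  # (n x 2) projections

def has_enough_observations(m, n):
    """Check 2mn >= 8m + 3n for m affine cameras and n points."""
    return 2 * m * n >= 8 * m + 3 * n

# Hypothetical example: 3 cameras observing 10 points.
A = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.2]])       # 2x3 affine camera matrix
b = np.array([0.1, -0.3])             # 2D offset
X = np.random.rand(10, 3)             # 10 scene points
x = affine_project(A, b, X)           # 10 observed 2D projections

print(x.shape)                         # (10, 2)
print(has_enough_observations(3, 10))  # True: 60 >= 54
```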
Tomasi-Kanade Factorization Method
From a sufficient number of observations, we can deduce information about structure and motion using the Tomasi-Kanade factorization method, which utilizes normalization and factorization. The method first determines the centroid of the projections for each camera as $\bar{x}_i = \frac{1}{n}\sum_{j=1}^{n} x_{ij}$ and normalizes the projections using $\hat{x}_{ij} = x_{ij} - \bar{x}_i$, which can be further derived as follows.

$$\hat{x}_{ij} = x_{ij} - \bar{x}_i = (A_i X_j + b_i) - \frac{1}{n}\sum_{k=1}^{n}(A_i X_k + b_i) = A_i (X_j - \bar{X})$$
Here, $\bar{X} = \frac{1}{n}\sum_{j=1}^{n} X_j$ is the centroid of the 3D points. If we define $\bar{X}$ to be the center of the world reference system, then we can set $\bar{X} = 0$ and simplify the relationship between the projections and structure and motion as $\hat{x}_{ij} = A_i X_j$. Then, we can define a $2m \times n$ measurement matrix $D$ that contains the set of normalized projections (it's $2m$ because each projection is a 2D vector) and rewrite the relationship as $D = MS$, where $M$ ($2m \times 3$) contains all the $A_i$ as rows and $S$ ($3 \times n$) contains all the $X_j$ as columns.
Since $D$ is expressed as the product of a $2m \times 3$ matrix and a $3 \times n$ matrix, it has rank 3 and can be used to estimate structure and motion $M$ and $S$ by performing a rank-3 approximation with SVD, $D = U \Sigma V^T$ (where $M = U_3\sqrt{\Sigma_3}$ and $S = \sqrt{\Sigma_3} V_3^T$, keeping only the three largest singular values and their vectors). However, the estimation has affine ambiguity, since any invertible affine transformation $H$ (rotation, translation, scaling, and shearing) can be applied like $D = (MH)(H^{-1}S)$ and will still result in the same $D$.
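Below is a minimal NumPy sketch of the factorization step under the assumptions above (all points visible to all cameras); the function name and the square-root split of the singular values between motion and structure are illustrative choices, not the only possible convention.

```python
import numpy as np

def tomasi_kanade_factorize(x):
    """
    x: observations of shape (m, n, 2) -- n points seen by m affine cameras.
    Returns motion M (2m x 3) and structure S (3 x n), up to an affine ambiguity.
    """
    m, n, _ = x.shape
    # Normalization: subtract each camera's centroid of projections.
    x_hat = x - x.mean(axis=1, keepdims=True)
    # Measurement matrix D (2m x n): stack the centered 2D projections per camera.
    D = x_hat.transpose(0, 2, 1).reshape(2 * m, n)
    # Rank-3 approximation via SVD.
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    U3, s3, Vt3 = U[:, :3], s[:3], Vt[:3, :]
    M = U3 * np.sqrt(s3)             # (2m x 3) motion matrix
    S = np.sqrt(s3)[:, None] * Vt3   # (3 x n) structure matrix
    return M, S

# Affine ambiguity: for any invertible 3x3 H, (M @ H) @ (inv(H) @ S) equals M @ S.
```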
Perspective Structure from Motion
In the real-world scenario where cameras perform projective transformations, we can perform factorization using $x_{ij} = M_i X_j$ in homogeneous coordinates, based on enough observations to solve for the $11m + 3n$ unknowns (since each $M_i$ has 11 unknowns up to scale) with $2mn$ equations. However, similarly to affine SfM, any invertible projective transformation $H$ can be applied to $M_i$ and $X_j$ (as $M_i H^{-1}$ and $H X_j$), meaning the estimations are inherently subject to perspective ambiguity.
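As a quick numerical illustration of this perspective ambiguity (a sketch assuming NumPy and arbitrary made-up values), applying any invertible $4 \times 4$ transform $H$ to the structure and its inverse to the camera leaves the projection unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 4))            # a projective camera (defined up to scale)
X = np.append(rng.normal(size=3), 1)   # a 3D point in homogeneous coordinates
H = rng.normal(size=(4, 4))            # an arbitrary invertible 4x4 transform

x_original = M @ X                                 # projection x = M X
x_transformed = (M @ np.linalg.inv(H)) @ (H @ X)   # (M H^-1)(H X)

print(np.allclose(x_original, x_transformed))      # True: same projection
```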
Although still subject to perspective ambiguity, we can also estimate $M_i$ and $X_j$ from enough observations using an algebraic approach based on epipolar geometry (if you are unsure about epipolar geometry, check out the article Road to ML Engineer #61 - Epipolar Geometry). First, we can express the two cameras as $M_1 = [I \;\; 0]$ and $M_2 = [A \;\; b]$, and a 3D point in homogeneous coordinates as $X = (\tilde{X}, 1)$, to account for perspective ambiguity (fixing the first camera as the world reference). Then, we can establish the relationships between the projections, structures, and motions as $x_1 \cong M_1 X = \tilde{X}$ and $x_2 \cong M_2 X = A\tilde{X} + b$.
Here, we can further derive and obtain $x_2 \cong A x_1 + b$, which expresses the relationship between the corresponding projections. Since $A x_1$ and $b$ both lie on the epipolar plane, we can take the cross product between them, $b \times A x_1$, which yields a vector normal to the epipolar plane. Here, $b \times A x_1$ is normal to $x_2$ (which also lies on the epipolar plane), so we can establish that $x_2^T (b \times A x_1) = 0$. Using the matrix multiplication expression of the cross product, we arrive at $x_2^T [b]_\times A x_1 = 0$, resulting in $F = [b]_\times A$ from $x_2^T F x_1 = 0$.
From the above, we can get $A$ and $b$ (and hence $M_2$) from an estimate of $F$. Furthermore, since $b$ lies on the epipolar plane, $F^T b = A^T [b]_\times^T b = 0$, meaning that $b$ is the epipole $e_2$ and can be recovered as the null vector of $F^T$ (with $A$ taken as $[e_2]_\times F$). We can obtain the estimate of $F$ using the eight-point algorithm we covered in the previous article, which allows us to obtain the estimate of $M_2 = [\,[e_2]_\times F \;\; e_2\,]$. Finally, we can perform this with all the other cameras to get the estimates of all the $M_i$, which can then be used to estimate the points $X_j$ using triangulation.
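The sketch below (assuming NumPy, an $F$ already estimated with the eight-point algorithm, and the hypothetical helper name camera_from_fundamental) recovers the epipole as the null vector of $F^T$ and assembles $M_2$ as described; the points $X_j$ would then follow from triangulating each correspondence with $M_1$ and $M_2$.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x such that skew(v) @ u equals np.cross(v, u)."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def camera_from_fundamental(F):
    """Given F (with x2^T F x1 = 0), return M1 = [I 0] and M2 = [[e2]_x F | e2]."""
    # e2 is the null vector of F^T (smallest singular value direction).
    _, _, Vt = np.linalg.svd(F.T)
    e2 = Vt[-1]
    A = skew(e2) @ F
    M1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    M2 = np.hstack([A, e2[:, None]])
    return M1, M2
```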
Bundle Adjustment & Self-Calibration
In the previous section, we saw how we can use the factorization approach with SVD and the algebraic approach using $F$ and triangulation in the general case to get estimates of structure and motion with perspective ambiguity. However, both are subject to inherent limitations. The factorization approach assumes all points are visible to all cameras, which is often not true due to occlusions, and the algebraic approach can only process pairs of camera perspectives and cannot jointly optimize the estimates across all cameras.
To address these limitations, we often use bundle adjustment, where we apply a non-linear least-squares method (such as Levenberg-Marquardt) to refine the estimates (after factorization and/or the algebraic approach) by minimizing the reprojection error $E(M, X) = \sum_{i=1}^{m}\sum_{j=1}^{n} D(x_{ij}, M_i X_j)^2$, where $D$ is the distance between the observed and predicted projections. This method allows us to arrive at better estimates as more views are taken into account (which reduces the impact of each occlusion and error to some extent) during optimization. Even after bundle adjustment, however, we are still hindered by the quality of estimates due to the difficulties and limitations of the correspondence problem and perspective ambiguity.
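As a hedged sketch of this refinement step (assuming SciPy's least_squares, raw camera-matrix and point parameters, and made-up helper names; real pipelines use more careful parameterizations, sparsity, and robust losses), bundle adjustment can be phrased as minimizing the stacked reprojection residuals over only the projections that are actually observed, which is how it copes with occlusions.

```python
import numpy as np
from scipy.optimize import least_squares

def reprojection_residuals(params, m, n, observations):
    """
    params: flattened [all M_i (3x4 each), then all X_j (3 each)].
    observations: rows of (i, j, u, v), one per visible projection.
    """
    Ms = params[:12 * m].reshape(m, 3, 4)
    Xs = params[12 * m:].reshape(n, 3)
    residuals = []
    for i, j, u, v in observations:
        i, j = int(i), int(j)
        X_h = np.append(Xs[j], 1.0)      # homogeneous 3D point
        x = Ms[i] @ X_h
        x = x[:2] / x[2]                 # perspective division
        residuals.extend([x[0] - u, x[1] - v])
    return np.array(residuals)

def bundle_adjust(M_init, X_init, observations):
    """Refine initial cameras (m, 3, 4) and points (n, 3) by non-linear least squares."""
    m, n = M_init.shape[0], X_init.shape[0]
    x0 = np.concatenate([M_init.ravel(), X_init.ravel()])
    result = least_squares(reprojection_residuals, x0, args=(m, n, observations))
    return result.x[:12 * m].reshape(m, 3, 4), result.x[12 * m:].reshape(n, 3)
```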
We can address the ambiguity to some extent through self-calibration, which can be achieved using single-view metrology constraints (such as horizon lines and vanishing points) and other approaches. Since self-calibration allows us to reduce ambiguity in the camera projection matrix $M_i$, we can reduce perspective (or affine) ambiguity to similarity ambiguity (rotation, translation, and scaling) during bundle adjustment. This enables a clearer understanding of the shape of the 3D objects and the 3D scene (useful for 3D reconstruction). However, the presence of similarity ambiguity even with an intrinsically calibrated camera reveals an inherent limitation: knowing the absolute scales and positions of objects from any number of images is simply impossible.
We may find the relative depths of objects with triangulation and disparity, as we have discussed in this and previous articles, but we simply cannot find their absolute scales and positions without making further assumptions (an educated guess of the object size) or collecting more data (calibration using points with known positions in the world reference system). Our vision is most likely running all of these processes (solving structure from motion and disambiguating the object scales to some extent with educated guesses and calibrations using reference points like our hands) to estimate the absolute scales and positions.
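To see the scale part of this ambiguity numerically (a minimal sketch assuming NumPy, a calibrated two-camera setup, and arbitrary made-up values), scaling the entire scene and the baseline by the same factor leaves every image projection unchanged, so no number of images can pin down the absolute scale on its own.

```python
import numpy as np

K = np.array([[800.0, 0, 320], [0, 800, 240], [0, 0, 1]])  # intrinsics
R = np.eye(3)                      # second camera rotation
t = np.array([0.5, 0.0, 0.0])      # baseline (translation)
X = np.array([1.0, -0.5, 4.0])     # a 3D point

def project(K, R, t, X):
    """Pinhole projection of a Euclidean 3D point into pixel coordinates."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]

s = 3.0  # unknown global scale
print(np.allclose(project(K, R, t, X), project(K, R, s * t, s * X)))  # True
```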
Conclusion
In this article, we introduced affine and perspective SfM with multiple views and how they can offer better estimates of structure and motion than having only one or two views. We also revealed the existence of an inherent limitation posed by similarity ambiguity, on top of the known difficulties and limitations of the correspondence problem that impact the estimate quality. Lastly, we analyzed how we possibly address these in the real world via complex processing, inference, and reference points (inaccessible when looking at images). In the next article, we will discuss alternative systems that aim to alleviate some of these limitations.
Resources
- Hata, K. & Savarese, S. 2025. CS231A Course Notes 4: Stereo Systems and Structure from Motion. Stanford.
- Savarese, S. & Bohg, J. 2025. Lecture 7 Multi-view Geometry. Stanford.