Road to ML Engineer #62 - Structure from Motion

Last Edited: 6/16/2025

This blog post introduces structure from motion (SfM) in computer vision.


In the previous article, we discussed how having two perspectives helps us understand the 3D world via epipolar geometry, and also the difficulties and limitations that a two-view setup faces. One way of alleviating those issues is to increase the number of perspectives (by introducing more cameras or taking a video with a moving camera) and to jointly estimate the 3D scene and the camera parameters, much like humans do, through a technique called structure from motion (SfM). In this article, we will cover the basics of SfM and how having multiple perspectives enhances our understanding of the 3D world.

Affine Structure from Motion

To get started, we can formally define the structure from motion problem. In the general setup shown below, we have $m$ cameras with camera projection matrices $M_i$ and $n$ 3D points $X_j$ in the scene, assumed to be visible to all the cameras. The aim is to recover both the structure of the scene (the $n$ points $X_j$) and the camera motion (the $m$ camera projections $M_i$) from all the corresponding projections $x_{ij}$. To understand the approach to the SfM problem, we can start by tackling a simpler version of the problem, where cameras perform affine transformations instead of projective transformations (affine cameras).

[Figure: the general SfM setup with m cameras and n 3D points]

The simplification allows the affine camera transformation from $X$ to $x$ to be expressed simply in Euclidean coordinates as $x_{ij} = M_i X_j = [A_i \quad b_i] X_j$, where $A_i$ is a $2 \times 3$ matrix and $b_i$ is a 2D vector. Hence, each camera contributes 8 unknowns and each point 3, while the reconstruction can only be determined up to a 3D affine transformation (12 parameters). We can compare the number of unknowns and the $2mn$ equations from the $mn$ observations ($2mn \ge 8m + 3n - 12$) to determine if we have enough observations or corresponding projections.
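For instance, with $m = 2$ affine cameras, the condition becomes $4n \ge 16 + 3n - 12$, i.e. $n \ge 4$: we need at least four point correspondences across the two views.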

Tomasi-Kanade Factorization Method

Given a sufficient number of observations, we can deduce information about structure and motion using the Tomasi-Kanade factorization method, which relies on normalization and factorization. The method first determines the centroid of the projections for each camera as $\bar{x}_i = \frac{1}{n}\sum_{k=1}^{n} x_{ik}$ and normalizes the projections using $\hat{x}_{ij} = x_{ij} - \bar{x}_i$, which can be further derived as follows.

$$\hat{x}_{ij} = x_{ij} - \bar{x}_i = A_i X_j + b_i - \frac{1}{n}\sum_{k=1}^{n} A_i X_k - \frac{1}{n}\sum_{k=1}^{n} b_i = A_i\left(X_j - \frac{1}{n}\sum_{k=1}^{n} X_k\right) = A_i(X_j - \bar{X}) = A_i\hat{X}_j$$

Here, $\bar{X}$ is the centroid of the 3D points. If we define $\bar{X}$ to be the center of the world reference system $(0, 0, 0)^T$, then we can set $\hat{X}_j = X_j$ and simplify the relationship between the projections, structure, and motion to $\hat{x}_{ij} = A_i X_j$. Then, we can define a $2m \times n$ measurement matrix $D$ that contains the set of normalized projections (it's $2m$ because each projection is a 2D vector) and rewrite the relationship as $D = AX = MS$, where $A$ contains all the $A_i$ as rows and $X = S$ contains all the $X_j$ as columns.
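Concretely, the factorization stacks the normalized projections, camera matrices, and points as follows.

$$D = \begin{bmatrix} \hat{x}_{11} & \hat{x}_{12} & \cdots & \hat{x}_{1n} \\ \vdots & & & \vdots \\ \hat{x}_{m1} & \hat{x}_{m2} & \cdots & \hat{x}_{mn} \end{bmatrix} = \begin{bmatrix} A_1 \\ \vdots \\ A_m \end{bmatrix} \begin{bmatrix} X_1 & X_2 & \cdots & X_n \end{bmatrix} = MS$$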

Since $D$ is expressed as the product of a $2m \times 3$ matrix and a $3 \times n$ matrix, it has rank 3 and can be used to estimate structure and motion $S$ and $M$ by performing a rank-3 approximation $U_3\Sigma_3V_3^T$ with SVD (where $M = U_3\sqrt{\Sigma_3}$ and $S = \sqrt{\Sigma_3}V_3^T$). However, the estimation has affine ambiguity, since any invertible affine transformation $H$ (rotation, translation, scaling, and shearing) can be applied as $D = (MH^{-1})(HS)$ and will still result in the same $D$.
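As a rough sketch of how the factorization could look in code, here is a minimal NumPy example assuming a fully observed measurement array `x` of shape `(m, n, 2)`; the function and variable names are illustrative, not from any particular library.

```python
import numpy as np

def tomasi_kanade(x):
    """Affine SfM via Tomasi-Kanade factorization.

    x: (m, n, 2) array of 2D projections of n points seen by m affine cameras
       (every point is assumed visible in every view).
    Returns the motion matrix M (2m x 3) and structure S (3 x n),
    both defined only up to an affine ambiguity.
    """
    m, n, _ = x.shape

    # Normalize: subtract each camera's centroid from its projections.
    x_hat = x - x.mean(axis=1, keepdims=True)

    # Build the 2m x n measurement matrix D (two rows per camera: x-row, y-row).
    D = x_hat.transpose(0, 2, 1).reshape(2 * m, n)

    # Rank-3 approximation via SVD: D ~= U3 @ diag(s3) @ Vt3.
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    U3, s3, Vt3 = U[:, :3], s[:3], Vt[:3, :]

    # Split the singular values between motion and structure.
    M = U3 * np.sqrt(s3)             # (2m, 3), stacked A_i blocks
    S = np.sqrt(s3)[:, None] * Vt3   # (3, n), 3D points up to affine ambiguity
    return M, S
```

Note that the returned $M$ and $S$ are only one valid factorization; multiplying by any invertible $H$ as above gives another, which is exactly the affine ambiguity described in the text.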

Perspective Structure from Motion

In the real-world scenario where cameras perform projective transformations, we can perform factorization using $x = MX$ in homogeneous coordinates, given enough observations to solve for the $11m + 3n - 15$ unknowns (each $M_i$ has 11 unknowns, and the reconstruction is only determined up to a projective transformation with 15 parameters) with the $2mn$ equations. However, similarly to affine SfM, any invertible projective transformation $H$ can be applied to $M$ and $S$, meaning the estimates are inherently subject to perspective ambiguity.
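For instance, with $m = 2$ projective cameras, the condition $4n \ge 22 + 3n - 15$ reduces to $n \ge 7$, matching the well-known minimum of seven point correspondences needed to estimate two-view geometry (as in the seven-point algorithm for the fundamental matrix).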

Although still subject to perspective ambiguity, we can also estimate $M$ and $S$ from enough observations using an algebraic approach based on epipolar geometry (if you are unsure about epipolar geometry, check out the article Road to ML Engineer #61 - Epipolar Geometry). First, we can express $\tilde{M}_1 = M_1H^{-1} = [I \quad 0]$, $\tilde{M}_2 = M_2H^{-1} = [A \quad b]$, and $\tilde{X} = HX$ to account for the perspective ambiguity. Then, we can establish the relationships between the projections, structure, and motion as $x = \tilde{M}_1\tilde{X} = [I \quad 0]\tilde{X}$ and $x' = \tilde{M}_2\tilde{X} = [A \quad b]\tilde{X}$.

Here, we can further expand $x'$ and obtain $x' = A[I \quad 0]\tilde{X} + b = Ax + b$, which expresses the relationship between the corresponding projections. Since $Ax$ and $b$ both lie in the epipolar plane, their cross product $b \times x' = b \times (Ax + b) = b \times Ax$ yields a vector normal to the epipolar plane. Because $x'$ also lies in that plane, it is orthogonal to this normal vector, so $x'^T(b \times Ax) = 0$. Using the matrix-multiplication form of the cross product, we arrive at $x'^T[b_{\times}]Ax = 0$, which matches the epipolar constraint $p'^TFp = 0$ and gives $F = [b_{\times}]A$.

From the above, we can get $A = -[b_{\times}]F$ and $\tilde{M}_2 = [-[b_{\times}]F \quad b]$. Furthermore, since $b^T[b_{\times}] = 0$, we have $b^TF = b^T[b_{\times}]A = 0$, meaning that $b$ is the epipole $e$ (the projection of the first camera's center onto the second image) and $\tilde{M}_2 = [-[e_{\times}]F \quad e]$. We can obtain an estimate of $F$ using the eight-point algorithm we covered in the previous article, which in turn gives an estimate of $\tilde{M}_2$. Finally, we can repeat this with all the other cameras to get estimates of all the $\tilde{M}_i$, which can then be used to estimate $\tilde{X}$ via triangulation.
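As an illustrative NumPy sketch of this algebraic recipe (assuming $F$ has already been estimated, e.g. with the eight-point algorithm, and that corresponding image points are given in homogeneous coordinates; the helper names here are my own):

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x such that skew(v) @ w == np.cross(v, w)."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def cameras_from_F(F):
    """Recover a canonical camera pair (up to projective ambiguity) from F."""
    # The epipole e in the second image satisfies e^T F = 0,
    # i.e. it is the left null vector of F (last left singular vector).
    U, _, _ = np.linalg.svd(F)
    e = U[:, -1]
    M1 = np.hstack([np.eye(3), np.zeros((3, 1))])    # [I | 0]
    M2 = np.hstack([-skew(e) @ F, e.reshape(3, 1)])  # [-[e]_x F | e]
    return M1, M2

def triangulate(M1, M2, x1, x2):
    """Linear (DLT) triangulation of one correspondence (homogeneous 2D points)."""
    x1, x2 = x1 / x1[2], x2 / x2[2]
    A = np.vstack([
        x1[0] * M1[2] - M1[0],
        x1[1] * M1[2] - M1[1],
        x2[0] * M2[2] - M2[0],
        x2[1] * M2[2] - M2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X / X[-1]   # homogeneous 3D point, defined up to projective ambiguity
```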

Bundle Adjustment & Self-Calibration

In the previous sections, we saw how we can use the factorization approach with SVD and the algebraic approach using $F$ and triangulation to get estimates of structure and motion with perspective ambiguity in the general case. However, both are subject to inherent limitations: the factorization approach assumes all points are visible to all cameras, which is often not true due to occlusions, and the algebraic approach only processes pairs of camera perspectives and cannot jointly optimize the estimates across all cameras.

To address these limitations, we often use bundle adjustment, where we apply a non-linear method to refine the estimates (after factorization and/or the algebraic approach) by minimizing the reprojection error $E(M, X) = \sum_{i=1}^m \sum_{j=1}^n D(x_{ij}, M_iX_j)^2$, where $D(\cdot, \cdot)$ here denotes the 2D distance between an observed projection and the reprojected point. This method allows us to arrive at better estimates as more views are taken into account (which reduces the impact of each occlusion and error to some extent) during optimization. Even after bundle adjustment, however, the quality of the estimates is still hindered by the difficulties and limitations of the correspondence problem and by perspective ambiguity.
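Below is a minimal sketch of bundle adjustment, assuming SciPy is available and that each camera is parameterized directly by the 12 entries of its 3x4 projection matrix (real systems usually parameterize intrinsics and extrinsics separately and exploit the sparsity of the problem):

```python
import numpy as np
from scipy.optimize import least_squares

def reprojection_residuals(params, m, n, observations):
    """Residuals between observed 2D points and reprojected 3D points.

    params: flattened [m 3x4 camera matrices, n 3D points].
    observations: list of (cam_idx, point_idx, observed_xy) tuples,
                  so occluded (unobserved) points are simply left out.
    """
    Ms = params[:12 * m].reshape(m, 3, 4)
    Xs = params[12 * m:].reshape(n, 3)
    residuals = []
    for i, j, xy in observations:
        X_h = np.append(Xs[j], 1.0)   # homogeneous 3D point
        proj = Ms[i] @ X_h
        proj = proj[:2] / proj[2]     # perspective division
        residuals.extend(proj - xy)
    return np.array(residuals)

def bundle_adjust(Ms_init, Xs_init, observations):
    """Jointly refine all cameras and points by minimizing reprojection error."""
    m, n = len(Ms_init), len(Xs_init)
    x0 = np.concatenate([np.ravel(Ms_init), np.ravel(Xs_init)])
    result = least_squares(reprojection_residuals, x0, args=(m, n, observations))
    Ms = result.x[:12 * m].reshape(m, 3, 4)
    Xs = result.x[12 * m:].reshape(n, 3)
    return Ms, Xs
```

Because only the observed camera-point pairs contribute residuals, this formulation naturally accommodates occlusions, which is one of the main reasons bundle adjustment is applied after factorization or the algebraic approach.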

We can address the ambiguity to some extent through self-calibration, which can be achieved using single-view metrology constraints (such as vanishing points and horizon lines) and other approaches. Since self-calibration allows us to reduce the ambiguity in the camera projection matrix $M$, we can reduce perspective (or affine) ambiguity to similarity ambiguity (rotation, translation, and scaling) during bundle adjustment. This enables a clearer understanding of the shape of 3D objects and the 3D scene (useful for 3D reconstruction). However, the fact that similarity ambiguity remains even with an intrinsically calibrated camera reveals an inherent limitation: the absolute scales and positions of objects cannot be recovered from any number of images alone.

We may find the relative depths of objects with triangulation and disparity, as we have discussed in this and previous articles, but we simply cannot find their absolute scales and positions without making further assumptions (such as an educated guess of an object's size) or collecting more data (such as calibration using points with known positions in the world reference system). Our own vision is most likely running all of these processes, solving structure from motion and disambiguating object scales to some extent with educated guesses and calibration against reference points like our hands, to estimate absolute scales and positions.

Conclusion

In this article, we introduced affine and perspective SfM with multiple views and saw how they can offer better estimates of structure and motion than one or two views alone. We also revealed an inherent limitation posed by similarity ambiguity, on top of the known difficulties and limitations of the correspondence problem that impact estimate quality. Lastly, we analyzed how our own vision possibly addresses these in the real world via complex processing, inference, and reference points (which are inaccessible when looking only at images). In the next article, we will discuss alternative systems that aim to alleviate some of these limitations.

Resources