Road to ML Engineer #62 - Structure from Motion

Last Edited: 6/16/2025

This blog post introduces structure from motion (SfM) in computer vision.


In the previous article, we discussed how having two perspectives helps us understand the 3D world via epipolar geometry, and also the difficulties and limitations that a two-view setup faces. One way of alleviating those issues is to increase the number of perspectives (by introducing more cameras or taking a video with a moving camera) and to jointly estimate the 3D scene and the camera parameters, much like humans do, through a technique called structure from motion (SfM). In this article, we will cover the basics of SfM and how having multiple perspectives enhances our understanding of the 3D world.

Affine Structure from Motion

To get started, we can formally define the structure from motion problem. In the general setup shown below, we have $m$ cameras with camera projection matrices $M_i$ and $n$ 3D points $X_j$ in the scene, assumed to be visible to all the cameras. The aim is to recover both the structure of the scene (the $n$ points $X_j$) and the camera motion (the $m$ camera projections $M_i$) from all the corresponding projections $x_{ij}$. To understand the approach to the SfM problem, we can start by tackling a simpler version of the problem, where cameras perform affine transformations instead of projective transformations (affine cameras).

[Figure: the general SfM setup with m cameras and n 3D points]

The simplification allows the affine camera transformation from $X$ to $x$ to be expressed simply in Euclidean coordinates as $x_{ij} = M_i X_j = [A_i \quad b_i] X_j$, where $A_i$ is a $2 \times 3$ matrix and $b_i$ is a 2D vector. Hence, each camera contributes 8 unknowns and each point 3, while the reconstruction can only be determined up to a 3D affine transformation (12 parameters). We can compare the number of unknowns and the $2mn$ equations from the $mn$ observations ($2mn \ge 8m + 3n - 12$) to determine if we have enough observations or corresponding projections.
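For instance, with $m = 2$ affine cameras, the condition becomes $4n \ge 16 + 3n - 12$, i.e. $n \ge 4$: we need at least four point correspondences across the two views.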

Tomasi-Kanade Factorization Method

Given a sufficient number of observations, we can deduce information about structure and motion using the Tomasi-Kanade factorization method, which relies on normalization and factorization. The method first determines the centroid of the projections for each camera as $\bar{x}_i = \frac{1}{n}\sum_{k=1}^{n} x_{ik}$ and normalizes the projections using $\hat{x}_{ij} = x_{ij} - \bar{x}_i$, which can be further derived as follows.

$$\hat{x}_{ij} = x_{ij} - \bar{x}_i = A_i X_j + b_i - \frac{1}{n}\sum_{k=1}^{n} A_i X_k - \frac{1}{n}\sum_{k=1}^{n} b_i = A_i\left(X_j - \frac{1}{n}\sum_{k=1}^{n} X_k\right) = A_i(X_j - \bar{X}) = A_i\hat{X}_j$$

Here, $\bar{X}$ is the centroid of the 3D points. If we define $\bar{X}$ to be the center of the world reference system $(0, 0, 0)^T$, then we can set $\hat{X}_j = X_j$ and simplify the relationship between the projections, structure, and motion to $\hat{x}_{ij} = A_i X_j$. Then, we can define a $2m \times n$ measurement matrix $D$ that contains the set of normalized projections (it's $2m$ because each projection is a 2D vector) and rewrite the relationship as $D = AX = MS$, where $A$ contains all the $A_i$ as rows and $X = S$ contains all the $X_j$ as columns.
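Concretely, the factorization stacks the normalized projections, camera matrices, and points as follows.

$$D = \begin{bmatrix} \hat{x}_{11} & \hat{x}_{12} & \cdots & \hat{x}_{1n} \\ \vdots & & & \vdots \\ \hat{x}_{m1} & \hat{x}_{m2} & \cdots & \hat{x}_{mn} \end{bmatrix} = \begin{bmatrix} A_1 \\ \vdots \\ A_m \end{bmatrix} \begin{bmatrix} X_1 & X_2 & \cdots & X_n \end{bmatrix} = MS$$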

Since $D$ is expressed as the product of a $2m \times 3$ matrix and a $3 \times n$ matrix, it has rank 3 and can be used to estimate structure and motion $S$ and $M$ by performing a rank-3 approximation $U_3\Sigma_3V_3^T$ with SVD (where $M = U_3\sqrt{\Sigma_3}$ and $S = \sqrt{\Sigma_3}V_3^T$). However, the estimation has affine ambiguity, since any invertible affine transformation $H$ (rotation, translation, scaling, and shearing) can be applied as $D = (MH^{-1})(HS)$ and will still result in the same $D$.
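As a rough sketch of how the factorization could look in code, here is a minimal NumPy example assuming a fully observed measurement array `x` of shape `(m, n, 2)`; the function and variable names are illustrative, not from any particular library.

```python
import numpy as np

def tomasi_kanade(x):
    """Affine SfM via Tomasi-Kanade factorization.

    x: (m, n, 2) array of 2D projections of n points seen by m affine cameras
       (every point is assumed visible in every view).
    Returns the motion matrix M (2m x 3) and structure S (3 x n),
    both defined only up to an affine ambiguity.
    """
    m, n, _ = x.shape

    # Normalize: subtract each camera's centroid from its projections.
    x_hat = x - x.mean(axis=1, keepdims=True)

    # Build the 2m x n measurement matrix D (two rows per camera: x-row, y-row).
    D = x_hat.transpose(0, 2, 1).reshape(2 * m, n)

    # Rank-3 approximation via SVD: D ~= U3 @ diag(s3) @ Vt3.
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    U3, s3, Vt3 = U[:, :3], s[:3], Vt[:3, :]

    # Split the singular values between motion and structure.
    M = U3 * np.sqrt(s3)             # (2m, 3), stacked A_i blocks
    S = np.sqrt(s3)[:, None] * Vt3   # (3, n), 3D points up to affine ambiguity
    return M, S
```

Note that the returned $M$ and $S$ are only one valid factorization; multiplying by any invertible $H$ as above gives another, which is exactly the affine ambiguity described in the text.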

Perspective Structure from Motion

In the real-world scenario where cameras perform projective transformations, we can perform factorization using $x = MX$ in homogeneous coordinates, given enough observations to solve for the $11m + 3n - 15$ unknowns (each $M_i$ has 11 unknowns, and the reconstruction is only determined up to a projective transformation with 15 parameters) with the $2mn$ equations. However, similarly to affine SfM, any invertible projective transformation $H$ can be applied to $M$ and $S$, meaning the estimates are inherently subject to perspective ambiguity.
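For instance, with $m = 2$ projective cameras, the condition $4n \ge 22 + 3n - 15$ reduces to $n \ge 7$, matching the well-known minimum of seven point correspondences needed to estimate two-view geometry (as in the seven-point algorithm for the fundamental matrix).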

Although still subject to perspective ambiguity, we can also estimate $M$ and $S$ from enough observations using an algebraic approach based on epipolar geometry (if you are unsure about epipolar geometry, check out the article Road to ML Engineer #61 - Epipolar Geometry). First, we can express $\tilde{M}_1 = M_1H^{-1} = [I \quad 0]$, $\tilde{M}_2 = M_2H^{-1} = [A \quad b]$, and $\tilde{X} = HX$ to account for the perspective ambiguity. Then, we can establish the relationships between the projections, structure, and motion as $x = \tilde{M}_1\tilde{X} = [I \quad 0]\tilde{X}$ and $x' = \tilde{M}_2\tilde{X} = [A \quad b]\tilde{X}$.

Here, we can further expand $x'$ and obtain $x' = A[I \quad 0]\tilde{X} + b = Ax + b$, which expresses the relationship between the corresponding projections. Since $Ax$ and $b$ both lie in the epipolar plane, their cross product $b \times x' = b \times (Ax + b) = b \times Ax$ yields a vector normal to the epipolar plane. Because $x'$ also lies in that plane, it is orthogonal to this normal vector, so $x'^T(b \times Ax) = 0$. Using the matrix-multiplication form of the cross product, we arrive at $x'^T[b_{\times}]Ax = 0$, which matches the epipolar constraint $p'^TFp = 0$ and gives $F = [b_{\times}]A$.

From the above, we can get $A = -[b_{\times}]F$ and $\tilde{M}_2 = [-[b_{\times}]F \quad b]$. Furthermore, since $b^T[b_{\times}] = 0$, we have $b^TF = b^T[b_{\times}]A = 0$, meaning that $b$ is the epipole $e$ (the projection of the first camera's center onto the second image) and $\tilde{M}_2 = [-[e_{\times}]F \quad e]$. We can obtain an estimate of $F$ using the eight-point algorithm we covered in the previous article, which in turn gives an estimate of $\tilde{M}_2$. Finally, we can repeat this with all the other cameras to get estimates of all the $\tilde{M}_i$, which can then be used to estimate $\tilde{X}$ via triangulation.
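As an illustrative NumPy sketch of this algebraic recipe (assuming $F$ has already been estimated, e.g. with the eight-point algorithm, and that corresponding image points are given in homogeneous coordinates; the helper names here are my own):

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x such that skew(v) @ w == np.cross(v, w)."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def cameras_from_F(F):
    """Recover a canonical camera pair (up to projective ambiguity) from F."""
    # The epipole e in the second image satisfies e^T F = 0,
    # i.e. it is the left null vector of F (last left singular vector).
    U, _, _ = np.linalg.svd(F)
    e = U[:, -1]
    M1 = np.hstack([np.eye(3), np.zeros((3, 1))])    # [I | 0]
    M2 = np.hstack([-skew(e) @ F, e.reshape(3, 1)])  # [-[e]_x F | e]
    return M1, M2

def triangulate(M1, M2, x1, x2):
    """Linear (DLT) triangulation of one correspondence (homogeneous 2D points)."""
    x1, x2 = x1 / x1[2], x2 / x2[2]
    A = np.vstack([
        x1[0] * M1[2] - M1[0],
        x1[1] * M1[2] - M1[1],
        x2[0] * M2[2] - M2[0],
        x2[1] * M2[2] - M2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X / X[-1]   # homogeneous 3D point, defined up to projective ambiguity
```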

Bundle Adjustment & Self-Calibration

In the previous sections, we saw how we can use the factorization approach with SVD and the algebraic approach using $F$ and triangulation to get estimates of structure and motion with perspective ambiguity in the general case. However, both are subject to inherent limitations: the factorization approach assumes all points are visible to all cameras, which is often not true due to occlusions, and the algebraic approach only processes pairs of camera perspectives and cannot jointly optimize the estimates across all cameras.

To address these limitations, we often use bundle adjustment, where we apply a non-linear method to refine the estimates (after factorization and/or the algebraic approach) by minimizing the reprojection error $E(M, X) = \sum_{i=1}^m \sum_{j=1}^n D(x_{ij}, M_iX_j)^2$, where $D(\cdot, \cdot)$ here denotes the 2D distance between an observed projection and the reprojected point. This method allows us to arrive at better estimates as more views are taken into account (which reduces the impact of each occlusion and error to some extent) during optimization. Even after bundle adjustment, however, the quality of the estimates is still hindered by the difficulties and limitations of the correspondence problem and by perspective ambiguity.
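Below is a minimal sketch of bundle adjustment, assuming SciPy is available and that each camera is parameterized directly by the 12 entries of its 3x4 projection matrix (real systems usually parameterize intrinsics and extrinsics separately and exploit the sparsity of the problem):

```python
import numpy as np
from scipy.optimize import least_squares

def reprojection_residuals(params, m, n, observations):
    """Residuals between observed 2D points and reprojected 3D points.

    params: flattened [m 3x4 camera matrices, n 3D points].
    observations: list of (cam_idx, point_idx, observed_xy) tuples,
                  so occluded (unobserved) points are simply left out.
    """
    Ms = params[:12 * m].reshape(m, 3, 4)
    Xs = params[12 * m:].reshape(n, 3)
    residuals = []
    for i, j, xy in observations:
        X_h = np.append(Xs[j], 1.0)   # homogeneous 3D point
        proj = Ms[i] @ X_h
        proj = proj[:2] / proj[2]     # perspective division
        residuals.extend(proj - xy)
    return np.array(residuals)

def bundle_adjust(Ms_init, Xs_init, observations):
    """Jointly refine all cameras and points by minimizing reprojection error."""
    m, n = len(Ms_init), len(Xs_init)
    x0 = np.concatenate([np.ravel(Ms_init), np.ravel(Xs_init)])
    result = least_squares(reprojection_residuals, x0, args=(m, n, observations))
    Ms = result.x[:12 * m].reshape(m, 3, 4)
    Xs = result.x[12 * m:].reshape(n, 3)
    return Ms, Xs
```

Because only the observed camera-point pairs contribute residuals, this formulation naturally accommodates occlusions, which is one of the main reasons bundle adjustment is applied after factorization or the algebraic approach.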

We can address the ambiguity to some extent through self-calibration, which can be achieved using single-view metrology constraints (such as vanishing points and horizon lines) and other approaches. Since self-calibration allows us to reduce the ambiguity in the camera projection matrix $M$, we can reduce perspective (or affine) ambiguity to similarity ambiguity (rotation, translation, and scaling) during bundle adjustment. This enables a clearer understanding of the shape of 3D objects and the 3D scene (useful for 3D reconstruction). However, the fact that similarity ambiguity remains even with an intrinsically calibrated camera reveals an inherent limitation: the absolute scales and positions of objects cannot be recovered from any number of images alone.

We may find the relative depths of objects with triangulation and disparity, as we have discussed in this and previous articles, but we simply cannot find their absolute scales and positions without making further assumptions (such as an educated guess of an object's size) or collecting more data (such as calibration using points with known positions in the world reference system). Our own vision is most likely running all of these processes, solving structure from motion and disambiguating object scales to some extent with educated guesses and calibration against reference points like our hands, to estimate absolute scales and positions.

Conclusion

In this article, we introduced affine and perspective SfM with multiple views and saw how they can offer better estimates of structure and motion than one or two views alone. We also revealed an inherent limitation posed by similarity ambiguity, on top of the known difficulties and limitations of the correspondence problem that impact estimate quality. Lastly, we analyzed how our own vision possibly addresses these in the real world via complex processing, inference, and reference points (which are inaccessible when looking only at images). In the next article, we will discuss alternative systems that aim to alleviate some of these limitations.

Resources