This blog post introduces deep learning algorithms for V-SLAM/SaM in computer vision.

In the previous article, we covered optical flow and feature tracking (Bayes filters) as related computer vision tasks for Structure from Motion (SfM). However, there are other computer vision tasks that are more closely related to SfM, called Visual Simultaneous Localization and Mapping (V-SLAM) or Structure and Motion (SaM). As their names suggest, these tasks involve estimating camera poses (rotation and translation) and depths of points to simultaneously localize the camera and map the scene captured by video frames or multiple images from different angles. In this article, we will cover advanced deep learning algorithms for V-SLAM/SaM and their prerequisite knowledge.
Gauss-Newton Method
In the article on SfM, we touched on using a non-linear method for bundle adjustment. The primary method used for bundle adjustment here is the Gauss-Newton method, which is an extension of Newton's method, or the Newton-Raphson method. Newton-Raphson uses a second-order Taylor expansion of a non-linear function and gradually approaches a minimum by finding the minimum of the quadratic approximation, whose derivation is shown below.
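A sketch of that derivation (in my own notation, since the original figure is not reproduced here) is:

$$
f(x_k + \Delta x) \;\approx\; f(x_k) + f'(x_k)\,\Delta x + \tfrac{1}{2} f''(x_k)\,\Delta x^{2}
$$
$$
f'(x_k) + f''(x_k)\,\Delta x = 0 \;\;\Longrightarrow\;\; \Delta x = -\frac{f'(x_k)}{f''(x_k)}
$$
$$
\text{(multivariate case)}\qquad \Delta\mathbf{x} = -\mathbf{H}^{-1}\,\nabla f(\mathbf{x}_k), \qquad \mathbf{x}_{k+1} = \mathbf{x}_k + \Delta\mathbf{x}
$$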
Here, we assume that the second derivative of the quadratic approximation (the Hessian) is positive (positive definite in the multivariate case), so the quadratic approximation is convex, allowing us to find its minimum by taking the derivative and setting it to zero. This yields the appropriate update value in terms of the gradient (Jacobian) and Hessian, which moves us toward the minimum. Unlike gradient descent, this method considers the Hessian (the local curvature), so it can converge faster for some functions. However, it can be unstable when the assumption of a positive-definite Hessian is not met. Furthermore, computing the Hessian is often computationally expensive, making Newton's method impractical. The Gauss-Newton method addresses this by approximating the Hessian and applying Newton's method to minimize the sum of squared residuals.
In the above, we computed the first and second derivatives of the squared-residual objective $E(\mathbf{x}) = \frac{1}{2}\sum_i r_i(\mathbf{x})^2$ with respect to the parameters $\mathbf{x}$ to apply Newton's method and find the minimizer $\mathbf{x}^*$. To approximate the second derivative of $E$, we first take the derivative of the gradient $\nabla E = J^\top \mathbf{r}$ using the product rule and then discard the second term of the product (which involves second derivatives of the residuals) to approximate the Hessian as $H \approx J^\top J$. This allows us to express the parameter update as $\Delta\mathbf{x} = -(J^\top J)^{-1} J^\top \mathbf{r}$, using only the Jacobian matrix $J$ and the residual column vector $\mathbf{r}$, which can be obtained easily. We typically use this Gauss-Newton method for non-linear bundle adjustment after obtaining initial estimates of points and motion from factorization and triangulation, and it is a prerequisite for SLAM.
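As a concrete illustration of the update $\Delta\mathbf{x} = -(J^\top J)^{-1} J^\top \mathbf{r}$, here is a minimal Gauss-Newton loop on a toy non-linear least-squares problem (fitting $y = a e^{bx}$); this is my own sketch, not bundle-adjustment code.

```python
import numpy as np

# Toy Gauss-Newton: fit y = a * exp(b * x) by minimizing the sum of squared
# residuals r_i = y_i - a * exp(b * x_i).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * np.exp(1.5 * x) + 0.05 * rng.standard_normal(x.size)

params = np.array([1.0, 1.0])  # initial guess for (a, b)
for _ in range(20):
    a, b = params
    r = y - a * np.exp(b * x)                              # residual vector
    # Jacobian of the residuals: dr/da = -exp(bx), dr/db = -a * x * exp(bx)
    J = np.column_stack([-np.exp(b * x), -a * x * np.exp(b * x)])
    # Gauss-Newton step: (J^T J) dx = -J^T r, with H ~= J^T J approximating the Hessian.
    dx = np.linalg.solve(J.T @ J, -J.T @ r)
    params = params + dx

print("estimated (a, b):", params)  # close to the true (2.0, 1.5)
```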
Mahalanobis Distance
Another prerequisite for SLAM is the Mahalanobis distance, which measures the distance between an observation and a distribution. It is a generalization of the Z-score $z = \frac{x - \mu}{\sigma}$, which measures how many standard deviations away $x$ is from the mean $\mu$, to a distribution with a multivariate mean and covariance. Given a probability distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$, the Mahalanobis distance of $\mathbf{x}$ from the distribution is computed as $d_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})}$. Since $\Sigma^{-1}$ is positive semi-definite, the quantity under the square root is non-negative, so the distance is always defined.
We can also compute the Mahalanobis distance between $\mathbf{x}$ and another point $\mathbf{y}$ with respect to a distribution with covariance matrix $\Sigma$ as $d_M(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^\top \Sigma^{-1} (\mathbf{x} - \mathbf{y})}$. We can see that the Mahalanobis distance reduces to the Euclidean distance when the distribution has unit variance ($\Sigma = I$). In the context of SLAM, we use the Mahalanobis distance instead of the Euclidean distance for computing discrepancies between a set of previous correspondence predictions and a set of updated correspondence predictions, as we assume that points in the set of correspondence predictions are correlated (the Mahalanobis distance can account for correlation/covariance and is also scale-invariant and less sensitive to outliers, although at the expense of increased computational complexity).
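As a small numerical illustration (the toy covariance below is my own example, not SLAM data), the Mahalanobis distance can be computed as:

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance between x and y under covariance matrix cov."""
    diff = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Two correlated dimensions with different scales.
cov = np.array([[4.0, 1.2],
                [1.2, 1.0]])
x = np.array([2.0, 1.0])
mu = np.array([0.0, 0.0])

print("Euclidean   :", np.linalg.norm(x - mu))
print("Mahalanobis :", mahalanobis(x, mu, cov))
print("Unit cov    :", mahalanobis(x, mu, np.eye(2)))  # equals the Euclidean distance
```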
RAFT
The final prerequisite to SLAM is RAFT (Recurrent All-Pairs Field Transforms), a successful deep learning approach to optical flow estimation that makes use of a 4D correlation volume and iterative refinement by a ConvGRU informed by the correlation volume. RAFT first constructs a 4D correlation volume and a correlation pyramid by taking the feature map at 1/8 of the images' original resolution ($\tfrac{H}{8} \times \tfrac{W}{8} \times D$) from each frame using a feature encoder, taking the dot product of all pairs of feature vectors ($\mathbf{C}_{ijkl} = \sum_{d} g_\theta(I_1)_{ijd}\, g_\theta(I_2)_{kld}$), and pooling the last two dimensions of the result with kernel sizes 1, 2, 4, and 8 to arrive at a pyramid that preserves high resolution while capturing small to large potential displacements.
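To make the construction concrete, here is a rough PyTorch sketch of the all-pairs correlation volume and pooled pyramid (the shapes and names are my own simplification, not the official RAFT code):

```python
import torch
import torch.nn.functional as F

def correlation_pyramid(fmap1, fmap2, num_levels=4):
    """Build an all-pairs correlation volume and pool it into a pyramid.

    fmap1, fmap2: feature maps of shape (B, D, H, W) at 1/8 input resolution.
    Returns a list of tensors of shape (B*H*W, 1, H/2^l, W/2^l).
    """
    B, D, H, W = fmap1.shape
    # Dot product between every pair of feature vectors: (B, H*W, H*W).
    corr = torch.matmul(fmap1.view(B, D, H * W).transpose(1, 2),
                        fmap2.view(B, D, H * W)) / D ** 0.5
    # Reshape so the last two dims index pixels of frame 2, then pool them.
    corr = corr.reshape(B * H * W, 1, H, W)
    pyramid = [corr]
    for _ in range(num_levels - 1):
        corr = F.avg_pool2d(corr, kernel_size=2, stride=2)
        pyramid.append(corr)
    return pyramid

# Usage with random features (stand-ins for the feature-encoder outputs).
f1, f2 = torch.randn(1, 256, 32, 48), torch.randn(1, 256, 32, 48)
pyr = correlation_pyramid(f1, f2)
print([p.shape for p in pyr])  # decreasing resolution over the frame-2 dimensions
```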

Then, the update operator with a ConvGRU takes an initial flow estimate $f^0 = 0$ and iteratively adjusts the optical flow estimation by computing $f^{k+1} = f^k + \Delta f$, where $\Delta f$ is a predicted flow residual. Specifically, it concatenates the previous flow estimate $f^k$, context features from a context encoder, and correlations sampled (with bilinear sampling) from the correlation pyramid at the neighboring pixels within radius $r$ around the previously estimated corresponding pixel implied by the flow estimate $f^k$. This concatenated input is then processed with the hidden state $h^k$ to arrive at an updated hidden state $h^{k+1}$, which is passed through two convolutional layers to produce $\Delta f$. The resulting flow estimate after the final iteration has 1/8 of the images' original resolution and is upsampled to arrive at the final prediction. (For more details regarding the implementation and experimental results, I recommend checking out the original paper cited at the bottom of the article.)
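Continuing the sketch above, the correlation look-up around the current flow estimate could look roughly like this (again a simplification with assumed names, not the official implementation):

```python
import torch
import torch.nn.functional as F

def lookup(pyramid, flow, radius=4):
    """Sample correlation features around the current correspondence estimate.

    pyramid: list of (B*H*W, 1, H_l, W_l) correlation tensors (see previous sketch).
    flow:    current flow estimate of shape (B, 2, H, W) at 1/8 resolution.
    Returns features of shape (B, num_levels*(2r+1)^2, H, W) fed to the ConvGRU.
    """
    B, _, H, W = flow.shape
    # Pixel coordinates in frame 1 plus the current flow = estimated match in frame 2.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    coords = torch.stack([xs, ys], dim=0).float()[None] + flow       # (B, 2, H, W)
    coords = coords.permute(0, 2, 3, 1).reshape(B * H * W, 1, 1, 2)  # one point per row

    # Local window offsets of radius r around each estimated correspondence.
    d = torch.arange(-radius, radius + 1).float()
    dy, dx = torch.meshgrid(d, d, indexing="ij")
    window = torch.stack([dx, dy], dim=-1).reshape(1, 2 * radius + 1, 2 * radius + 1, 2)

    out = []
    for lvl, corr in enumerate(pyramid):
        _, _, H_l, W_l = corr.shape
        pts = coords / 2 ** lvl + window                      # (B*H*W, 2r+1, 2r+1, 2)
        # Normalize to [-1, 1] for bilinear sampling with grid_sample.
        grid = torch.stack([2 * pts[..., 0] / (W_l - 1) - 1,
                            2 * pts[..., 1] / (H_l - 1) - 1], dim=-1)
        sampled = F.grid_sample(corr, grid, align_corners=True)  # (B*H*W, 1, 2r+1, 2r+1)
        out.append(sampled.view(B, H, W, -1))
    return torch.cat(out, dim=-1).permute(0, 3, 1, 2)

# Usage with a dummy pyramid (normally built from feature-map correlations).
B, H, W = 1, 32, 48
pyr = [torch.randn(B * H * W, 1, H // 2 ** l, W // 2 ** l) for l in range(4)]
feat = lookup(pyr, torch.zeros(B, 2, H, W))
print(feat.shape)  # (1, 4*81, 32, 48) for radius 4
```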
RAFT achieved state-of-the-art accuracy and efficiency on various datasets for optical flow estimation and, most importantly, demonstrated strong generalization capability. The original paper attributes this success to the deliberate design of the correlation pyramid and the correlation look-up with bilinear sampling, which efficiently compute and extract relevant correlation features while maintaining relatively high resolution (unlike autoencoder-style architectures), and to the light-weight iterative update that can flexibly adjust the flow estimate to the scene. Another important thing to note is that RAFT implicitly solves the correspondence problem through flow estimation (a pixel $\mathbf{p}$ in the first frame corresponds to $\mathbf{p} + f(\mathbf{p})$ in the second), which is essential for various 3D computer vision tasks involving multiple views and frames. Hence, we can consider leveraging RAFT's capability to infer structure and motion.
DROID-SLAM
The authors of RAFT (Teed, Z. & Deng, J.) leveraged these observations and developed DROID-SLAM, which became the state-of-the-art deep learning algorithm for V-SLAM at the time, exhibiting strong generalization capability. To adapt RAFT to the SLAM task, which involves processing potentially more than two images capturing the same points, DROID-SLAM constructs a frame graph by sampling 12 keyframes (keyframes are frames whose optical flow, as estimated by DROID-SLAM relative to the previous keyframe, is greater than 16px) and connecting keyframes within three time steps. When a new keyframe is tracked during inference, it discards either the oldest keyframe or the keyframe with the least average optical flow magnitude. Then, DROID-SLAM processes all frame pairs $(i, j)$ connected by edges in the graph.
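The reprojection relation in question (following the DROID-SLAM paper's notation, where $\Pi_c$ is the pinhole projection defined by the camera intrinsics, $\Pi_c^{-1}$ lifts pixels $\mathbf{p}_i$ with depths $\mathbf{d}_i$ back to 3D, and $\mathbf{G}_i \in SE(3)$ is the pose of frame $i$) is:

$$
\mathbf{p}_{ij} \;=\; \Pi_c\!\left(\mathbf{G}_{ij} \circ \Pi_c^{-1}\!\left(\mathbf{p}_i,\, \mathbf{d}_i\right)\right),
\qquad
\mathbf{G}_{ij} \;=\; \mathbf{G}_j \circ \mathbf{G}_i^{-1}
$$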
The above describes the formalization of V-SLAM in homogeneous coordinates that DROID-SLAM uses to relate the correspondence problem between pixels $\mathbf{p}_i$ in frame $i$ and their reprojections $\mathbf{p}_{ij}$ in frame $j$ to the camera poses $\mathbf{G}$ (camera extrinsic parameters) and depths $\mathbf{d}$, given the camera intrinsic parameters. DROID-SLAM aims to iteratively adjust the camera pose and depth estimates obtained from linear methods of SfM by using a dense bundle adjustment layer (DBA), which computes pose updates $\Delta\boldsymbol{\xi}$ and depth updates $\Delta\mathbf{d}$ for all pairs, such that the deduced correspondences $\mathbf{p}_{ij}$ agree with the iteratively updated correspondence predictions $\mathbf{p}^{*}_{ij}$ made by a RAFT-like model (whose architecture is almost identical to RAFT's, but which predicts a flow residual to adjust the correspondences as well as confidence weights $\mathbf{w}_{ij}$ for the DBA).
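Concretely, the DBA objective is a confidence-weighted reprojection error (again roughly in the paper's notation, with $\mathcal{E}$ the set of edges in the frame graph):

$$
E(\mathbf{G}', \mathbf{d}') \;=\; \sum_{(i,j)\in\mathcal{E}}
\left\lVert \mathbf{p}^{*}_{ij} \;-\; \Pi_c\!\left(\mathbf{G}'_{ij} \circ \Pi_c^{-1}\!\left(\mathbf{p}_i,\, \mathbf{d}'_i\right)\right) \right\rVert^{2}_{\Sigma_{ij}},
\qquad
\Sigma_{ij} = \operatorname{diag}\,\mathbf{w}_{ij}
$$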
The DBA specifically uses a weighted Mahalanobis distance between $\mathbf{p}^{*}_{ij}$ and $\mathbf{p}_{ij}$, as above, whose weights $\mathbf{w}_{ij}$ are computed by the ConvGRU units, and computes the updates $\Delta\boldsymbol{\xi}$ and $\Delta\mathbf{d}$ that minimize this distance. We can express the Mahalanobis distance in terms of a residual $\mathbf{r}_{ij} = \mathbf{p}^{*}_{ij} - \mathbf{p}_{ij}$ and use a first-order Taylor approximation to express it with the Jacobians of the reprojection, which we have access to, and the updates $\Delta\boldsymbol{\xi}$ and $\Delta\mathbf{d}$ that we wish to determine. Then, we can minimize the weighted squared residuals to minimize the weighted Mahalanobis distance, since the square root is a monotonically increasing function. Here, we can use the Gauss-Newton method (with a damping factor for stability and convergence) to set up a linear system of equations and solve it as follows.
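In block form (following the paper, with the pose updates $\Delta\boldsymbol{\xi}$ and depth updates $\Delta\mathbf{d}$ stacked), the resulting system and its Schur-complement solution are approximately:

$$
\begin{bmatrix} \mathbf{B} & \mathbf{E} \\ \mathbf{E}^{\top} & \mathbf{C} \end{bmatrix}
\begin{bmatrix} \Delta\boldsymbol{\xi} \\ \Delta\mathbf{d} \end{bmatrix}
=
\begin{bmatrix} \mathbf{v} \\ \mathbf{w} \end{bmatrix},
\qquad \mathbf{C}\ \text{diagonal},
$$
$$
\Delta\boldsymbol{\xi} = \left[\mathbf{B} - \mathbf{E}\,\mathbf{C}^{-1}\mathbf{E}^{\top}\right]^{-1}\!\left(\mathbf{v} - \mathbf{E}\,\mathbf{C}^{-1}\mathbf{w}\right),
\qquad
\Delta\mathbf{d} = \mathbf{C}^{-1}\!\left(\mathbf{w} - \mathbf{E}^{\top}\,\Delta\boldsymbol{\xi}\right).
$$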
This leads to the above linear system of equations, where $\mathbf{B}$, $\mathbf{E}$, and $\mathbf{C}$ are block matrices (with $\mathbf{C}$ diagonal). We can make use of the Schur complement (as shown above) to solve this efficiently (you can check the resource cited at the bottom of the article for the Schur complement; it essentially solves for one block of variables first and uses the result to solve for the other, operating in a lower dimension). To obtain the values of $\mathbf{B}$, $\mathbf{E}$, $\mathbf{C}$, $\mathbf{v}$, and $\mathbf{w}$, we need to compute the Jacobians of the reprojected correspondences with respect to the poses and depths, which can be done using the chain rule (the details of the derivations are available in the appendix of the original paper cited at the bottom of the article). DROID-SLAM is trained on the synthetic TartanAir dataset with a flow loss on the predicted correspondences and a pose loss comparing ground-truth and predicted poses, whose gradients are backpropagated through the DBA to the ConvGRU (and thus to the weights $\mathbf{w}_{ij}$), and it became the state-of-the-art SLAM algorithm with strong generalization capability to various scenes as well as to stereo and RGB-D cameras.
MegaSAM
Despite the performance and architectural design of DROID-SLAM, which led to strong generalizability, it still struggled with dynamic videos, where both cameras and objects move, such as those taken by handheld devices, hindering its application to real-world scenarios. MegaSAM, which received a best paper honorable mention at CVPR 2025, addresses this limitation by improving on DROID-SLAM. MegaSAM learns focal lengths along with poses and depths, using the more robust Levenberg-Marquardt style update (which augments the Gauss-Newton normal equations with a damping term, $(\mathbf{J}^{\top}\mathbf{J} + \lambda \mathbf{I})\,\Delta\mathbf{x} = -\mathbf{J}^{\top}\mathbf{r}$), adjusting the damping factor $\lambda$ based on curvature to balance speed and convergence. Furthermore, it learns an object-movement probability map for dynamic videos using an additional network, trained after first training the base model solely on static videos and then freezing it, and reflects the probability map in the confidence weights so that likely moving regions contribute less to the bundle adjustment.
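To illustrate how this differs from plain Gauss-Newton, here is a generic Levenberg-Marquardt loop on the same toy exponential fit used earlier; the damping schedule shown (increase $\lambda$ when a step fails, decrease when it succeeds) is the textbook heuristic, not MegaSAM's exact rule.

```python
import numpy as np

# Generic Levenberg-Marquardt on the toy fit y = a * exp(b * x).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * np.exp(1.5 * x) + 0.05 * rng.standard_normal(x.size)

def residuals(p):
    a, b = p
    return y - a * np.exp(b * x)

def jacobian(p):
    a, b = p
    return np.column_stack([-np.exp(b * x), -a * x * np.exp(b * x)])

params, lam = np.array([1.0, 1.0]), 1e-3
r = residuals(params)
cost = 0.5 * (r @ r)
for _ in range(50):
    r, J = residuals(params), jacobian(params)
    # Damped normal equations: (J^T J + lam * I) dx = -J^T r.
    dx = np.linalg.solve(J.T @ J + lam * np.eye(2), -J.T @ r)
    new_params = params + dx
    new_r = residuals(new_params)
    new_cost = 0.5 * (new_r @ new_r)
    if new_cost < cost:        # step helped: accept it and trust the quadratic model more
        params, cost, lam = new_params, new_cost, lam * 0.5
    else:                      # step hurt: reject it and behave more like gradient descent
        lam *= 2.0

print("estimated (a, b):", params)
```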
MegaSAM also incorporates a monocular depth prior from DepthAnything and initializes the camera focal length by randomly perturbing the ground-truth value by 25% during training. During inference, it further initializes depth and focal length using the monocular depth prior from DepthAnything and the metric monocular depth and focal length estimates from UniDepth, as shown above. It also employs regularization toward these monocular priors. The regularization weight is computed from the Hessian of the reprojection objective, because the Hessian reveals the sensitivity of the reprojection error to perturbations of the variables and thus indicates uncertainty. A high Hessian implies high sensitivity and low uncertainty, suggesting that less monocular regularization is preferred. Focal length optimization within the DBA is also deactivated if the corresponding Hessian entry is small, as a low Hessian suggests the focal length is likely unobservable from the input.
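Purely to illustrate this gating logic, a toy sketch might look like the following (the function names, the inverse-Hessian weighting, and the threshold value are my own illustrative assumptions, not MegaSAM's actual interface):

```python
import numpy as np

# Hypothetical sketch of uncertainty-gated regularization; names and the
# threshold value are illustrative assumptions, not MegaSAM's actual code.
def prior_regularization_weight(hessian_diag, alpha=1.0, eps=1e-8):
    # Large Hessian entries => the reprojection error is sensitive to the variable
    # (low uncertainty), so rely less on the monocular prior for it.
    return alpha / (hessian_diag + eps)

def focal_is_observable(hessian_focal, tau=1e-3):
    # If the Hessian w.r.t. focal length is tiny, the video barely constrains it,
    # so keep the focal length fixed instead of optimizing it in the DBA.
    return hessian_focal > tau

hess = np.array([10.0, 0.5, 1e-5])          # toy per-variable Hessian diagonal
print(prior_regularization_weight(hess))    # small weight where the Hessian is large
print(focal_is_observable(1e-5))            # False: freeze focal-length optimization
```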
These modifications to DROID-SLAM resulted in significant performance improvements across various scenes, including in-the-wild videos. However, MegaSAM still faces limitations with videos where moving objects dominate the frames and with videos exhibiting strong radial distortion or varying focal lengths, suggesting that there is still room for improvement in SaM. (For more explanations and details regarding implementation and evaluation, I highly recommend checking out the original paper cited at the bottom of the article. Side note: I find it interesting that both FoundationStereo and MegaSAM have similar architectures, where rich 4D representations are constructed and looked up to iteratively refine initial predictions, informed by good monocular priors, with a ConvGRU, despite some task-related differences.)
Conclusion
In this article, we introduced the advanced V-SLAM/SaM algorithms DROID-SLAM and MegaSAM, along with their prerequisite knowledge (the Gauss-Newton method, the Mahalanobis distance, and RAFT). There are some details and relevant concepts that we could not delve into (the Schur complement, the Levenberg-Marquardt algorithm, UniDepth, CasualSAM, system architectures, etc.), so I highly recommend checking out the resources cited below and other relevant resources.
Resources
- Aric LaBarr. 2023. What are Mahalanobis Distances. YouTube.
- Meerkat Statistics. 2021. Gauss Newton - Non Linear Least Squares. YouTube.
- Jitkomut Songsiri. 2021. Schur Complement. YouTube.
- Li, Z. et al. 2024. MegaSaM: Accurate, Fast, and Robust Structure and Motion from Casual Dynamic Videos. ArXiv.
- Piccinelli, L. et al. 2024. UniDepth: Universal Monocular Metric Depth Estimation. ArXiv.
- Teed, Z. & Deng, J. 2020. RAFT: Recurrent All-Pairs Field Transforms for Optical Flow. ArXiv.
- Teed, Z. & Deng, J. 2022. DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras. ArXiv.