Road to ML Engineer #66 - Feature Tracking

Last Edited: 7/10/2025

This blog post introduces feature tracking in computer vision.


In the previous articles, we discussed how deep learning models can tackle the limitations of single-view metrology and epipolar geometry and perform sufficiently well on monocular depth estimation and stereo matching. Hence, in this article, we will discuss feature tracking, which tracks meaningful features (typically objects and humans) across the frames of a video and is one of the computer vision tasks closely related to structure from motion.

Optical & Scene Flow

One useful quantity for various computer vision tasks involving video is optical flow, which is the 2D vector field describing the apparent motion displacement of all image pixels. It should not be confused with the motion field, which is the 2D vector field describing the 2D projection of the actual movement of 3D points in the scene. We can mathematically represent the apparent velocity of a pixel $(x, y)$ in the x and y axes as $u(x, y, t) = \frac{\Delta x}{\Delta t}$ and $v(x, y, t) = \frac{\Delta y}{\Delta t}$, respectively. These can then be used to express the optical flow vector of a pixel as $\vec{\text{u}} = [u, v]^T$. The Lucas-Kanade method utilizes various assumptions to solve for the optical flow vectors of a pixel patch using least squares.

$$
I(x, y, t) = I(x+\Delta x, y+\Delta y, t+\Delta t) \\
I(x+\Delta x, y+\Delta y, t+\Delta t) = I(x, y, t) + \frac{\partial I}{\partial x}\Delta x + \frac{\partial I}{\partial y}\Delta y + \frac{\partial I}{\partial t}\Delta t + \dots \\
\approx I(x, y, t) + \frac{\partial I}{\partial x}\Delta x + \frac{\partial I}{\partial y}\Delta y + \frac{\partial I}{\partial t}\Delta t \\
0 = \frac{\partial I}{\partial x}\Delta x + \frac{\partial I}{\partial y}\Delta y + \frac{\partial I}{\partial t}\Delta t \\
0 = I_x u + I_y v + I_t \quad (\text{dividing through by } \Delta t)
$$

The first two equations above are consequences of the brightness constancy assumption and the small motion assumption, respectively. The brightness constancy assumption states that the intensity of pixels belonging to the same object does not change across frames, and the small motion assumption states that the motion is small enough to be approximated by a first-order Taylor expansion. The subsequent derivations combine these two equations into the optical flow equation, which can be rearranged into a linear constraint, $-I_t = I_x u + I_y v = \nabla I \cdot \vec{\text{u}}$, where $\nabla I = [I_x, I_y]^T$ is the spatial gradient (strongest at edges). Although this looks like a solvable linear system, we have only one constraint for two unknowns, which leaves us able to solve only for the normal flow, whose magnitude is $\frac{-I_t}{\|\nabla I\|}$. This is why we can easily determine the pixel flow in the direction normal to edges, but it is difficult to obtain the flow parallel to edges (the aperture problem).
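To make the normal flow concrete, below is a minimal numpy sketch that computes its magnitude at every pixel. It assumes two consecutive grayscale frames `I0` and `I1` given as float arrays (hypothetical inputs) and approximates $I_t$ with a simple frame difference.

```python
import numpy as np

def normal_flow(I0, I1):
    """Normal flow magnitude -I_t / ||grad I|| at every pixel of frame I0."""
    Iy, Ix = np.gradient(I0)                  # spatial gradients (rows, cols)
    It = I1 - I0                              # temporal gradient with dt = 1
    grad_mag = np.sqrt(Ix**2 + Iy**2) + 1e-8  # avoid division by zero in flat regions
    return -It / grad_mag                     # flow component along the gradient direction
```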

To mitigate this limitation, the Lucas-Kanade method takes a semi-local approach with a spatial smoothness assumption, which assumes that neighboring pixels belong to the same surface and share the optical flow vector $\vec{\text{u}}$, allowing us to set up multiple constraints in various directions (sketched in code below). Specifically, we end up with $-I_t(p_i) = \nabla I(p_i) \cdot \vec{\text{u}}$ for each of the $N^2$ pixels $p_i = [x_i, y_i]^T$ in an $N \times N$ patch. Although increasing the number of pixels and thus constraints generally helps, we can still end up with an ambiguous optical flow vector when the patch's gradients all point in the same direction or the patch has low texture. Therefore, there have been attempts to use deep learning models to overcome these limitations of the Lucas-Kanade method and obtain high-quality optical flow, which is useful for feature tracking. An optical flow model can also be utilized for scene flow estimation, where we infer the 3D motion field of every point in the scene from the frames for subsequent tasks like autonomous driving and robotics.
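As a sketch of how the least-squares solve works, the following numpy function stacks the $N^2$ constraints of an $N \times N$ patch (with $N = 2n + 1$) and solves them; `I0`, `I1`, and the patch center `(x, y)` are hypothetical inputs.

```python
import numpy as np

def lucas_kanade_patch(I0, I1, x, y, n=7):
    """Estimate the flow vector [u, v] for the (2n+1) x (2n+1) patch centered at (x, y)."""
    Iy, Ix = np.gradient(I0)
    It = I1 - I0
    sl = (slice(y - n, y + n + 1), slice(x - n, x + n + 1))
    # One constraint -I_t(p_i) = grad I(p_i) . u per pixel p_i in the patch.
    A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)  # (N^2, 2)
    b = -It[sl].ravel()                                     # (N^2,)
    # Least squares; A^T A is ill-conditioned exactly in the low-texture
    # and single-gradient-direction cases discussed above.
    u, *_ = np.linalg.lstsq(A, b, rcond=None)
    return u
```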

Bayes Filter

Though optical flow and scene flow may be useful for tracking apparent and short movements of an object, they are not suitable for feature tracking involving occlusions and predictions based on past trajectories and motion patterns, which is often required in autonomous driving and robotics. Hence, we can use a partially observable Markov decision process (POMDP) to model feature tracking with long-term dependence. Here, control inputs $U$ may represent the camera's position and motion, states $X$ represent the object's true positions, velocities, and so on, and observations $Z$ are the results of object detections. (If you are unfamiliar with MDPs and POMDPs, I recommend checking out the RL subseries of this ML series.) The goal is to estimate the distributions of the current state $x_t$ for all $t$, called posteriors, based on all other known variables.

[Figure: POMDP graphical model]

In performing state estimation, we can assume that the state transition model $p(x_t | x_{t-1}, u_t)$ (where we make the Markov assumption on the state) and the measurement model $p(z_t | x_t)$ (where we assume the observation depends only on the current state) are easily obtainable. Then, we can derive a recursive expression for the state posterior $p(x_t | z_{1:t}, u_{1:t})$ that depends only on the previous posterior, the current observation $z_t$, and the control input $u_t$. Specifically, we can utilize Bayes' rule and the conditional independence of observations to express $p(x_t | z_{1:t}, u_{1:t})$ in terms of the measurement model, a normalization factor, and the prior or predicted current posterior, $p(x_t | z_{1:t-1}, u_{1:t})$, and then leverage the Markov assumption to express that prior using only the previous posterior and the transition model (which depends on the previous state and control input).

$$
p(x_t | z_{1:t}, u_{1:t}) = \frac{p(z_t | x_t)\, p(x_t | z_{1:t-1}, u_{1:t})}{p(z_t | z_{1:t-1}, u_{1:t})} \\
p(x_t | z_{1:t-1}, u_{1:t}) = \int_{x_{t-1}} p(x_t, x_{t-1} | z_{1:t-1}, u_{1:t})\, dx_{t-1} \\
= \int_{x_{t-1}} p(x_t | x_{t-1}, z_{1:t-1}, u_{1:t})\, p(x_{t-1} | z_{1:t-1}, u_{1:t})\, dx_{t-1} \\
= \int_{x_{t-1}} p(x_t | x_{t-1}, u_t)\, p(x_{t-1} | z_{1:t-1}, u_{1:t-1})\, dx_{t-1}
$$

The above derivations unveil the recursive relationship between state posteriors. To determine the current state posterior, we first compute the prior or predicted current posterior (without the observation $z_t$) by taking the integral of the product of the state transition model and the previous state posterior (2nd equation). Then, we filter the predicted current posterior by multiplying it with the measurement model and dividing by a normalization constant (1st equation). The filtered current state posterior can then serve as the previous posterior in the prediction for $t+1$. The first step, predicting the current posterior without a new observation, is called the prediction step; the second step, incorporating the new observation and filtering the prediction of the current posterior, is called the update step; and the recursive application of these steps using Bayes' rule to perform state estimation is called the Bayes filter.
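For intuition, here is a minimal sketch of one Bayes filter cycle over a discrete state space (where the integral becomes a sum), assuming the transition and measurement models are given as a matrix and a vector:

```python
import numpy as np

def bayes_filter_step(belief, transition, likelihood):
    """One predict + update cycle of a discrete Bayes filter.

    belief:      p(x_{t-1} | z_{1:t-1}, u_{1:t-1}) as a vector over discrete states
    transition:  transition[i, j] = p(x_t = i | x_{t-1} = j, u_t) for the chosen u_t
    likelihood:  likelihood[i] = p(z_t | x_t = i) for the observed z_t
    """
    # Prediction step: sum over the previous state (the discrete integral).
    predicted = transition @ belief        # p(x_t | z_{1:t-1}, u_{1:t})
    # Update step: multiply by the measurement model and normalize.
    posterior = likelihood * predicted
    return posterior / posterior.sum()     # p(x_t | z_{1:t}, u_{1:t})
```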

Kalman Filter

Although the Bayes filter is simple and generalizes to any probability distribution, we often cannot compute the normalization constant, which requires knowing the entire probability distribution. Hence, in practice, we often assume a specific kind of system and adapt the Bayes filter to it. One common practical adaptation of the Bayes filter is the Kalman filter, which assumes a linear-Gaussian system with noise. Linear-Gaussian systems assume that all random variables (states, observations, inputs, noises) are multivariate Gaussians and that the state transition and measurement models are linear, as follows.

$$
x_t = A_t x_{t-1} + B_t u_t + w_t \\
z_t = C_t x_t + v_t
$$

Here, $A_t$, $B_t$, and $C_t$ are assumed to be known, and $w_t \sim N(0, Q_t)$ and $v_t \sim N(0, R_t)$ are random Gaussian noises, called process and measurement noise, respectively, where $Q_t$ and $R_t$ are also known. By assuming all random variables are multivariate Gaussians, we only need to keep track of means and covariances to represent the entire probability distributions. Also, the linearity of the system guarantees that subsequent variables remain Gaussian, making it practical to use in the real world (many object movements can be described as linear transformations). Here, the initial state is $x_0 \sim N(\mu_{0|0}, \Sigma_{0|0})$, where the mean and covariance are known, and we would like to obtain the state posterior $p(x_t | z_{1:t}, u_{1:t}) = N(\mu_{t|t}, \Sigma_{t|t})$ for all $t$. Using these models and assumptions, we can derive the prediction step for obtaining the predicted current posterior without the observation, $p(x_t | z_{1:t-1}, u_{1:t}) = N(\mu_{t|t-1}, \Sigma_{t|t-1})$, from the state transition as follows.

$$
\mu_{t|t-1} = A_t \mu_{t-1|t-1} + B_t u_t \\
\Sigma_{t|t-1} = A_t \Sigma_{t-1|t-1} A_t^T + Q_t
$$

The first equation does not include $w_t$ because it has zero mean and does not affect $\mu_{t|t-1}$. The second equation propagates the covariance through the linear map via $A_t \Sigma_{t-1|t-1} A_t^T$, following the identity $\text{Cov}(Ax) = A \Sigma A^T$, and then adds $Q_t$. Since the noise covariance $Q_t$ is positive definite, the added process noise inflates the covariance, making $\Sigma_{t|t-1}$ generally larger than $\Sigma_{t-1|t-1}$. This makes intuitive sense: the covariance grows after the prediction step because we are predicting the current posterior without using the current observation $z_t$, unlike the previous posterior, which takes the previous observation into account. In the next update step, we can use $\mu_{t|t-1}$, $\Sigma_{t|t-1}$, and the measurement model to filter the current posterior as follows.

$$
\mu_{t|t} = \mu_{t|t-1} + K_t(z_t - C_t \mu_{t|t-1}) \\
\Sigma_{t|t} = \Sigma_{t|t-1} - K_t C_t \Sigma_{t|t-1} \\
K_t = \Sigma_{t|t-1} C_t^T (C_t \Sigma_{t|t-1} C_t^T + R_t)^{-1} \quad \left(\text{informally, } = C_t^{-1}\,\frac{C_t \Sigma_{t|t-1} C_t^T}{C_t \Sigma_{t|t-1} C_t^T + R_t}\right)
$$

Here, $z_t - C_t \mu_{t|t-1}$, called the innovation, shrinks as the observation and the prediction agree more, keeping $\mu_{t|t}$ close to $\mu_{t|t-1}$. $K_t$ is the Kalman gain, which compares the covariance with and without the measurement noise $R_t$, and ranges from $0$ to $C_t^{-1}$ as the noise decreases. When the noise is large or $\Sigma_{t|t-1}$ is small, $K_t$ approaches $0$, making $\mu_{t|t}$ and $\Sigma_{t|t}$ depend more on the reliable results from the prediction step ($\mu_{t|t-1}$ and $\Sigma_{t|t-1}$) instead of $z_t$, which presumably has high noise. On the other hand, when the noise is small or $\Sigma_{t|t-1}$ is large, $K_t$ approaches $C_t^{-1}$, making $\mu_{t|t}$ depend more on the inverted observation ($C_t^{-1} z_t$) and making $\Sigma_{t|t}$ approach zero due to the use of the more reliable observation. Notice that the covariance always decreases after the update step as we incorporate the observation, so we alternate between increasing the covariance in the prediction step and decreasing it in the update step as we apply them recursively.
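Putting the two steps together, below is a minimal numpy sketch of the full filter on a hypothetical 1D constant-velocity tracking model (state = position and velocity, with only the position observed). The model matrices, noise levels, and the `simulate` helper are illustrative assumptions, not from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])  # state transition A_t (time-invariant here)
B = np.array([[0.0], [dt]])            # control matrix B_t (u_t is an acceleration)
C = np.array([[1.0, 0.0]])             # measurement C_t: we only observe position
Q = 0.01 * np.eye(2)                   # process noise covariance Q_t
R = np.array([[0.25]])                 # measurement noise covariance R_t

def simulate(x0, us):
    """Roll out x_t = A x_{t-1} + B u_t + w_t and z_t = C x_t + v_t."""
    xs, zs, x = [], [], x0
    for u in us:
        x = A @ x + B @ u + rng.multivariate_normal(np.zeros(2), Q)
        zs.append(C @ x + rng.multivariate_normal(np.zeros(1), R))
        xs.append(x)
    return np.array(xs), np.array(zs)

def kf_predict(mu, Sigma, u):
    """Prediction step: returns the parameters of p(x_t | z_{1:t-1}, u_{1:t})."""
    mu_pred = A @ mu + B @ u           # w_t has zero mean, so it drops out of the mean
    Sigma_pred = A @ Sigma @ A.T + Q   # Cov(Ax) = A Sigma A^T, inflated by Q
    return mu_pred, Sigma_pred

def kf_update(mu_pred, Sigma_pred, z):
    """Update step: filter the predicted posterior with the new observation z_t."""
    S = C @ Sigma_pred @ C.T + R               # innovation covariance
    K = Sigma_pred @ C.T @ np.linalg.inv(S)    # Kalman gain K_t
    mu = mu_pred + K @ (z - C @ mu_pred)       # correct the mean by the innovation
    Sigma = Sigma_pred - K @ C @ Sigma_pred    # covariance never increases here
    return mu, Sigma

# Alternate the two steps recursively over a simulated trajectory.
us = [np.array([0.1])] * 50
xs, zs = simulate(np.zeros(2), us)
mu, Sigma = np.zeros(2), np.eye(2)             # N(mu_{0|0}, Sigma_{0|0})
for u, z in zip(us, zs):
    mu, Sigma = kf_predict(mu, Sigma, u)
    mu, Sigma = kf_update(mu, Sigma, z)
```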

Extended Kalman Filter (EKF) & Particle Filter

Although the assumptions of linear-Gaussian systems are powerful, many real-world scenarios violate them. For systems with non-linear state transition and measurement models, we can use the Extended Kalman Filter (EKF) instead, which linearizes the non-linear functions with a first-order Taylor approximation around the current mean estimates, ensuring that the predicted posteriors remain Gaussian. Specifically, when the state transition model is $g(u_t, x_{t-1}) + w_t$ and the measurement model is $h(x_t) + v_t$, where $g$ and $h$ are non-linear, we perform the following linear approximation.

$$
G_t = \left.\frac{\partial g(u_t, x)}{\partial x}\right|_{x = \mu_{t-1|t-1}} \\
H_t = \left.\frac{\partial h(x)}{\partial x}\right|_{x = \mu_{t|t-1}} \\
g(u_t, x_{t-1}) \approx g(u_t, \mu_{t-1|t-1}) + G_t(x_{t-1} - \mu_{t-1|t-1}) \\
h(x_t) \approx h(\mu_{t|t-1}) + H_t(x_t - \mu_{t|t-1})
$$

Here, $G_t$ and $H_t$ are Jacobian matrices containing the respective partial derivatives. Using this, we can substitute the linear parts of the mean estimation in the prediction and update steps ($A_t \mu_{t-1|t-1} + B_t u_t$ and $C_t \mu_{t|t-1}$) with the non-linear functions ($g(u_t, \mu_{t-1|t-1})$ and $h(\mu_{t|t-1})$), and replace $A_t$ and $C_t$ (along with their appearances in the covariance and Kalman gain computations) with $G_t$ and $H_t$ in both steps. When the random variables are non-Gaussian, we can use the non-parametric approach of a particle filter, which relies on sampling, unlike the parametric approach of the Kalman filter, which tracks the parameters of the distributions. Specifically, a particle filter samples particles from the state transition model (propagating the particles from the previous iteration through the model), computes importance weights for the particles using the measurement model, and resamples the particles based on the importance weights (see the resources cited at the bottom of the article for more details, and the sketch below).
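Where analytic Jacobians are tedious, a finite-difference approximation can stand in for $G_t$ and $H_t$; this is just an illustrative numerical shortcut, not part of the EKF derivation itself.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian of f at x; a stand-in for the analytic G_t or H_t."""
    fx = np.atleast_1d(f(x))
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (np.atleast_1d(f(x + dx)) - fx) / eps
    return J

# With hypothetical non-linear models g (transition) and h (measurement):
#   G = numerical_jacobian(lambda x: g(u_t, x), mu_prev)  # evaluated at mu_{t-1|t-1}
#   H = numerical_jacobian(h, mu_pred)                    # evaluated at mu_{t|t-1}
```

And here is a minimal sketch of one cycle of a basic (bootstrap) particle filter, where `transition_sample` and `likelihood` are assumed, hypothetical callables implementing the two models:

```python
def particle_filter_step(particles, weights, u, z, transition_sample, likelihood, rng):
    """One propagate-weight-resample cycle of a bootstrap particle filter.

    transition_sample(x, u, rng) draws x_t ~ p(x_t | x_{t-1}=x, u_t=u);
    likelihood(z, x) evaluates p(z_t = z | x_t = x); both are assumed given.
    """
    # 1. Propagate every particle through the state transition model.
    particles = np.array([transition_sample(p, u, rng) for p in particles])
    # 2. Reweight each particle by how well it explains the new observation.
    weights = weights * np.array([likelihood(z, p) for p in particles])
    weights = weights / weights.sum()
    # 3. Resample in proportion to the importance weights (weights reset to uniform).
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```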

Differentiable Filters

While Kalman filters and other Bayes filters are highly performant in many contexts, including feature tracking (ByteTrack by Zhang, Y. et al. (2021) achieved state-of-the-art tracking performance at the time by applying a simple Kalman filter to both high- and low-confidence boxes), they all share the assumption that the noise parameters and the state transition and measurement models ($A_t$, $B_t$, $C_t$, $Q_t$, $R_t$) are easily accessible. This is often not the case in real-world scenarios. Hence, researchers have developed deep learning approaches that aim to learn these parameters from training datasets. One such approach is Backprop KF (BKF), which embeds the structure of a Kalman filter within a recurrent unit.

[Figure: Backprop KF (BKF) architecture]

As shown above, BKF has an encoder that processes the raw observation into an intermediate observation $z_t$ and a matrix $L_t$, which is used to compute the measurement noise covariance $R_t$. The recurrent part of BKF then takes a prediction step using $\mu_{t-1|t-1}$ from the previous unit and a learnable transition matrix $A$, and performs an update step using $\Sigma_{t-1|t-1}$ and other learnable parameters. The model is trained using the L2 loss between the decoded state estimate $C \mu_{t|t}$ and the label $y_t$. BKF achieved better performance than other recurrent units, demonstrating the effectiveness of probabilistic deep learning approaches (as also demonstrated by diffusion models and reinforcement learning). There is also a deep learning approach to particle filters, whose details can be found in the original paper cited at the bottom of the article.
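To make the idea concrete, here is a heavily simplified PyTorch sketch of a BKF-style cell. This is my own simplification rather than the paper's exact architecture: the encoder is a single linear layer, there is no control input, and $A$, $C$, and the process noise factor are learned directly.

```python
import torch
import torch.nn as nn

class BackpropKFCell(nn.Module):
    """Sketch of a BKF-style cell: a differentiable Kalman predict/update whose
    measurement noise R_t is predicted per step by a learned encoder."""

    def __init__(self, state_dim, obs_dim, feat_dim):
        super().__init__()
        self.A = nn.Parameter(torch.eye(state_dim))                   # learned transition
        self.C = nn.Parameter(torch.randn(obs_dim, state_dim) * 0.1)  # learned measurement map
        self.Lq = nn.Parameter(torch.eye(state_dim) * 0.1)            # Q = Lq Lq^T stays PSD
        self.obs_dim = obs_dim
        self.encoder = nn.Linear(feat_dim, obs_dim + obs_dim * obs_dim)

    def forward(self, feat, mu, Sigma):
        out = self.encoder(feat)
        z = out[: self.obs_dim]                              # intermediate observation z_t
        L = out[self.obs_dim :].view(self.obs_dim, self.obs_dim)
        R = L @ L.T + 1e-3 * torch.eye(self.obs_dim)         # predicted noise R_t = L_t L_t^T
        # Differentiable predict + update: gradients flow through the whole filter.
        mu_pred = self.A @ mu
        Sigma_pred = self.A @ Sigma @ self.A.T + self.Lq @ self.Lq.T
        S = self.C @ Sigma_pred @ self.C.T + R
        K = Sigma_pred @ self.C.T @ torch.linalg.inv(S)
        mu = mu_pred + K @ (z - self.C @ mu_pred)
        Sigma = Sigma_pred - K @ self.C @ Sigma_pred
        return mu, Sigma

# Training would minimize the L2 loss between C @ mu and the label y_t at each step.
```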

Conclusion

In this article, we discussed optical and scene flow for apparent 2D and 3D motion displacements, and Bayes filters and differentiable filters, which are effective for feature tracking. For both optical flow and Bayes filters, we discussed how deep learning can complement or improve the analytical approaches by overcoming their limitations. For more details on each topic covered in this article, I recommend checking out the resources cited below.

Resources