Road to ML Engineer #66 - Feature Tracking

Last Edited: 7/10/2025

This blog post introduces feature tracking in computer vision.


In the previous articles, we discussed how deep learning models can tackle the limitations of single-view metrology and epipolar geometry and perform sufficiently well on monocular depth estimation and stereo matching. Hence, in this article, we will discuss feature tracking, which tracks meaningful features (typically objects and humans) across the frames of a video and is one of the computer vision tasks closely related to structure from motion.

Optical & Scene Flow

One useful quantity for various computer vision tasks involving video is optical flow, which is the 2D vector field describing the apparent motion displacement of all image pixels. It should not be confused with the motion field, which is the 2D vector field describing the 2D projection of the actual movement of 3D points in the scene. We can mathematically represent the apparent velocity of a pixel $(x, y)$ in the x and y axes as $u(x, y, t) = \frac{\Delta x}{\Delta t}$ and $v(x, y, t) = \frac{\Delta y}{\Delta t}$, respectively. These can then be used to express the optical flow vector of a pixel as $\vec{\text{u}} = [u, v]^T$. The Lucas-Kanade method utilizes various assumptions to solve for the optical flow vectors of a pixel patch using least squares.

$$
I(x, y, t) = I(x+\Delta x, y+\Delta y, t+\Delta t) \\
I(x+\Delta x, y+\Delta y, t+\Delta t) = I(x, y, t) + \frac{\partial I}{\partial x}\Delta x + \frac{\partial I}{\partial y}\Delta y + \frac{\partial I}{\partial t}\Delta t + \dots \\
\approx I(x, y, t) + \frac{\partial I}{\partial x}\Delta x + \frac{\partial I}{\partial y}\Delta y + \frac{\partial I}{\partial t}\Delta t \\
0 = \frac{\partial I}{\partial x}\Delta x + \frac{\partial I}{\partial y}\Delta y + \frac{\partial I}{\partial t}\Delta t \\
0 = I_x u + I_y v + I_t \quad (\text{dividing through by } \Delta t)
$$

The first two equations above are consequences of the brightness constancy assumption and the small motion assumption, respectively. The brightness constancy assumption states that the intensity of pixels belonging to the same object does not change across frames, and the small motion assumption states that the motion is small enough to be approximated by a first-order Taylor expansion. The subsequent derivations combine these two equations into the optical flow equation, which can be rearranged into a linear constraint, $-I_t = I_x u + I_y v = \nabla I \cdot \vec{\text{u}}$, where $\nabla I = [I_x, I_y]^T$ is the spatial gradient (strongest at edges). Although this looks like a solvable linear system, we have only one constraint for two unknowns, which leaves us able to solve only for the normal flow, whose magnitude is $\frac{-I_t}{\|\nabla I\|}$. This is why we can easily determine the pixel flow in the direction normal to edges, but it is difficult to obtain the flow parallel to edges (the aperture problem).
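To make the normal flow concrete, below is a minimal numpy sketch that computes its magnitude at every pixel. It assumes two consecutive grayscale frames `I0` and `I1` given as float arrays (hypothetical inputs) and approximates $I_t$ with a simple frame difference.

```python
import numpy as np

def normal_flow(I0, I1):
    """Normal flow magnitude -I_t / ||grad I|| at every pixel of frame I0."""
    Iy, Ix = np.gradient(I0)                  # spatial gradients (rows, cols)
    It = I1 - I0                              # temporal gradient with dt = 1
    grad_mag = np.sqrt(Ix**2 + Iy**2) + 1e-8  # avoid division by zero in flat regions
    return -It / grad_mag                     # flow component along the gradient direction
```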

To mitigate this limitation, the Lucas-Kanade method takes a semi-local approach with a spatial smoothness assumption, which assumes that neighboring pixels belong to the same surface and share the optical flow vector $\vec{\text{u}}$, allowing us to set up multiple constraints in various directions (sketched in code below). Specifically, we end up with $-I_t(p_i) = \nabla I(p_i) \cdot \vec{\text{u}}$ for each of the $N^2$ pixels $p_i = [x_i, y_i]^T$ in an $N \times N$ patch. Although increasing the number of pixels and thus constraints generally helps, we can still end up with an ambiguous optical flow vector when the patch's gradients all point in the same direction or the patch has low texture. Therefore, there have been attempts to use deep learning models to overcome these limitations of the Lucas-Kanade method and obtain high-quality optical flow, which is useful for feature tracking. An optical flow model can also be utilized for scene flow estimation, where we infer the 3D motion field of every point in the scene from the frames for subsequent tasks like autonomous driving and robotics.
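As a sketch of how the least-squares solve works, the following numpy function stacks the $N^2$ constraints of an $N \times N$ patch (with $N = 2n + 1$) and solves them; `I0`, `I1`, and the patch center `(x, y)` are hypothetical inputs.

```python
import numpy as np

def lucas_kanade_patch(I0, I1, x, y, n=7):
    """Estimate the flow vector [u, v] for the (2n+1) x (2n+1) patch centered at (x, y)."""
    Iy, Ix = np.gradient(I0)
    It = I1 - I0
    sl = (slice(y - n, y + n + 1), slice(x - n, x + n + 1))
    # One constraint -I_t(p_i) = grad I(p_i) . u per pixel p_i in the patch.
    A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)  # (N^2, 2)
    b = -It[sl].ravel()                                     # (N^2,)
    # Least squares; A^T A is ill-conditioned exactly in the low-texture
    # and single-gradient-direction cases discussed above.
    u, *_ = np.linalg.lstsq(A, b, rcond=None)
    return u
```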

Bayes Filter

Though optical flow and scene flow may be useful for tracking apparent and short movements of an object, they are not suitable for feature tracking involving occlusions and predictions based on past trajectories and motion patterns, which is often required in autonomous driving and robotics. Hence, we can use a partially observable Markov decision process (POMDP) to model feature tracking with long-term dependence. Here, control inputs $U$ may represent the camera's position and motion, states $X$ represent the object's true positions, velocities, and so on, and observations $Z$ are the results of object detections. (If you are unfamiliar with MDPs and POMDPs, I recommend checking out the RL subseries of this ML series.) The goal is to estimate the distributions of the current state $x_t$ for all $t$, called posteriors, based on all other known variables.

[Figure: POMDP graphical model]

In performing state estimation, we can assume that the state transition model $p(x_t | x_{t-1}, u_t)$ (where we make the Markov assumption on the state) and the measurement model $p(z_t | x_t)$ (where we assume the observation depends only on the current state) are easily obtainable. Then, we can derive a recursive expression for the state posterior $p(x_t | z_{1:t}, u_{1:t})$ that depends only on the previous posterior, the current observation $z_t$, and the control input $u_t$. Specifically, we can utilize Bayes' rule and the conditional independence of observations to express $p(x_t | z_{1:t}, u_{1:t})$ in terms of the measurement model, a normalization factor, and the prior or predicted current posterior, $p(x_t | z_{1:t-1}, u_{1:t})$, and then leverage the Markov assumption to express that prior using only the previous posterior and the transition model (which depends on the previous state and control input).

$$
p(x_t | z_{1:t}, u_{1:t}) = \frac{p(z_t | x_t)\, p(x_t | z_{1:t-1}, u_{1:t})}{p(z_t | z_{1:t-1}, u_{1:t})} \\
p(x_t | z_{1:t-1}, u_{1:t}) = \int_{x_{t-1}} p(x_t, x_{t-1} | z_{1:t-1}, u_{1:t})\, dx_{t-1} \\
= \int_{x_{t-1}} p(x_t | x_{t-1}, z_{1:t-1}, u_{1:t})\, p(x_{t-1} | z_{1:t-1}, u_{1:t})\, dx_{t-1} \\
= \int_{x_{t-1}} p(x_t | x_{t-1}, u_t)\, p(x_{t-1} | z_{1:t-1}, u_{1:t-1})\, dx_{t-1}
$$

The above derivations unveil the recursive relationship between state posteriors. To determine the current state posterior, we first compute the prior or predicted current posterior (without the observation $z_t$) by taking the integral of the product of the state transition model and the previous state posterior (2nd equation). Then, we filter the predicted current posterior by multiplying it with the measurement model and dividing by a normalization constant (1st equation). The filtered current state posterior can then serve as the previous posterior in the prediction for $t+1$. The first step, predicting the current posterior without a new observation, is called the prediction step; the second step, incorporating the new observation and filtering the prediction of the current posterior, is called the update step; and the recursive application of these steps using Bayes' rule to perform state estimation is called the Bayes filter.
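For intuition, here is a minimal sketch of one Bayes filter cycle over a discrete state space (where the integral becomes a sum), assuming the transition and measurement models are given as a matrix and a vector:

```python
import numpy as np

def bayes_filter_step(belief, transition, likelihood):
    """One predict + update cycle of a discrete Bayes filter.

    belief:      p(x_{t-1} | z_{1:t-1}, u_{1:t-1}) as a vector over discrete states
    transition:  transition[i, j] = p(x_t = i | x_{t-1} = j, u_t) for the chosen u_t
    likelihood:  likelihood[i] = p(z_t | x_t = i) for the observed z_t
    """
    # Prediction step: sum over the previous state (the discrete integral).
    predicted = transition @ belief        # p(x_t | z_{1:t-1}, u_{1:t})
    # Update step: multiply by the measurement model and normalize.
    posterior = likelihood * predicted
    return posterior / posterior.sum()     # p(x_t | z_{1:t}, u_{1:t})
```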

Kalman Filter

Although the Bayes filter is simple and generalizes to any probability distribution, we often cannot compute the normalization constant, which requires knowing the entire probability distribution. Hence, in practice, we often assume a specific kind of system and adapt the Bayes filter to it. One common practical adaptation of the Bayes filter is the Kalman filter, which assumes a linear-Gaussian system with noise. Linear-Gaussian systems assume that all random variables (states, observations, inputs, noises) are multivariate Gaussians and that the state transition and measurement models are linear, as follows.

$$
x_t = A_t x_{t-1} + B_t u_t + w_t \\
z_t = C_t x_t + v_t
$$

Here, $A_t$, $B_t$, and $C_t$ are assumed to be known, and $w_t \sim N(0, Q_t)$ and $v_t \sim N(0, R_t)$ are random Gaussian noises, called process and measurement noise, respectively, where $Q_t$ and $R_t$ are also known. By assuming all random variables are multivariate Gaussians, we only need to keep track of means and covariances to represent the entire probability distributions. Also, the linearity of the system guarantees that subsequent variables remain Gaussian, making it practical to use in the real world (many object movements can be described as linear transformations). Here, the initial state is $x_0 \sim N(\mu_{0|0}, \Sigma_{0|0})$, where the mean and covariance are known, and we would like to obtain the state posterior $p(x_t | z_{1:t}, u_{1:t}) = N(\mu_{t|t}, \Sigma_{t|t})$ for all $t$. Using these models and assumptions, we can derive the prediction step for obtaining the predicted current posterior without the observation, $p(x_t | z_{1:t-1}, u_{1:t}) = N(\mu_{t|t-1}, \Sigma_{t|t-1})$, from the state transition as follows.

$$
\mu_{t|t-1} = A_t \mu_{t-1|t-1} + B_t u_t \\
\Sigma_{t|t-1} = A_t \Sigma_{t-1|t-1} A_t^T + Q_t
$$

The first equation does not include $w_t$ because it has zero mean and does not affect $\mu_{t|t-1}$. The second equation propagates the covariance through the linear map via $A_t \Sigma_{t-1|t-1} A_t^T$, following the identity $\text{Cov}(Ax) = A \Sigma A^T$, and then adds $Q_t$. Since the noise covariance $Q_t$ is positive definite, the added process noise inflates the covariance, making $\Sigma_{t|t-1}$ generally larger than $\Sigma_{t-1|t-1}$. This makes intuitive sense: the covariance grows after the prediction step because we are predicting the current posterior without using the current observation $z_t$, unlike the previous posterior, which takes the previous observation into account. In the next update step, we can use $\mu_{t|t-1}$, $\Sigma_{t|t-1}$, and the measurement model to filter the current posterior as follows.

$$
\mu_{t|t} = \mu_{t|t-1} + K_t(z_t - C_t \mu_{t|t-1}) \\
\Sigma_{t|t} = \Sigma_{t|t-1} - K_t C_t \Sigma_{t|t-1} \\
K_t = \Sigma_{t|t-1} C_t^T (C_t \Sigma_{t|t-1} C_t^T + R_t)^{-1} \quad \left(\text{informally, } = C_t^{-1}\,\frac{C_t \Sigma_{t|t-1} C_t^T}{C_t \Sigma_{t|t-1} C_t^T + R_t}\right)
$$

Here, $z_t - C_t \mu_{t|t-1}$, called the innovation, shrinks as the observation and the prediction agree more, keeping $\mu_{t|t}$ close to $\mu_{t|t-1}$. $K_t$ is the Kalman gain, which compares the covariance with and without the measurement noise $R_t$, and ranges from $0$ to $C_t^{-1}$ as the noise decreases. When the noise is large or $\Sigma_{t|t-1}$ is small, $K_t$ approaches $0$, making $\mu_{t|t}$ and $\Sigma_{t|t}$ depend more on the reliable results from the prediction step ($\mu_{t|t-1}$ and $\Sigma_{t|t-1}$) instead of $z_t$, which presumably has high noise. On the other hand, when the noise is small or $\Sigma_{t|t-1}$ is large, $K_t$ approaches $C_t^{-1}$, making $\mu_{t|t}$ depend more on the inverted observation ($C_t^{-1} z_t$) and making $\Sigma_{t|t}$ approach zero due to the use of the more reliable observation. Notice that the covariance always decreases after the update step as we incorporate the observation, so we alternate between increasing the covariance in the prediction step and decreasing it in the update step as we apply them recursively.
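Putting the two steps together, below is a minimal numpy sketch of the full filter on a hypothetical 1D constant-velocity tracking model (state = position and velocity, with only the position observed). The model matrices, noise levels, and the `simulate` helper are illustrative assumptions, not from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])  # state transition A_t (time-invariant here)
B = np.array([[0.0], [dt]])            # control matrix B_t (u_t is an acceleration)
C = np.array([[1.0, 0.0]])             # measurement C_t: we only observe position
Q = 0.01 * np.eye(2)                   # process noise covariance Q_t
R = np.array([[0.25]])                 # measurement noise covariance R_t

def simulate(x0, us):
    """Roll out x_t = A x_{t-1} + B u_t + w_t and z_t = C x_t + v_t."""
    xs, zs, x = [], [], x0
    for u in us:
        x = A @ x + B @ u + rng.multivariate_normal(np.zeros(2), Q)
        zs.append(C @ x + rng.multivariate_normal(np.zeros(1), R))
        xs.append(x)
    return np.array(xs), np.array(zs)

def kf_predict(mu, Sigma, u):
    """Prediction step: returns the parameters of p(x_t | z_{1:t-1}, u_{1:t})."""
    mu_pred = A @ mu + B @ u           # w_t has zero mean, so it drops out of the mean
    Sigma_pred = A @ Sigma @ A.T + Q   # Cov(Ax) = A Sigma A^T, inflated by Q
    return mu_pred, Sigma_pred

def kf_update(mu_pred, Sigma_pred, z):
    """Update step: filter the predicted posterior with the new observation z_t."""
    S = C @ Sigma_pred @ C.T + R               # innovation covariance
    K = Sigma_pred @ C.T @ np.linalg.inv(S)    # Kalman gain K_t
    mu = mu_pred + K @ (z - C @ mu_pred)       # correct the mean by the innovation
    Sigma = Sigma_pred - K @ C @ Sigma_pred    # covariance never increases here
    return mu, Sigma

# Alternate the two steps recursively over a simulated trajectory.
us = [np.array([0.1])] * 50
xs, zs = simulate(np.zeros(2), us)
mu, Sigma = np.zeros(2), np.eye(2)             # N(mu_{0|0}, Sigma_{0|0})
for u, z in zip(us, zs):
    mu, Sigma = kf_predict(mu, Sigma, u)
    mu, Sigma = kf_update(mu, Sigma, z)
```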

Extended Kalman Filter (EKF) & Particle Filter

Although the assumptions of linear-Gaussian systems are powerful, many real-world scenarios violate them. For systems with non-linear state transition and measurement models, we can use the Extended Kalman Filter (EKF) instead, which linearizes the non-linear functions with a first-order Taylor approximation around the current mean estimates, ensuring that the predicted posteriors remain Gaussian. Specifically, when the state transition model is $g(u_t, x_{t-1}) + w_t$ and the measurement model is $h(x_t) + v_t$, where $g$ and $h$ are non-linear, we perform the following linear approximation.

$$
G_t = \left.\frac{\partial g(u_t, x)}{\partial x}\right|_{x = \mu_{t-1|t-1}} \\
H_t = \left.\frac{\partial h(x)}{\partial x}\right|_{x = \mu_{t|t-1}} \\
g(u_t, x_{t-1}) \approx g(u_t, \mu_{t-1|t-1}) + G_t(x_{t-1} - \mu_{t-1|t-1}) \\
h(x_t) \approx h(\mu_{t|t-1}) + H_t(x_t - \mu_{t|t-1})
$$

Here, $G_t$ and $H_t$ are Jacobian matrices containing the respective partial derivatives. Using this, we can substitute the linear parts of the mean estimation in the prediction and update steps ($A_t \mu_{t-1|t-1} + B_t u_t$ and $C_t \mu_{t|t-1}$) with the non-linear functions ($g(u_t, \mu_{t-1|t-1})$ and $h(\mu_{t|t-1})$), and replace $A_t$ and $C_t$ (along with their appearances in the covariance and Kalman gain computations) with $G_t$ and $H_t$ in both steps. When the random variables are non-Gaussian, we can use the non-parametric approach of a particle filter, which relies on sampling, unlike the parametric approach of the Kalman filter, which tracks the parameters of the distributions. Specifically, a particle filter samples particles from the state transition model (propagating the particles from the previous iteration through the model), computes importance weights for the particles using the measurement model, and resamples the particles based on the importance weights (see the resources cited at the bottom of the article for more details, and the sketch below).
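Where analytic Jacobians are tedious, a finite-difference approximation can stand in for $G_t$ and $H_t$; this is just an illustrative numerical shortcut, not part of the EKF derivation itself.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian of f at x; a stand-in for the analytic G_t or H_t."""
    fx = np.atleast_1d(f(x))
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (np.atleast_1d(f(x + dx)) - fx) / eps
    return J

# With hypothetical non-linear models g (transition) and h (measurement):
#   G = numerical_jacobian(lambda x: g(u_t, x), mu_prev)  # evaluated at mu_{t-1|t-1}
#   H = numerical_jacobian(h, mu_pred)                    # evaluated at mu_{t|t-1}
```

And here is a minimal sketch of one cycle of a basic (bootstrap) particle filter, where `transition_sample` and `likelihood` are assumed, hypothetical callables implementing the two models:

```python
def particle_filter_step(particles, weights, u, z, transition_sample, likelihood, rng):
    """One propagate-weight-resample cycle of a bootstrap particle filter.

    transition_sample(x, u, rng) draws x_t ~ p(x_t | x_{t-1}=x, u_t=u);
    likelihood(z, x) evaluates p(z_t = z | x_t = x); both are assumed given.
    """
    # 1. Propagate every particle through the state transition model.
    particles = np.array([transition_sample(p, u, rng) for p in particles])
    # 2. Reweight each particle by how well it explains the new observation.
    weights = weights * np.array([likelihood(z, p) for p in particles])
    weights = weights / weights.sum()
    # 3. Resample in proportion to the importance weights (weights reset to uniform).
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```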

Differentiable Filters

While Kalman filters and other Bayes filters are highly performant in many contexts, including feature tracking (ByteTrack by Zhang, Y. et al. (2021) achieved state-of-the-art tracking performance at the time by applying a simple Kalman filter to both high- and low-confidence boxes), they all share the assumption that the noise parameters and the state transition and measurement models ($A_t$, $B_t$, $C_t$, $Q_t$, $R_t$) are easily accessible. This is often not the case in real-world scenarios. Hence, researchers have developed deep learning approaches that aim to learn these parameters from training datasets. One such approach is Backprop KF (BKF), which embeds the structure of a Kalman filter within a recurrent unit.

[Figure: Backprop KF (BKF) architecture]

As shown above, BKF has an encoder that processes the raw observation into an intermediate observation $z_t$ and a matrix $L_t$, which is used to compute the measurement noise covariance $R_t$. The recurrent part of BKF then takes a prediction step using $\mu_{t-1|t-1}$ from the previous unit and a learnable transition matrix $A$, and performs an update step using $\Sigma_{t-1|t-1}$ and other learnable parameters. The model is trained using the L2 loss between the decoded state estimate $C \mu_{t|t}$ and the label $y_t$. BKF achieved better performance than other recurrent units, demonstrating the effectiveness of probabilistic deep learning approaches (as also demonstrated by diffusion models and reinforcement learning). There is also a deep learning approach to particle filters, whose details can be found in the original paper cited at the bottom of the article.
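To make the idea concrete, here is a heavily simplified PyTorch sketch of a BKF-style cell. This is my own simplification rather than the paper's exact architecture: the encoder is a single linear layer, there is no control input, and $A$, $C$, and the process noise factor are learned directly.

```python
import torch
import torch.nn as nn

class BackpropKFCell(nn.Module):
    """Sketch of a BKF-style cell: a differentiable Kalman predict/update whose
    measurement noise R_t is predicted per step by a learned encoder."""

    def __init__(self, state_dim, obs_dim, feat_dim):
        super().__init__()
        self.A = nn.Parameter(torch.eye(state_dim))                   # learned transition
        self.C = nn.Parameter(torch.randn(obs_dim, state_dim) * 0.1)  # learned measurement map
        self.Lq = nn.Parameter(torch.eye(state_dim) * 0.1)            # Q = Lq Lq^T stays PSD
        self.obs_dim = obs_dim
        self.encoder = nn.Linear(feat_dim, obs_dim + obs_dim * obs_dim)

    def forward(self, feat, mu, Sigma):
        out = self.encoder(feat)
        z = out[: self.obs_dim]                              # intermediate observation z_t
        L = out[self.obs_dim :].view(self.obs_dim, self.obs_dim)
        R = L @ L.T + 1e-3 * torch.eye(self.obs_dim)         # predicted noise R_t = L_t L_t^T
        # Differentiable predict + update: gradients flow through the whole filter.
        mu_pred = self.A @ mu
        Sigma_pred = self.A @ Sigma @ self.A.T + self.Lq @ self.Lq.T
        S = self.C @ Sigma_pred @ self.C.T + R
        K = Sigma_pred @ self.C.T @ torch.linalg.inv(S)
        mu = mu_pred + K @ (z - self.C @ mu_pred)
        Sigma = Sigma_pred - K @ self.C @ Sigma_pred
        return mu, Sigma

# Training would minimize the L2 loss between C @ mu and the label y_t at each step.
```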

Conclusion

In this article, we discussed optical and scene flow for apparent 2D and 3D motion displacements, and Bayes filters and differentiable filters, which are effective for feature tracking. For both optical flow and Bayes filters, we discussed how deep learning can complement or improve the analytical approaches by overcoming their limitations. For more details on each topic covered in this article, I recommend checking out the resources cited below.

Resources