
In the previous article, we covered single-view metrology and saw its effectiveness in estimating the orientations of objects, as well as its limitations in estimating scales and positions. Although I briefly mentioned there that these inherent limitations apply to humans as well, this is only partially correct. They apply when we look at an image already captured by a camera, but not when we use our two eyes to capture more information and interpret the 3D world. In this article, we will mathematically analyze how having two eyes or cameras helps us understand more about the 3D world.
Epipolar Geometry
Epipolar geometry describes the relationships between two cameras (stereo vision), 3D points, and the projections of those points on the image planes. This helps us understand why having additional viewpoints uncovers additional information. The figure below shows the general setup of epipolar geometry involving two cameras observing the same 3D point $P$, projected to $p$ and $p'$ respectively.

$O_1$ and $O_2$ are the locations of the camera centers, and the line between them is the baseline, with length $B$. The plane containing the two camera centers and $P$ is called the epipolar plane. The intersections of the epipolar plane with the two image planes are called epipolar lines. By definition, the epipolar lines contain the projections of $P$ and the epipoles $e$ and $e'$, which are the intersections between the baseline and the image planes. In this setup, we are mainly concerned with first finding the corresponding projections $p$ and $p'$ by discovering the relationships between them, and then using those projections for estimating the position of $P$ and finding the camera parameters (for depth estimation and 3D reconstruction).
Essential & Fundamental Matrices
In this setup, we can further define camera projection matrices $M$ and $M'$, corresponding to $O_1$ and $O_2$, which can be expressed as $M = K[I \;\; 0]$ and $M' = K'[R \;\; T]$, assuming that the world coordinate system is set to the first camera and the second camera is offset by rotation $R$ and translation $T$. To unveil the relationship between the corresponding projections, we can temporarily simplify the setup by assuming canonical cameras where $K = K' = I$. Here, we can express $p$ in the second camera's coordinate system as $Rp + T$, and since both $Rp + T$ and the baseline $T$ (the position of $O_1$ in the second camera's frame) lie in the epipolar plane, taking the cross product $T \times (Rp + T) = T \times Rp$ yields a vector normal to the epipolar plane.
As $p'$ also lies in the epipolar plane, we arrive at the epipolar constraint $p'^T (T \times Rp) = 0$. From linear algebra, we can rewrite the cross product as a matrix multiplication using the skew-symmetric matrix $[T_\times] = \begin{bmatrix} 0 & -T_z & T_y \\ T_z & 0 & -T_x \\ -T_y & T_x & 0 \end{bmatrix}$, so that $T \times Rp = [T_\times]Rp$, to arrive at $p'^T E p = 0$, where $E = [T_\times]R$ is the essential matrix with 5 degrees of freedom (rotation and translation, up to scale) and rank 2 (a singular matrix mapping image points to epipolar lines), which is useful for computing the epipolar lines. For example, we can obtain the epipolar line of $p$ in the second image using $\ell' = Ep$ and the epipolar line of $p'$ in the first image using $\ell = E^T p'$. Also, the epipoles are mapped to the zero vector, $Ee = 0$ and $E^T e' = 0$, since every epipolar line passes through the epipole.
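To make this concrete, here is a minimal numerical sketch in Python with numpy; the rotation angle, baseline, and test point are arbitrary values chosen only for illustration:

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t_x] such that skew(t) @ v == np.cross(t, v)."""
    return np.array([
        [0, -t[2], t[1]],
        [t[2], 0, -t[0]],
        [-t[1], t[0], 0],
    ])

# Hypothetical relative pose: 5-degree rotation about the y-axis,
# baseline along the x-axis.
theta = np.deg2rad(5)
R = np.array([
    [np.cos(theta), 0, np.sin(theta)],
    [0, 1, 0],
    [-np.sin(theta), 0, np.cos(theta)],
])
T = np.array([0.5, 0.0, 0.0])

E = skew(T) @ R  # essential matrix E = [T_x] R

# Project a 3D point P (given in the first camera's frame) into both
# canonical cameras by dividing out the depth.
P = np.array([0.2, -0.1, 3.0])
p = P / P[2]              # p ~ [I 0] P
P2 = R @ P + T            # P in the second camera's frame
p_prime = P2 / P2[2]      # p' ~ [R T] P

print(p_prime @ E @ p)    # ~0: the epipolar constraint p'^T E p = 0
```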
Then, we can bring back the camera matrices by defining $p = Kp_c$ and $p' = K'p'_c$, where $p_c$ and $p'_c$ are the projections in the canonical cameras, so the constraint becomes $p'^T_c E p_c = 0$. From here, we can substitute back $p_c = K^{-1}p$ and $p'_c = K'^{-1}p'$ to arrive at $p'^T K'^{-T} [T_\times] R K^{-1} p = 0$, and obtain the fundamental matrix $F = K'^{-T} [T_\times] R K^{-1}$, which is the general version of the essential matrix with 7 degrees of freedom (the 5 of $E$ plus 2 from the camera matrices) and rank 2, like $E$. Similarly to the essential matrix, the fundamental matrix can also be used to obtain the epipolar lines of $p$ and $p'$ using $\ell' = Fp$ and $\ell = F^T p'$, and is useful for estimating the projection $p'$ from a different viewpoint without knowing $P$, solely by the constraint $p'^T F p = 0$ (view morphing and 3D reconstruction).
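Continuing the sketch above, we can assemble $F$ from $E$ and hypothetical intrinsics (the focal lengths and principal points below are made-up values) and check both the constraint and an epipolar line in pixel coordinates:

```python
# Hypothetical intrinsics (focal length in pixels, principal point at the
# center of a 640x480 image).
K1 = np.array([[700.0, 0, 320], [0, 700, 240], [0, 0, 1]])
K2 = np.array([[680.0, 0, 320], [0, 680, 240], [0, 0, 1]])

F = np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)  # F = K'^{-T} E K^{-1}

q = K1 @ p               # pixel coordinates of the projections
q_prime = K2 @ p_prime

print(q_prime @ F @ q)    # ~0: the constraint p'^T F p = 0
l_prime = F @ q           # epipolar line of p in the second image
print(q_prime @ l_prime)  # ~0: p' lies on that line

# The epipoles span the null spaces of F and F^T (F e = 0, F^T e' = 0).
e = np.linalg.svd(F)[2][-1]
e_prime = np.linalg.svd(F.T)[2][-1]
```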
Eight-Point Algorithm
Although $F$ is useful to some extent, we cannot simply assume that we have access to it. However, it is possible to estimate $F$ from two images of the same scene using the eight-point algorithm. We can identify 8 pairs of corresponding points $p_i$ and $p'_i$, each of which yields one linear equation $p'^T_i F p_i = 0$ in the nine entries of $F$ (defined up to scale), and solve the resulting system using Singular Value Decomposition (SVD), although the estimate $\hat{F}$ may have full rank. To obtain an $F$ that is singular with rank 2, we can compute a rank-2 approximation of $\hat{F}$ under the constraint $\det F = 0$ (the determinant of a singular matrix is 0), also using SVD.
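A minimal sketch of the unnormalized eight-point algorithm under these assumptions (numpy arrays of pixel coordinates, no handling of degenerate configurations):

```python
import numpy as np

def eight_point(pts1, pts2):
    """Estimate F from N >= 8 corresponding points (two Nx2 arrays)."""
    u, v = pts1[:, 0], pts1[:, 1]
    up, vp = pts2[:, 0], pts2[:, 1]
    # Each correspondence p'^T F p = 0 gives one row of W f = 0,
    # where f holds the nine entries of F (defined up to scale).
    W = np.column_stack([
        up * u, up * v, up,
        vp * u, vp * v, vp,
        u, v, np.ones(len(u)),
    ])
    # Least-squares solution: right singular vector of the smallest
    # singular value; this estimate may still have full rank.
    F_hat = np.linalg.svd(W)[2][-1].reshape(3, 3)
    # Enforce det F = 0 (rank 2) by zeroing the smallest singular value.
    U, S, Vt = np.linalg.svd(F_hat)
    S[2] = 0
    return U @ np.diag(S) @ Vt
```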
Though the eight-point algorithm works well in theory, it tends to have large errors in practice, especially when using modern cameras with large pixel ranges and when the selected points are relatively close together, since the system matrix then contains large, similar entries and becomes poorly conditioned. To address this issue, we often use the normalized eight-point algorithm, where the points $p_i$ and $p'_i$ are first normalized by matrices $T$ and $T'$ (translation and scaling) and then used to estimate $F_n$ (the normalized $F$). This can then be denormalized to obtain a quality estimate of $F$ ($F = T'^T F_n T$).
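A sketch of the normalization step, following the common choice of translating each point set to zero mean and scaling its average distance from the origin to $\sqrt{2}$; it reuses the eight_point function above:

```python
import numpy as np

def normalize(pts):
    """Return normalized Nx2 points and the 3x3 matrix T that maps
    the homogeneous originals onto them (translation, then scaling)."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2) / np.mean(np.linalg.norm(pts - mean, axis=1))
    T = np.array([
        [scale, 0, -scale * mean[0]],
        [0, scale, -scale * mean[1]],
        [0, 0, 1],
    ])
    return (pts - mean) * scale, T

def normalized_eight_point(pts1, pts2):
    n1, T1 = normalize(pts1)
    n2, T2 = normalize(pts2)
    Fn = eight_point(n1, n2)  # estimate on well-conditioned coordinates
    return T2.T @ Fn @ T1     # denormalize: F = T'^T F_n T
```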
Parallel Images
Although the essential and fundamental matrices for the general case are already useful, they become far simpler when the two image planes are parallel. Parallelism eliminates rotation from the matrices ($R = I$, with the translation along the horizontal axis) and reduces the constraint to $v = v'$, where $p = (u, v, 1)$ and $p' = (u', v', 1)$, as the short derivation below shows. The simplicity afforded by parallel images is useful for various problems in epipolar geometry (specifics will be discussed later), so we would prefer to acquire parallel images in advance, or even correct the images to be parallel via a process called rectification.
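As a quick check, using the convention $E = [T_\times]R$ from above with $R = I$ and the baseline along the horizontal axis, $T = (T_x, 0, 0)^T$:

$$E = [T_\times] = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & -T_x \\ 0 & T_x & 0 \end{bmatrix}, \qquad p'^T E p = \begin{bmatrix} u' & v' & 1 \end{bmatrix} \begin{bmatrix} 0 \\ -T_x \\ T_x v \end{bmatrix} = T_x (v - v') = 0 \;\Rightarrow\; v = v',$$

so corresponding points lie on the same horizontal scanline.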
Rectification is the problem of finding corresponding projective transformations of the two image planes that render them parallel; applying these transformations yields parallel images. Unfortunately, the details of rectification are outside the scope of this article, but it can be framed as an optimization problem over samples of corresponding points $p_i$ and $p'_i$ and solved using any suitable method.
Triangulation
Aside from the relationships between the corresponding points, the setup allows us to infer the position or depth of $P$, which was simply not possible with monocular vision alone. We can formulate $P$ as the intersection of the rays extending $O_1 p$ and $O_2 p'$, and when the camera parameters are known, infer its position by finding the estimate $\hat{P}$ that minimizes the total distance between $p$ and $M\hat{P}$ and between $p'$ and $M'\hat{P}$. This process of estimating the position of $P$ from the corresponding $p$ and $p'$ is called triangulation.
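That minimization is typically solved iteratively, but a common closed-form starting point is the linear (DLT) triangulation method; a minimal sketch, assuming the two projection matrices are known:

```python
import numpy as np

def triangulate(M1, M2, q1, q2):
    """Linear (DLT) triangulation of one point from two views.

    M1, M2: 3x4 projection matrices; q1, q2: 2D points in pixels.
    Minimizes the algebraic error ||A P|| subject to ||P|| = 1, which can
    then seed a nonlinear refinement of the reprojection distance.
    """
    A = np.array([
        q1[0] * M1[2] - M1[0],
        q1[1] * M1[2] - M1[1],
        q2[0] * M2[2] - M2[0],
        q2[1] * M2[2] - M2[1],
    ])
    P = np.linalg.svd(A)[2][-1]
    return P[:3] / P[3]  # back to inhomogeneous 3D coordinates
```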

Triangulation is difficult for non-parallel $p$ and $p'$ to some extent, but it is much easier for parallel images, since we know that $v = v'$ and the position can be determined only by comparing $u$ and $u'$. Specifically, the disparity, the distance between $p$ and $p'$ along the scanline, satisfies $u - u' = \frac{Bf}{z}$, where $B$ is the baseline, $f$ is the focal length, and $z$ is the depth. This means that the further away an object is and the shorter the baseline is, the less disparity we will observe. This explains why we (and many predators) have two forward-facing and relatively parallel eyes: they provide stereo vision and allow us to perform relative depth estimation via triangulation, and to do so relatively easily using disparity (while prey eyes prioritize a panoramic view for spotting potential danger).
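A quick numeric illustration of $z = Bf/(u - u')$, with a hypothetical 10 cm baseline and a 700-pixel focal length:

```python
B, f = 0.1, 700.0  # hypothetical baseline (m) and focal length (px)
for disparity in (70.0, 7.0, 1.0):
    print(f"{disparity:5.1f} px -> {B * f / disparity:5.1f} m")
# 70 px of disparity puts the point at 1 m; a single pixel puts it at 70 m,
# so distant points produce tiny disparities that are hard to measure.
```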
Correspondence Problem
We can notice here that all of the processes we have discussed so far (the eight-point algorithm for estimating $F$, rectification to create parallel images, and triangulation for depth estimation) involve knowing the corresponding projections $p$ and $p'$. However, it is not trivial for computers, and sometimes even for us, to determine these correspondences. For example, we can devise an algorithm that compares aggregates of neighboring pixels to find corresponding points, as sketched below. However, it will inevitably struggle with images containing homogeneous regions or repetitive patterns, and with images exhibiting different brightness and exposure. To address this issue, we need an algorithm that performs both local and global comparisons at a high level.
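As an illustration of such a local comparison, here is a minimal window-matching sketch for rectified grayscale images (2D numpy arrays) using a sum-of-squared-differences cost along one scanline; the function and its parameters are made up for this example, and border handling is omitted:

```python
import numpy as np

def match_scanline(left, right, row, u, half=5, max_disp=64):
    """Find the column in `right` whose window best matches the window
    around (row, u) in `left`, scanning the same row (rectified images)."""
    patch = left[row - half:row + half + 1, u - half:u + half + 1].astype(float)
    best_u, best_cost = None, np.inf
    for d in range(max_disp):
        up = u - d  # disparity shifts the match left in the right image
        if up - half < 0:
            break
        cand = right[row - half:row + half + 1, up - half:up + half + 1].astype(float)
        cost = np.sum((patch - cand) ** 2)  # SSD over the window
        if cost < best_cost:
            best_u, best_cost = up, cost
    return best_u
```

In a homogeneous region, many candidate windows produce nearly identical costs, and a global brightness change alters every cost, which is exactly why purely local matching is fragile.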

In addition to such difficulties, the correspondence problem has inherent limitations, such as occlusions and foreshortening, visualized above. These limitations are more pronounced when we are presented with a larger baseline-to-depth ratio $B/z$. However, a small $B/z$ implies a large error in depth estimation for small measurement errors, and it also implies an extremely small disparity for parallel cameras, which presents us with a baseline tradeoff: a longer baseline worsens the limitations of occlusions and foreshortening, while a shorter baseline implies larger error and difficulty in depth estimation. This explains why humans, with a fixed and relatively small baseline, need to use our knowledge of object size to infer depth, especially for far-away objects, even with stereo vision, and sometimes have difficulty estimating the depth of far-away objects with unfamiliar shapes and sizes.
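This tradeoff can be made concrete with a step left implicit above: differentiating the disparity relation $d = Bf/z$ shows how a small disparity (matching) error $\delta d$ propagates to depth:

$$\frac{\partial z}{\partial d} = -\frac{Bf}{d^2} = -\frac{z^2}{Bf} \quad\Rightarrow\quad |\delta z| \approx \frac{z^2}{Bf}\,|\delta d|,$$

so the depth error grows quadratically with depth but shrinks only linearly with a longer baseline.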
Conclusion
In this article, we discussed how introducing a second viewpoint gives rise to epipolar geometry, which lets us define relationships between the two views and perform view morphing and depth estimation from corresponding points to better understand the 3D world. However, we also discovered that all of this is possible only by first obtaining the corresponding points, which can be complicated and inherently limited, and we related these concepts to how our own vision works. In the next article, we will discuss potential ways of overcoming those difficulties and limitations.
Resources
- Hata, K. & Savarese, S. 2025. CS231A Course Notes 3: Epipolar Geometry. Stanford.
- Savarese, S. & Bohg, J. 2025. Lecture 5 Epipolar Geometry. Stanford.
- Savarese, S. & Bohg, J. 2025. Lecture 6 Stereo Systems Multi-view Geometry. Stanford.