Road to ML Engineer #61 - Epipolar Geometry

Last Edited: 6/12/2025

This blog post introduces epipolar geometry in computer vision.

ML

In the previous article, we covered single-view metrology and unveiled its effectiveness in estimating the orientations of objects as well as its limitations in estimating scales and positions. Although I briefly mentioned there that these inherent limitations apply to humans as well, this is only partially correct. They apply when we look at an image already captured by a camera, but not when we use our two eyes to capture more information and interpret the 3D world. In this article, we will mathematically analyze how having two eyes or cameras helps us understand more about the 3D world.

Epipolar Geometry

Epipolar geometry describes the relationships between two cameras (stereo vision), 3D points, and the projections of those points on the image planes. This helps us understand why having additional viewpoints uncovers additional information. The figure below shows the general setup of epipolar geometry involving two cameras observing the same 3D point $P$, projected to $p$ and $p'$ respectively.

Epipolar Geometry Setup

$O_1$ and $O_2$ are the locations of the camera centers, and the line between them is the baseline with distance $B$. The plane containing the two camera centers and $P$ is called the epipolar plane. The intersections of the epipolar plane with the two image planes are called epipolar lines. By definition, the epipolar lines contain the projections of $P$ and the epipoles $e$ and $e'$, which are the intersections between the baseline and the image planes. In this setup, we are mainly concerned with first finding the corresponding projections $p$ and $p'$ by discovering the relationships between them, and then using those projections to estimate the position of $P$ and find the camera parameters (for depth estimation and 3D reconstruction).

Essential & Fundamental Matrices

In this setup, we can further define camera projection matrices $M$ and $M'$, corresponding to $O_1$ and $O_2$, which can be expressed as $M = K[I \;\; 0]$ and $M' = K'[R^T \;\; -R^T T]$, assuming that the world coordinate system is set to the first camera and the second camera is offset by rotation $R$ and translation $T$. To unveil the relationship between the corresponding projections, we can temporarily simplify the setup by assuming canonical cameras where $K = K' = I$. Here, we can express $p'$ in the first camera's system as $Rp' + T$, and since both $Rp' + T$ and $T$ lie in the epipolar plane, taking the cross product $T \times (Rp' + T) = T \times Rp'$ yields a vector normal to the epipolar plane.

$$a \times b = \begin{bmatrix} 0 & -a_z & a_y \\ a_z & 0 & -a_x \\ -a_y & a_x & 0 \end{bmatrix} \begin{bmatrix} b_x \\ b_y \\ b_z \end{bmatrix} = [a_{\times}]b$$

As $p$ also lies in the epipolar plane, we arrive at the epipolar constraint $p^T (T \times Rp') = 0$. From linear algebra, we can use the alternative expression of the cross product as a matrix multiplication (shown above) to arrive at $p^T [T_{\times}] R p' = p^T E p' = 0$, where $E = [T_{\times}]R$ is the essential matrix with 5 degrees of freedom (3 for rotation and 3 for translation, minus 1 for overall scale) and rank 2 (a singular matrix mapping points in one image to lines in the other), which is useful for computing the epipolar lines. For example, we can obtain the epipolar line in the image of $O_2$ using $\ell' = E^T p$ and the epipolar line in the image of $O_1$ using $\ell = Ep'$. Also, the epipoles satisfy $Ee' = 0$ and $E^T e = 0$, since every epipolar line passes through the corresponding epipole.
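To make the constraint concrete, here is a minimal NumPy sketch that builds $E = [T_{\times}]R$ for a hypothetical relative pose (the values of $R$, $T$, and $P$ below are illustrative assumptions, not anything prescribed) and verifies that a pair of corresponding canonical projections satisfies $p^T E p' = 0$:

```python
import numpy as np

def skew(a):
    """Return the 3x3 skew-symmetric matrix [a_x] such that [a_x] b = a x b."""
    return np.array([
        [0.0, -a[2], a[1]],
        [a[2], 0.0, -a[0]],
        [-a[1], a[0], 0.0],
    ])

# Hypothetical relative pose between two canonical cameras (K = K' = I).
R = np.eye(3)                    # no rotation, for simplicity
T = np.array([0.5, 0.0, 0.0])    # translation along the x-axis

E = skew(T) @ R                  # essential matrix E = [T_x] R

# A 3D point in the first camera's frame and its two projections.
P = np.array([1.0, 2.0, 4.0])
p = P / P[2]             # projection in camera 1 (canonical, z = 1 plane)
P2 = R.T @ (P - T)       # the same point expressed in camera 2's frame
p2 = P2 / P2[2]          # projection in camera 2

print(p @ E @ p2)        # epipolar constraint p^T E p' -> 0.0
```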

Then, we can bring back the camera matrices $K$ and $K'$ by defining $p_c = K^{-1}p$ and $p_c' = K'^{-1}p'$, resulting in $p_c^T [T_{\times}] R p_c' = 0$. From here, we can substitute back $p$ and $p'$ to arrive at $p^T K^{-T} [T_{\times}] R K'^{-1} p' = p^T F p' = 0$ and obtain the fundamental matrix $F = K^{-T} [T_{\times}] R K'^{-1}$, which is the general version of the essential matrix with 7 degrees of freedom (a 3×3 matrix up to scale has 8, minus 1 for the rank-2 constraint) and rank 2 (again a singular matrix mapping points to lines). Similarly to the essential matrix, the fundamental matrix can be used to obtain the epipolar lines of $O_1$ and $O_2$ using $\ell = Fp'$ and $\ell' = F^T p$, and it constrains where the projection $p'$ can appear in the other view without knowing $P$ (useful for view morphing and 3D reconstruction).
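Extending the sketch above with hypothetical shared intrinsics $K$ (again, the numbers are assumptions chosen only for illustration), we can assemble $F = K^{-T}[T_{\times}]RK'^{-1}$ and check the constraint and epipolar lines in pixel coordinates:

```python
import numpy as np

# Relative pose reused from the essential-matrix sketch above.
R = np.eye(3)
T = np.array([0.5, 0.0, 0.0])
Tx = np.array([[0.0, -T[2], T[1]],
               [T[2], 0.0, -T[0]],
               [-T[1], T[0], 0.0]])

# Hypothetical intrinsics, shared by both cameras (K = K').
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

F = np.linalg.inv(K).T @ Tx @ R @ np.linalg.inv(K)  # F = K^-T [T_x] R K'^-1

# Pixel projections of the same 3D point P = (1, 2, 4) as before.
p = K @ np.array([0.25, 0.5, 1.0])     # image 1, from P / P_z
p2 = K @ np.array([0.125, 0.5, 1.0])   # image 2, from R^T (P - T)

print(p @ F @ p2)   # epipolar constraint in pixels -> 0.0
l = F @ p2          # epipolar line of O_1 passing through p
l2 = F.T @ p        # epipolar line of O_2 passing through p2
```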

Eight-Point Algorithm

Although $F$ is useful to some extent, we cannot simply assume that we have access to it. However, it is possible to estimate $F$ from two images of the same scene using the eight-point algorithm. We can identify 8 pairs of corresponding points $p$ and $p'$ and formulate a system of equations to estimate $\hat{F}$ using Singular Value Decomposition (SVD), though the resulting $\hat{F}$ may have full rank. To obtain an $F$ that is singular with rank 2, we can perform a rank-2 approximation of $\hat{F}$ under the constraint $\det(F) = 0$ (the determinant of a singular matrix is 0), also using SVD.
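Below is a minimal sketch of the eight-point algorithm as just described: build one linear equation per correspondence, solve with SVD, then enforce rank 2 by zeroing the smallest singular value. The arrays `pts1` and `pts2` are assumed to hold at least 8 corresponding pixel coordinates:

```python
import numpy as np

def eight_point(pts1, pts2):
    """Estimate F (convention p^T F p' = 0, p from image 1) from N >= 8
    corresponding points given as Nx2 arrays of pixel coordinates."""
    A = []
    for (u, v), (up, vp) in zip(pts1, pts2):
        # Each correspondence yields one linear equation in the entries of F.
        A.append([u * up, u * vp, u, v * up, v * vp, v, up, vp, 1.0])
    A = np.asarray(A)

    # Least-squares solution: the right singular vector for the smallest
    # singular value, reshaped into the (possibly full-rank) estimate F_hat.
    _, _, Vt = np.linalg.svd(A)
    F_hat = Vt[-1].reshape(3, 3)

    # Rank-2 approximation enforcing det(F) = 0: zero the smallest singular value.
    U, S, Vt = np.linalg.svd(F_hat)
    S[2] = 0.0
    return U @ np.diag(S) @ Vt
```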

Though the eight-point algorithm works well in theory, it tends to have large errors in practice, especially when using modern cameras with large pixel ranges and when the selected points are relatively close together, resulting in large and numerically similar $p_i$ and $p_i'$. To address this issue, we often use the normalized eight-point algorithm, where the points $p$ and $p'$ are first normalized by transformations $T$ and $T'$ (translation and scaling) into $q = Tp$ and $q' = T'p'$, which are used to estimate $F_q$. This can then be denormalized to obtain a quality estimate of $F$: since $q^T F_q q' = p^T T^T F_q T' p'$, we have $F = T^T F_q T'$.
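Here is a sketch of the normalization step, reusing the `eight_point` helper above; `pts1` and `pts2` are again hypothetical Nx2 correspondence arrays, and scaling to a mean distance of $\sqrt{2}$ from the centroid is one common choice, not the only one:

```python
import numpy as np

def normalize(pts):
    """Translate points to zero mean and scale so the mean distance from
    the origin is sqrt(2). Returns Nx3 homogeneous points and the 3x3 T."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2.0) / np.mean(np.linalg.norm(pts - mean, axis=1))
    T = np.array([[scale, 0.0, -scale * mean[0]],
                  [0.0, scale, -scale * mean[1]],
                  [0.0, 0.0, 1.0]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return (T @ pts_h.T).T, T

# Normalize both point sets, estimate F_q, then denormalize: F = T^T F_q T'.
q1, T1 = normalize(pts1)
q2, T2 = normalize(pts2)
F_q = eight_point(q1[:, :2], q2[:, :2])
F = T1.T @ F_q @ T2
```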

Parallel Images

Although the essential and fundamental matrices for the general case are already useful, they become far simpler when the two image planes are parallel. Parallelism eliminates the rotation $R$ from the matrices and reduces the constraint to $v = v'$, where $p = (u, v, 1)^T$ and $p' = (u', v', 1)^T$. The simplicity afforded by parallel images is useful for various problems in epipolar geometry (specifics will be discussed later), so we would prefer to acquire parallel images in advance or even correct the images to be parallel via a process called rectification.
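We can verify this reduction numerically. With $R = I$ and the translation $T = (B, 0, 0)^T$ along the baseline, expanding $p^T E p'$ gives $B(v' - v)$, which vanishes exactly when $v = v'$ (the point values below are arbitrary illustrations):

```python
import numpy as np

# Parallel image planes: R = I and T = (B, 0, 0) along the baseline.
B = 0.5
E = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -B],
              [0.0, B, 0.0]])    # E = [T_x] R = [T_x] with T = (B, 0, 0)

p = np.array([0.3, 0.7, 1.0])    # (u, v, 1)
p2 = np.array([0.1, 0.7, 1.0])   # (u', v', 1) with v' = v

# p^T E p' = B (v' - v), so the epipolar constraint reduces to v = v'.
print(p @ E @ p2)   # 0.0
```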

Rectification is the problem of finding projective transformations $H$ for the image planes that render them parallel; applying these transformations yields parallel images. Unfortunately, the details of rectification are outside the scope of this article, but it can be framed as an optimization problem over samples of corresponding points $p$ and $p'$ and solved using any suitable method.

Triangulation

Aside from the relationships between the corresponding points, the setup allows us to infer the position or depth of $P$, which was simply not possible with monocular vision alone. We can formulate $P$ as the intersection of the rays through $p$ and $p'$, and, when the camera parameters are known, infer its position by finding the $P^*$ that minimizes the total distance between $MP^*$ and $p$ and between $M'P^*$ and $p'$. This process of estimating the position of $P$ from the corresponding $p$ and $p'$ is called triangulation.

Triangulation
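One common way to realize this in code is linear (DLT) triangulation, sketched below. Note that it minimizes an algebraic error rather than the geometric reprojection distance described above, so in practice it often serves as the initialization for that optimization:

```python
import numpy as np

def triangulate(M1, M2, p1, p2):
    """Linear (DLT) triangulation: recover the 3D point P from pixel
    projections p1 = (u, v) and p2 = (u', v') given 3x4 camera matrices
    M1 and M2. A minimal sketch, not the optimal geometric minimizer."""
    # Each projection contributes two linear equations in homogeneous P.
    A = np.vstack([
        p1[0] * M1[2] - M1[0],
        p1[1] * M1[2] - M1[1],
        p2[0] * M2[2] - M2[0],
        p2[1] * M2[2] - M2[1],
    ])
    # The solution is the right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1]
    return P[:3] / P[3]   # back to inhomogeneous 3D coordinates
```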

Triangulation is somewhat difficult for non-parallel $p$ and $p'$, but much easier for parallel images, since we know that $v = v'$ and the position can be determined only by comparing $u$ and $u'$. Specifically, the disparity, the distance between $u$ and $u'$, equals $\frac{Bf}{z}$, where $B$ is the baseline, $f$ is the focal length, and $z$ is the depth. This means that the further away an object is and the shorter the baseline is, the less disparity we will observe. This explains why we (and many predators) have two forward-facing and relatively parallel eyes: they ensure stereo vision and allow us to perform relative depth estimation via triangulation, and to do so relatively easily using disparity (while prey eyes prioritize a panoramic view for spotting potential danger).
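A small numerical illustration of the relation $z = \frac{Bf}{d}$ for rectified cameras, with the baseline and focal length values below chosen purely for illustration:

```python
# Depth from disparity for rectified (parallel) cameras: d = u - u' = B f / z.
B = 0.1       # baseline in meters (hypothetical)
f = 800.0     # focal length in pixels (hypothetical)

def depth_from_disparity(d):
    return B * f / d   # z = B f / d

print(depth_from_disparity(40.0))   # 2.0 m
print(depth_from_disparity(4.0))    # 20.0 m: 10x the depth, 1/10 the disparity
```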

Correspondence Problem

We can notice here that all of the processes we have discussed so far (the eight-point algorithm for estimating $F$, rectification to create parallel images, and triangulation for depth estimation) involve knowing the corresponding projections $p$ and $p'$. However, determining these correspondences is not trivial for computers, and sometimes not even for us. For example, we can devise an algorithm that compares aggregates of neighboring pixels to find corresponding points, but it will inevitably struggle with images containing homogeneous regions or repetitive patterns, and with images exhibiting different brightness and exposure. To address this issue, we need an algorithm that performs both local and global comparisons at a high level.
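As an example of such a purely local algorithm, here is a minimal sketch of window-based matching along a scanline of rectified images, scoring candidates by the sum of squared differences (SSD); it exhibits exactly the failure modes mentioned above on homogeneous or repetitive regions:

```python
import numpy as np

def match_along_scanline(left, right, row, u, w=5):
    """Find the column u' in `right` matching the (2w+1)x(2w+1) window
    around (row, u) in `left` by minimizing SSD along the same scanline
    (valid for rectified images, where v = v'). A minimal local sketch:
    real matchers add global smoothness constraints across the image."""
    patch = left[row - w:row + w + 1, u - w:u + w + 1].astype(float)
    best_u, best_cost = None, np.inf
    for up in range(w, right.shape[1] - w):
        cand = right[row - w:row + w + 1, up - w:up + w + 1].astype(float)
        cost = np.sum((patch - cand) ** 2)   # sum of squared differences
        if cost < best_cost:
            best_u, best_cost = up, cost
    return best_u
```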

Correspondence Issues

In addition to such difficulties, the correspondence problem has inherent limitations, such as occlusions and foreshortening, visualized above. These limitations become more pronounced as the ratio $\frac{B}{z}$ grows. However, a small $\frac{B}{z}$ implies a large depth-estimation error for small measurement errors, and it also implies an extremely small disparity for parallel cameras. This presents a baseline tradeoff: a longer baseline worsens the limitations of occlusions and foreshortening, while a shorter baseline implies larger error and difficulty in depth estimation. This explains why humans, with a fixed and relatively small baseline, need to use knowledge of object size to infer depth, especially for far-away objects, even with stereo vision, and sometimes have difficulty estimating the depth of far-away objects with unfamiliar shapes and sizes.
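We can quantify this tradeoff: differentiating $z = \frac{Bf}{d}$ gives a depth error of roughly $\frac{z^2}{Bf}\,\Delta d$ for a disparity measurement error $\Delta d$, so the error grows quadratically with depth and shrinks with a longer baseline. A small sketch with assumed values:

```python
# Depth error from a disparity error: z = B f / d implies dz ~ (z^2 / (B f)) dd.
f = 800.0    # focal length in pixels (hypothetical)
dd = 0.5     # half-pixel disparity measurement error (hypothetical)

for B in (0.06, 0.5):        # human-like vs. wide stereo baseline, in meters
    for z in (2.0, 20.0):    # a near and a far object, in meters
        dz = z**2 / (B * f) * dd
        print(f"B={B} m, z={z} m -> depth error ~ {dz:.2f} m")
```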

Conclusion

In this article, we discussed how introducing a second viewpoint gives rise to epipolar geometry, which opens up the opportunity for defining many relationships between the two views and performing view morphing and depth estimation from corresponding points to better understand the 3D world. However, we also discovered that all of this is possible only after first obtaining the corresponding points, which can be complicated and inherently limited, and we related these concepts to how our own vision works. In the next article, we will discuss potential ways of overcoming those difficulties and limitations.
