Road to ML Engineer #60 - Single View Metrology

Last Edited: 6/4/2025

This blog post introduces single view metrology in computer vision.


In the last article, we covered the basics of camera models and homogeneous coordinates and applied them to camera calibration. However, camera calibration requires a sufficient number of known 3D points P, which are unavailable in most real-world scenarios. In fact, we often would like to reason about P itself. Hence, in this article, we will explore how the properties of homogeneous coordinates can be used to calibrate a camera and estimate P from a single image, that is, to perform single view metrology.

2D Transformations

Before diving into single view metrology, we need to understand the various transformations in 2D. Isometric transformations are transformations that can be described by rotation and translation, and they preserve distances. In homogeneous coordinates, we can express them using a matrix containing the rotation matrix R in the top left, the translation vector t in the top right, zeros in the bottom left, and 1 in the bottom right.

\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} SR & t \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \quad S = \begin{bmatrix} s & 0 \\ 0 & s \end{bmatrix}

Similarity transformations apply scaling on top of isometric transformations, and they preserve shapes. Hence, a scaling matrix S, a diagonal matrix containing a scale factor s, can be multiplied with the rotation matrix in the isometric transformation matrix to arrive at the similarity transformation matrix. Here, we can see that an isometric transformation is a special case of a similarity transformation with s = 1.

\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} A & t \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}

Affine transformations add shearing on top of similarity transformations, resulting in a linear transformation by a matrix A and a translation by t. Affine transformations preserve points, straight lines, and parallelism. Hence, both isometric and similarity transformations are special cases of affine transformations. Projective transformations further generalize affine transformations by also transforming the last dimension using v and b as follows.

\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} A & t \\ v & b \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}

As a result, projective transformations no longer preserve parallelism and instead preserve only collinearity of points. This means lines will be mapped to lines with possibly different angles and lengths. A line in homogeneous coordinates in 2D can be defined by \ell = [a, b, c]^T, where the slope and y-intercept are captured by -\frac{a}{b} and -\frac{c}{b}, respectively. All points p = [x, y, 1]^T on the line satisfy \ell^T p = 0. Since projective transformations preserve collinearity of points, they map \ell to another line \ell' = [a', b', c']^T with a new slope and intercept.
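We can verify these facts numerically. The sketch below (using NumPy and made-up points) builds the line through two points as their cross product, checks \ell^T p = 0, and uses the standard companion fact (not derived in this article) that a projective transformation H maps a line \ell to H^{-T}\ell; the matrix H here is hypothetical.

```python
import numpy as np

# Two points on the line y = 2x, in homogeneous coordinates
p1 = np.array([1.0, 2.0, 1.0])
p2 = np.array([3.0, 6.0, 1.0])

# The line through both points is their cross product,
# since ell^T p1 = 0 and ell^T p2 = 0 hold by construction
ell = np.cross(p1, p2)
a, b, c = ell
assert np.isclose(-a / b, 2.0)  # slope -a/b recovered as 2
assert np.isclose(-c / b, 0.0)  # intercept -c/b recovered as 0

# A projective transformation H maps points p -> Hp and lines
# ell -> H^{-T} ell, so transformed points stay on the transformed line
H = np.array([[1.0, 0.2, 3.0],
              [0.1, 1.0, -2.0],
              [0.05, 0.0, 1.0]])
ell_t = np.linalg.inv(H).T @ ell
assert np.isclose(ell_t @ (H @ p1), 0.0)
```

The final assertion is exactly the collinearity-preservation property: points on \ell land on \ell' after the transformation.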

Points & Lines at Infinity

When two lines \ell and \ell' intersect, the point of intersection x must satisfy both \ell^T x = 0 and \ell'^T x = 0, meaning x must be orthogonal to both \ell and \ell'. We can use these orthogonalities to find x = \ell \times \ell', which constrains x to be orthogonal to both \ell and \ell' by the definition of the cross product. Using this, we can compute the hypothetical intersection between the parallel lines \ell = [a, b, c]^T and \ell' = [a', b', c']^T, whose slopes are equal (i.e., \frac{a}{b} = \frac{a'}{b'}, meaning ab' - ba' = 0 and \frac{a'}{a} = \frac{b'}{b}).

\begin{bmatrix} a \\ b \\ c \end{bmatrix} \times \begin{bmatrix} a' \\ b' \\ c' \end{bmatrix} = \begin{bmatrix} bc' - cb' \\ ca' - ac' \\ ab' - ba' \end{bmatrix} = \begin{bmatrix} bc' - c\frac{ba'}{a} \\ c\frac{ab'}{b} - ac' \\ 0 \end{bmatrix} = \begin{bmatrix} (c' - c\frac{a'}{a})b \\ -(c' - c\frac{b'}{b})a \\ 0 \end{bmatrix} \propto \begin{bmatrix} b \\ -a \\ 0 \end{bmatrix} = x_{\infty}

We can see that the last entry is zero, meaning the two parallel lines intersect at infinity. We can also define the line at infinity, on which all the points at infinity (the intersections of parallel lines) lie. It is represented as \ell_{\infty} = [0, 0, c]^T, where c is an arbitrary value that can simply be set to 1. We find that projective transformations do not necessarily map points and lines at infinity to other points and lines at infinity due to the influence of v, while affine transformations do. (This intuitively makes sense, since parallel lines that construct a point at infinity are no longer guaranteed to be parallel after a projective transformation.)
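The behavior above can be checked with a small sketch (NumPy, made-up coefficients): two parallel lines intersect at a point at infinity, an affine transformation keeps it at infinity, and a projective transformation with nonzero v does not.

```python
import numpy as np

# Two parallel lines y = 2x + 1 and y = 2x + 5 as ell = [a, b, c]
# (slope -a/b = 2 for both)
l1 = np.array([2.0, -1.0, 1.0])
l2 = np.array([2.0, -1.0, 5.0])

# Their intersection is the cross product: the last entry is 0,
# so it is a point at infinity, proportional to [b, -a, 0]
x_inf = np.cross(l1, l2)
assert x_inf[2] == 0.0

# An affine transformation (bottom row [0, 0, 1]) keeps it at infinity
A = np.array([[1.0, 0.5, 3.0],
              [0.2, 2.0, -1.0],
              [0.0, 0.0, 1.0]])
assert (A @ x_inf)[2] == 0.0

# A projective transformation with nonzero v moves it off infinity
Hp = np.array([[1.0, 0.5, 3.0],
               [0.2, 2.0, -1.0],
               [0.1, 0.3, 1.0]])
assert abs((Hp @ x_inf)[2]) > 0.0
```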

Vanishing Points & Lines

In the 3D world, we need to introduce the concept of a plane, which can be expressed as \Pi = [a, b, c, d]^T, where (a, b, c) forms the normal vector n of the plane and d encodes the distance between the origin and the plane along n (the distance is |d| / \|n\|). A plane can be formally defined as the set of all points x such that x^T \Pi = 0. Lines can be defined as the intersection of two planes, although expressing lines in 3D with 4 degrees of freedom is complicated.

Vanishing Point

When parallel lines in 3D point towards the direction d = (a, b, c) in the camera coordinate system, their point at infinity is x_{\infty} = [a, b, c, 0]^T. The projective transformation by M in the camera model maps this to a vanishing point v on the 2D image plane, which may no longer be a point at infinity. This can be expressed as v = M x_{\infty} or v = Kd, where K is the camera matrix. Rearranging yields the direction d of the parallel lines that led to v: d = \frac{K^{-1} v}{\|K^{-1} v\|}.
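A minimal sketch of this relationship, assuming a hypothetical camera matrix K and direction d (all values below are made up for illustration):

```python
import numpy as np

# A hypothetical camera matrix K (zero skew, square pixels)
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

# Direction of a family of parallel 3D lines in camera coordinates
d = np.array([1.0, 2.0, 4.0])
d /= np.linalg.norm(d)

# Vanishing point v = Kd; divide by the last entry to get pixel coordinates
v = K @ d
v /= v[2]
print(v[:2])  # ~ [520, 640]

# Recover the direction: d = K^{-1} v / ||K^{-1} v||
d_rec = np.linalg.inv(K) @ v
d_rec /= np.linalg.norm(d_rec)
assert np.allclose(d_rec, d)
```

Note that v is only defined up to scale as a homogeneous point; normalizing by the last entry gives the pixel location, and the recovered direction is unique up to sign.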

Horizon Line

Similarly, the line at infinity comprising the points at infinity of a plane \Pi is projected to a line called the horizon line \ell_{\text{horiz}} on the image plane, which may also no longer be a line at infinity. Since all directions of parallel lines on the plane \Pi that lead to vanishing points on the horizon line must lie on the plane and be orthogonal to the normal vector n of the plane, we have d^T n = v^T K^{-T} n = 0. Given that these vanishing points lie on the horizon line as well, v^T \ell_{\text{horiz}} = 0, it follows that \ell_{\text{horiz}} = K^{-T} n and n = K^T \ell_{\text{horiz}}.

\cos(\theta) = \frac{d_1^T d_2}{\|d_1\|\,\|d_2\|} = \frac{v_1^T \omega v_2}{\sqrt{v_1^T \omega v_1}\sqrt{v_2^T \omega v_2}} \\ \cos(\theta) = \frac{n_1^T n_2}{\|n_1\|\,\|n_2\|} = \frac{\ell_1^T \omega^{-1} \ell_2}{\sqrt{\ell_1^T \omega^{-1} \ell_1}\sqrt{\ell_2^T \omega^{-1} \ell_2}}, \quad \text{where } \omega = (KK^T)^{-1}

Therefore, if we can obtain K through calibration and identify the horizon line associated with a plane in an image, we can estimate the normal vector n of the plane and capture the orientation of a surface in 3D. These equations can be further developed to derive the angle between two directions (d_1 and d_2) corresponding to distinct vanishing points (v_1 and v_2), and the angle between two planes (with normals n_1 and n_2) corresponding to horizon lines \ell_1 and \ell_2, as the equations above show.
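The first cosine identity can be checked numerically: since v = Kd, the quantity v_1^T \omega v_2 reduces exactly to d_1^T d_2 because K^T (K K^T)^{-1} K = I. A sketch with hypothetical intrinsics and directions:

```python
import numpy as np

# Hypothetical intrinsics
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
omega = np.linalg.inv(K @ K.T)  # omega = (K K^T)^{-1}

# Two 3D directions (unit vectors) and their vanishing points v = Kd
d1 = np.array([1.0, 0.0, 2.0]); d1 /= np.linalg.norm(d1)
d2 = np.array([0.0, 1.0, 3.0]); d2 /= np.linalg.norm(d2)
v1, v2 = K @ d1, K @ d2

# Angle estimated from the image alone via omega
cos_img = (v1 @ omega @ v2) / np.sqrt((v1 @ omega @ v1) * (v2 @ omega @ v2))

# It matches the true angle between the 3D directions
cos_3d = d1 @ d2
assert np.isclose(cos_img, cos_3d)
```

Because the formula is a ratio, it is insensitive to the arbitrary homogeneous scale of v_1 and v_2.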

Single View Metrology

Based on the mathematics established above, we can estimate various quantities about the camera and the subject in an image. For example, we can obtain two vanishing points from two pairs of parallel lines in two directions that are orthogonal to each other, and use the cosine equation to arrive at v_1^T \omega v_2 = 0. However, K still has 3 degrees of freedom (the focal length and the principal point), even assuming zero skew and square pixels, and having only one constraint does not allow us to solve for K via \omega. Therefore, we can take another vanishing point v_3 from a third direction that is orthogonal to the other two. This provides three constraints (v_1^T \omega v_2 = 0, v_1^T \omega v_3 = 0, and v_2^T \omega v_3 = 0) to solve for K without knowing P.
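A sketch of this calibration procedure, under the zero-skew, square-pixel assumption. In that case \omega has the form [[w1, 0, w2], [0, w1, w3], [w2, w3, w4]] up to scale, each orthogonality constraint v_i^T \omega v_j = 0 is linear in (w1, w2, w3, w4), and the intrinsics can be read off in closed form. All numbers below (focal length, principal point, rotation angles) are hypothetical, used only to synthesize three orthogonal vanishing points:

```python
import numpy as np

# Hypothetical ground-truth intrinsics to recover (zero skew, square pixels)
f, cx, cy = 800.0, 320.0, 240.0
K = np.array([[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]])

# Three mutually orthogonal directions: the columns of a rotation matrix
ax, ay, az = 0.5, 0.4, 0.3
Rx = np.array([[1, 0, 0], [0, np.cos(ax), -np.sin(ax)], [0, np.sin(ax), np.cos(ax)]])
Ry = np.array([[np.cos(ay), 0, np.sin(ay)], [0, 1, 0], [-np.sin(ay), 0, np.cos(ay)]])
Rz = np.array([[np.cos(az), -np.sin(az), 0], [np.sin(az), np.cos(az), 0], [0, 0, 1]])
R = Rz @ Ry @ Rx

# Vanishing points v_i = K d_i, normalized to [x, y, 1]
V = [(K @ R[:, i]) / (K @ R[:, i])[2] for i in range(3)]

# Each orthogonal pair gives one linear constraint v_i^T omega v_j = 0
# in the unknowns (w1, w2, w3, w4) of the zero-skew omega
rows = []
for i, j in [(0, 1), (0, 2), (1, 2)]:
    (xi, yi, _), (xj, yj, _) = V[i], V[j]
    rows.append([xi * xj + yi * yj, xi + xj, yi + yj, 1.0])

# The null vector of the 3x4 system is omega up to scale
w1, w2, w3, w4 = np.linalg.svd(np.array(rows))[2][-1]

# Closed-form recovery of the intrinsics from omega
cx_est, cy_est = -w2 / w1, -w3 / w1
f_est = np.sqrt(w4 / w1 - cx_est**2 - cy_est**2)
print(f_est, cx_est, cy_est)  # ~ (800, 320, 240)
```

In practice the three vanishing points come from detected line segments on mutually orthogonal structures (e.g., the edges of a building), and the same linear system applies.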

Single View Metrology Example

Once we know K, we can use n = K^T \ell_{\text{horiz}} to estimate the orientations of the planes and use them to reconstruct an estimated 3D scene from a single image. However, this method does not allow us to obtain the scale and position of the planes, nor does it account for occluded objects, which are essential for properly reconstructing the 3D scene captured in the image. (Different objects with different sizes, positions, and orientations can produce perfectly identical projections.) This demonstrates how we can extract rich information from a single image by using the properties of projective transformations of points and lines at infinity, but it also reveals the inherent limitations of single view metrology.

It also explains the difficulty of depth estimation from a single image, even with deep learning, which can only partially infer the sizes and occluded parts of objects by learning from training data. (This is why we tend to use a scale-invariant loss for single-view depth estimation: it makes the task easier and lets the model focus on learning orientations.) Humans may be even more capable of interpreting an image, understanding the orientations of objects, and inferring their sizes from our experiences in the real world, but the same physical limitations apply to us, making us susceptible to optical illusions (especially for objects with unusual shapes and sizes).

Conclusion

This article covered the various transformations in 2D, points and lines at infinity, vanishing points and horizon lines as the results of projective transformations of points and lines at infinity, and single view metrology for calibration and plane orientation estimation made possible by these concepts. We discovered how single view metrology is helpful but has inherent limitations due to the inevitable loss of information about the scale and position of surfaces. The concepts and mathematics we covered will remain relevant in upcoming articles, where we will discuss other approaches to understanding 3D scenes. Therefore, I recommend learning them thoroughly by reading this article and the resources cited below until you have no confusion left.

Resources