Road to ML Engineer #60 - Single View Metrology

Last Edited: 6/4/2025

This blog post introduces single view metrology in computer vision.


In the last article, we covered the basics of camera models and homogeneous coordinates and applied them to camera calibration. However, camera calibration requires a sufficient number of known 3D points P, which are unavailable in most real-world scenarios. In fact, we often would like to reason about P itself. Hence, in this article, we will explore how the properties of homogeneous coordinates can be used to calibrate a camera and estimate P from a single image, that is, to perform single view metrology.

2D Transformations

Before diving into single view metrology, we need to understand the various transformations in 2D. Isometric transformations are transformations that can be described by rotation and translation, and they preserve distances. In homogeneous coordinates, we can express them using a matrix containing the rotation matrix R in the top left, the translation vector t in the top right, zeros in the bottom left, and 1 in the bottom right.

\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} SR & t \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \quad S = \begin{bmatrix} s & 0 \\ 0 & s \end{bmatrix}

Similarity transformations apply scaling on top of isometric transformations, and they preserve shapes. Hence, a scaling matrix S, a diagonal matrix containing a scale factor s, can be multiplied with the rotation matrix in the isometric transformation matrix to arrive at the similarity transformation matrix. Here, we can see that an isometric transformation is a special case of a similarity transformation with s = 1.

\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} A & t \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}

Affine transformations add shearing on top of similarity transformations, resulting in a linear transformation by a matrix A and a translation by t. Affine transformations preserve points, straight lines, and parallelism. Hence, both isometric and similarity transformations are special cases of affine transformations. Projective transformations further generalize affine transformations by also transforming the last dimension using v and b as follows.

\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} A & t \\ v & b \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}

As a result, projective transformations no longer preserve parallelism and instead preserve only collinearity of points. This means lines will be mapped to lines with possibly different angles and lengths. A line in homogeneous coordinates in 2D can be defined by \ell = [a, b, c]^T, where the slope and y-intercept are captured by -\frac{a}{b} and -\frac{c}{b}, respectively. All points p = [x, y, 1]^T on the line satisfy \ell^T p = 0. Since projective transformations preserve collinearity of points, they map \ell to another line \ell' = [a', b', c']^T with a new slope and intercept.
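We can verify these facts numerically. The sketch below (using NumPy and made-up points) builds the line through two points as their cross product, checks \ell^T p = 0, and uses the standard companion fact (not derived in this article) that a projective transformation H maps a line \ell to H^{-T}\ell; the matrix H here is hypothetical.

```python
import numpy as np

# Two points on the line y = 2x, in homogeneous coordinates
p1 = np.array([1.0, 2.0, 1.0])
p2 = np.array([3.0, 6.0, 1.0])

# The line through both points is their cross product,
# since ell^T p1 = 0 and ell^T p2 = 0 hold by construction
ell = np.cross(p1, p2)
a, b, c = ell
assert np.isclose(-a / b, 2.0)  # slope -a/b recovered as 2
assert np.isclose(-c / b, 0.0)  # intercept -c/b recovered as 0

# A projective transformation H maps points p -> Hp and lines
# ell -> H^{-T} ell, so transformed points stay on the transformed line
H = np.array([[1.0, 0.2, 3.0],
              [0.1, 1.0, -2.0],
              [0.05, 0.0, 1.0]])
ell_t = np.linalg.inv(H).T @ ell
assert np.isclose(ell_t @ (H @ p1), 0.0)
```

The final assertion is exactly the collinearity-preservation property: points on \ell land on \ell' after the transformation.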

Points & Lines at Infinity

When two lines \ell and \ell' intersect, the point of intersection x must satisfy both \ell^T x = 0 and \ell'^T x = 0, meaning x must be orthogonal to both \ell and \ell'. We can use these orthogonalities to find x = \ell \times \ell', which constrains x to be orthogonal to both \ell and \ell' by the definition of the cross product. Using this, we can compute the hypothetical intersection between the parallel lines \ell = [a, b, c]^T and \ell' = [a', b', c']^T, whose slopes are equal (i.e., \frac{a}{b} = \frac{a'}{b'}, meaning ab' - ba' = 0 and \frac{a'}{a} = \frac{b'}{b}).

\begin{bmatrix} a \\ b \\ c \end{bmatrix} \times \begin{bmatrix} a' \\ b' \\ c' \end{bmatrix} = \begin{bmatrix} bc' - cb' \\ ca' - ac' \\ ab' - ba' \end{bmatrix} = \begin{bmatrix} bc' - c\frac{ba'}{a} \\ c\frac{ab'}{b} - ac' \\ 0 \end{bmatrix} = \begin{bmatrix} (c' - c\frac{a'}{a})b \\ -(c' - c\frac{b'}{b})a \\ 0 \end{bmatrix} \propto \begin{bmatrix} b \\ -a \\ 0 \end{bmatrix} = x_{\infty}

We can see that the last entry is zero, meaning the two parallel lines intersect at infinity. We can also define the line at infinity, on which all the points at infinity (the intersections of parallel lines) lie. It is represented as \ell_{\infty} = [0, 0, c]^T, where c is an arbitrary value that can simply be set to 1. We find that projective transformations do not necessarily map points and lines at infinity to other points and lines at infinity due to the influence of v, while affine transformations do. (This intuitively makes sense, since parallel lines that construct a point at infinity are no longer guaranteed to be parallel after a projective transformation.)
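The behavior above can be checked with a small sketch (NumPy, made-up coefficients): two parallel lines intersect at a point at infinity, an affine transformation keeps it at infinity, and a projective transformation with nonzero v does not.

```python
import numpy as np

# Two parallel lines y = 2x + 1 and y = 2x + 5 as ell = [a, b, c]
# (slope -a/b = 2 for both)
l1 = np.array([2.0, -1.0, 1.0])
l2 = np.array([2.0, -1.0, 5.0])

# Their intersection is the cross product: the last entry is 0,
# so it is a point at infinity, proportional to [b, -a, 0]
x_inf = np.cross(l1, l2)
assert x_inf[2] == 0.0

# An affine transformation (bottom row [0, 0, 1]) keeps it at infinity
A = np.array([[1.0, 0.5, 3.0],
              [0.2, 2.0, -1.0],
              [0.0, 0.0, 1.0]])
assert (A @ x_inf)[2] == 0.0

# A projective transformation with nonzero v moves it off infinity
Hp = np.array([[1.0, 0.5, 3.0],
               [0.2, 2.0, -1.0],
               [0.1, 0.3, 1.0]])
assert abs((Hp @ x_inf)[2]) > 0.0
```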

Vanishing Points & Lines

In the 3D world, we need to introduce the concept of a plane, which can be expressed as \Pi = [a, b, c, d]^T, where (a, b, c) forms the normal vector n of the plane and d encodes the distance between the origin and the plane along n (the distance is |d| / \|n\|). A plane can be formally defined as the set of all points x such that x^T \Pi = 0. Lines can be defined as the intersection of two planes, although expressing lines in 3D with 4 degrees of freedom is complicated.

Vanishing Point

When parallel lines in 3D point towards the direction d = (a, b, c) in the camera coordinate system, their point at infinity is x_{\infty} = [a, b, c, 0]^T. The projective transformation by M in the camera model maps this to a vanishing point v on the 2D image plane, which may no longer be a point at infinity. This can be expressed as v = M x_{\infty} or v = Kd, where K is the camera matrix. Rearranging yields the direction d of the parallel lines that led to v: d = \frac{K^{-1} v}{\|K^{-1} v\|}.
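A minimal sketch of this relationship, assuming a hypothetical camera matrix K and direction d (all values below are made up for illustration):

```python
import numpy as np

# A hypothetical camera matrix K (zero skew, square pixels)
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

# Direction of a family of parallel 3D lines in camera coordinates
d = np.array([1.0, 2.0, 4.0])
d /= np.linalg.norm(d)

# Vanishing point v = Kd; divide by the last entry to get pixel coordinates
v = K @ d
v /= v[2]
print(v[:2])  # ~ [520, 640]

# Recover the direction: d = K^{-1} v / ||K^{-1} v||
d_rec = np.linalg.inv(K) @ v
d_rec /= np.linalg.norm(d_rec)
assert np.allclose(d_rec, d)
```

Note that v is only defined up to scale as a homogeneous point; normalizing by the last entry gives the pixel location, and the recovered direction is unique up to sign.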

Horizon Line

Similarly, the line at infinity comprising the points at infinity of a plane \Pi is projected to a line called the horizon line \ell_{\text{horiz}} on the image plane, which may also no longer be a line at infinity. Since all directions of parallel lines on the plane \Pi that lead to vanishing points on the horizon line must lie on the plane and be orthogonal to the normal vector n of the plane, we have d^T n = v^T K^{-T} n = 0. Given that these vanishing points lie on the horizon line as well, v^T \ell_{\text{horiz}} = 0, it follows that \ell_{\text{horiz}} = K^{-T} n and n = K^T \ell_{\text{horiz}}.

\cos(\theta) = \frac{d_1^T d_2}{\|d_1\|\,\|d_2\|} = \frac{v_1^T \omega v_2}{\sqrt{v_1^T \omega v_1}\sqrt{v_2^T \omega v_2}} \\ \cos(\theta) = \frac{n_1^T n_2}{\|n_1\|\,\|n_2\|} = \frac{\ell_1^T \omega^{-1} \ell_2}{\sqrt{\ell_1^T \omega^{-1} \ell_1}\sqrt{\ell_2^T \omega^{-1} \ell_2}}, \quad \text{where } \omega = (KK^T)^{-1}

Therefore, if we can obtain K through calibration and identify the horizon line associated with a plane in an image, we can estimate the normal vector n of the plane and capture the orientation of a surface in 3D. These equations can be further developed to derive the angle between two directions (d_1 and d_2) corresponding to distinct vanishing points (v_1 and v_2), and the angle between two planes (with normals n_1 and n_2) corresponding to horizon lines \ell_1 and \ell_2, as the equations above show.
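The first cosine identity can be checked numerically: since v = Kd, the quantity v_1^T \omega v_2 reduces exactly to d_1^T d_2 because K^T (K K^T)^{-1} K = I. A sketch with hypothetical intrinsics and directions:

```python
import numpy as np

# Hypothetical intrinsics
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
omega = np.linalg.inv(K @ K.T)  # omega = (K K^T)^{-1}

# Two 3D directions (unit vectors) and their vanishing points v = Kd
d1 = np.array([1.0, 0.0, 2.0]); d1 /= np.linalg.norm(d1)
d2 = np.array([0.0, 1.0, 3.0]); d2 /= np.linalg.norm(d2)
v1, v2 = K @ d1, K @ d2

# Angle estimated from the image alone via omega
cos_img = (v1 @ omega @ v2) / np.sqrt((v1 @ omega @ v1) * (v2 @ omega @ v2))

# It matches the true angle between the 3D directions
cos_3d = d1 @ d2
assert np.isclose(cos_img, cos_3d)
```

Because the formula is a ratio, it is insensitive to the arbitrary homogeneous scale of v_1 and v_2.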

Single View Metrology

Based on the mathematics established above, we can estimate various quantities about the camera and the subject in an image. For example, we can obtain two vanishing points from two pairs of parallel lines in two directions that are orthogonal to each other, and use the cosine equation to arrive at v_1^T \omega v_2 = 0. However, K still has 3 degrees of freedom (the focal length and the principal point), even assuming zero skew and square pixels, and having only one constraint does not allow us to solve for K via \omega. Therefore, we can take another vanishing point v_3 from a third direction that is orthogonal to the other two. This provides three constraints (v_1^T \omega v_2 = 0, v_1^T \omega v_3 = 0, and v_2^T \omega v_3 = 0) to solve for K without knowing P.
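A sketch of this calibration procedure, under the zero-skew, square-pixel assumption. In that case \omega has the form [[w1, 0, w2], [0, w1, w3], [w2, w3, w4]] up to scale, each orthogonality constraint v_i^T \omega v_j = 0 is linear in (w1, w2, w3, w4), and the intrinsics can be read off in closed form. All numbers below (focal length, principal point, rotation angles) are hypothetical, used only to synthesize three orthogonal vanishing points:

```python
import numpy as np

# Hypothetical ground-truth intrinsics to recover (zero skew, square pixels)
f, cx, cy = 800.0, 320.0, 240.0
K = np.array([[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]])

# Three mutually orthogonal directions: the columns of a rotation matrix
ax, ay, az = 0.5, 0.4, 0.3
Rx = np.array([[1, 0, 0], [0, np.cos(ax), -np.sin(ax)], [0, np.sin(ax), np.cos(ax)]])
Ry = np.array([[np.cos(ay), 0, np.sin(ay)], [0, 1, 0], [-np.sin(ay), 0, np.cos(ay)]])
Rz = np.array([[np.cos(az), -np.sin(az), 0], [np.sin(az), np.cos(az), 0], [0, 0, 1]])
R = Rz @ Ry @ Rx

# Vanishing points v_i = K d_i, normalized to [x, y, 1]
V = [(K @ R[:, i]) / (K @ R[:, i])[2] for i in range(3)]

# Each orthogonal pair gives one linear constraint v_i^T omega v_j = 0
# in the unknowns (w1, w2, w3, w4) of the zero-skew omega
rows = []
for i, j in [(0, 1), (0, 2), (1, 2)]:
    (xi, yi, _), (xj, yj, _) = V[i], V[j]
    rows.append([xi * xj + yi * yj, xi + xj, yi + yj, 1.0])

# The null vector of the 3x4 system is omega up to scale
w1, w2, w3, w4 = np.linalg.svd(np.array(rows))[2][-1]

# Closed-form recovery of the intrinsics from omega
cx_est, cy_est = -w2 / w1, -w3 / w1
f_est = np.sqrt(w4 / w1 - cx_est**2 - cy_est**2)
print(f_est, cx_est, cy_est)  # ~ (800, 320, 240)
```

In practice the three vanishing points come from detected line segments on mutually orthogonal structures (e.g., the edges of a building), and the same linear system applies.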

Single View Metrology Example

Once we know K, we can use n = K^T \ell_{\text{horiz}} to estimate the orientations of the planes and use them to reconstruct an estimated 3D scene from a single image. However, this method does not allow us to obtain the scale and position of the planes, nor does it account for occluded objects, which are essential for properly reconstructing the 3D scene captured in the image. (Different objects with different sizes, positions, and orientations can produce perfectly identical projections.) This demonstrates how we can extract rich information from a single image by using the properties of projective transformations of points and lines at infinity, but it also reveals the inherent limitations of single view metrology.

It also explains the difficulty of depth estimation from a single image, even with deep learning, which can only partially infer the sizes and occluded parts of objects by learning from training data. (This is why we tend to use a scale-invariant loss for single-view depth estimation: it makes the task easier and lets the model focus on learning orientations.) Humans may be even more capable of interpreting an image, understanding the orientations of objects, and inferring their sizes from our experiences in the real world, but the same physical limitations apply to us, making us susceptible to optical illusions (especially for objects with unusual shapes and sizes).

Conclusion

This article covered the various transformations in 2D, points and lines at infinity, vanishing points and horizon lines as the results of projective transformations of points and lines at infinity, and single view metrology for calibration and plane orientation estimation made possible by these concepts. We discovered how single view metrology is helpful but has inherent limitations due to the inevitable loss of information about the scale and position of surfaces. The concepts and mathematics we covered will remain relevant in upcoming articles, where we will discuss other approaches to understanding 3D scenes. Therefore, I recommend learning them thoroughly by reading this article and the resources cited below until you have no confusion left.

Resources