Road to ML Engineer #59 - Camera Models

Last Edited: 5/29/2025

This blog post introduces camera models and homogeneous coordinates in computer vision.

ML

So far, we have been discussing how we classify, detect, and segment objects, and generate images. However, we haven't covered how cameras work or how they can be modeled mathematically. Now that we have covered the basics of 2D computer vision (on images) and are transitioning to 3D computer vision, we will discuss camera models and homogeneous coordinates, which are fundamental concepts in computer vision.

Films & Sensors

Fundamentally, visual perception is the ability to detect light and form an image based on it. When we see an object, we are sensing and interpreting the light of varying intensity and wavelength bouncing off the object and its surroundings. Therefore, if we can somehow capture the visible light (photo-, from the Greek for light) reaching an apparatus at a point in time, we can record or draw (-graph) the visual of a scene as an image: a photograph. Black and white film achieved this by capturing the intensity of visible light using silver halide crystals on a plastic base, which darken as they are exposed to light. By exposing the film for a short period of time, it creates a latent image of a scene, which can be chemically developed into a negative and printed to arrive at the final image.

As research progressed, color film was invented, with each of its layers reactive only to light in the spectrum corresponding to red, green, or blue. Furthermore, light-dependent resistors (LDRs), whose resistance changes with the intensity of light they are exposed to, and other light sensors were invented, leading to the birth of digital cameras. The digital camera uses color filter arrays, mainly the Bayer filter (RGGB), which arranges one red, one blue, and two green filters (two greens because human eyes are more sensitive to green) in a mosaic pattern, and performs demosaicing to produce color images.

Pinhole Camera Model

Though films and sensors are the most crucial components of cameras, simply exposing them to light cannot capture an object clearly, since all the light emitted from any point on the object could reach every point on the film (or sensor), even over a short exposure, creating a bright but blurry image. Hence, we can limit the light rays traveling from each point on the object to the film to one or a few by setting up a barrier with a small aperture (pinhole) between the object and the camera, arriving at a sharp image. This simple camera model is called the pinhole camera model, which is visualized below.

[Figure: Pinhole Camera Model]

The film or sensor is called the image (retinal) plane, the aperture at the center of the camera is called the pinhole $O$, and the distance between the image plane and the pinhole is the focal length $f$. Here, we define a camera coordinate system $[i, j, k]$, where $k$ is perpendicular to the image plane and points towards it. The camera maps or projects a point of a 3D object $P = [x, y, z]^T$ to a point $P' = [x', y']^T$ on the 2D image plane $\Pi'$. Using similar triangles, we find that the pinhole camera projects $P$ onto $\Pi'$ as $P' = [x', y']^T = [f\frac{x}{z}, f\frac{y}{z}]^T$.
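To make the projection concrete, here is a minimal NumPy sketch of this mapping; the point and focal length below are made-up values for illustration:

```python
import numpy as np

def pinhole_project(points, f):
    """Project 3D points (N x 3, in the camera frame) onto the image plane
    of a pinhole camera with focal length f: [x, y, z] -> [fx/z, fy/z]."""
    points = np.asarray(points, dtype=float)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.stack([f * x / z, f * y / z], axis=1)

# A point 2 m in front of the pinhole, 0.5 m to the right and 0.25 m up,
# projected with a 35 mm focal length
print(pinhole_project([[0.5, 0.25, 2.0]], f=0.035))  # [[0.00875  0.004375]]
```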

Paraxial Refraction Model

Though pinhole cameras can capture a scene sharply with a small enough pinhole, a small pinhole limits too much light and leads to a dimmer image. Hence, instead of using a simple pinhole (with its tradeoff between sharpness and intensity), we can use lenses to uniquely map all the light from $P$ to $P'$, maintaining the intensity while achieving sharpness. The illustration below shows the camera model that uses a thin lens, which is called the paraxial refraction model. In this model, parallel light rays refract and converge at a focal point, and the distance between the center of the lens and the focal point is called the focal length $f$.

[Figure: Paraxial Refraction Model]

The distance from the focal point to the image plane is $z_0$, and we can define the distance between the center of the lens and the image plane as $z' = f + z_0$. Analogously to the pinhole camera model, the paraxial refraction model maps $P$ to $P'$ as $P' = [x', y']^T = [z'\frac{x}{z}, z'\frac{y}{z}]^T$. In a digital camera, the coordinate system of the image typically differs from the camera coordinate system, which requires a translation by $[c_x, c_y]^T$. Also, we need to convert the physical measurements (cm) of width and height to pixels using the scale factors $k$ and $l$ (usually in $\frac{\text{pixels}}{\text{cm}}$), respectively. Hence, the projection of $P$ onto $\Pi'$ by a digital camera can be expressed as $P' = [x', y']^T = [\alpha\frac{x}{z} + c_x, \beta\frac{y}{z} + c_y]^T$, where $\alpha = fk$ and $\beta = fl$.
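In code, the pixel-space mapping is a small extension of the pinhole sketch above; the intrinsic values here are again made up for illustration:

```python
import numpy as np

def project_to_pixels(points, alpha, beta, cx, cy):
    """Map camera-frame 3D points (N x 3) to pixel coordinates, where
    alpha = f*k and beta = f*l fold the focal length and the
    physical-to-pixel scale factors together (no skew, no distortion)."""
    points = np.asarray(points, dtype=float)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.stack([alpha * x / z + cx, beta * y / z + cy], axis=1)

# Made-up intrinsics: 800-pixel focal lengths, principal point at (320, 240)
print(project_to_pixels([[0.5, 0.25, 2.0]], 800.0, 800.0, 320.0, 240.0))
# -> [[520. 340.]]
```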

Homogeneous Coordinates

From here onwards, we will be dealing with projections and applying various operations like rotation, translation, scaling, reflection, and shearing. However, among these, translation is the only operation that cannot be expressed as a linear transformation (a matrix multiplication) in the standard coordinate system and instead requires vector addition. This prevents us from expressing transformations as a chain of matrix multiplications, which are easier and faster to compute and manipulate.

Hence, instead of the standard coordinate system, we use homogeneous coordinates, which add an extra dimension to the standard coordinates, turning $[x, y]^T$ into $[x, y, 1]^T$ and $[x, y, z]^T$ into $[x, y, z, 1]^T$. A point $[x, y, w]^T$ in the homogeneous coordinate system can be converted back to $[\frac{x}{w}, \frac{y}{w}]^T$ in the standard coordinate system. The additional dimension enables us to express translation of a vector with matrix multiplication. Hence, the projection of a digital camera can be expressed in homogeneous coordinates as follows.

$$P' = \begin{bmatrix} \alpha x + c_x z \\ \beta y + c_y z \\ z \end{bmatrix} = \begin{bmatrix} \alpha & 0 & c_x & 0 \\ 0 & \beta & c_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = MP = K[I~~~\vec{0}]P$$

As you can see, the expression is more concise, as the translation by $[c_x, c_y]^T$ is now included in the matrix multiplication. The matrix that transforms $P$ to $P'$ in homogeneous coordinates is $M$, and $K$ is the $3 \times 3$ matrix obtained by decomposing $M$; it contains the important camera parameters and is called the camera matrix. Another benefit of homogeneous coordinates is that, unlike standard coordinate systems, they can express points, lines, and planes at infinity (with infinite distance and area) while preserving direction, by setting the extra dimension $w = 0$. Although this may seem irrelevant for now, it will be crucial later.
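To see both benefits concretely, here is a minimal NumPy sketch (with made-up parameter values) of a 2D translation performed as a single matrix multiplication, followed by the projection $M = K[I~~~\vec{0}]$ and the division by $w$:

```python
import numpy as np

# Translation as matrix multiplication, thanks to the extra dimension:
# shift the 2D point [2, 1] by [3, 5]
t = np.array([[1.0, 0.0, 3.0],
              [0.0, 1.0, 5.0],
              [0.0, 0.0, 1.0]])
print(t @ np.array([2.0, 1.0, 1.0]))  # [5. 6. 1.]

# Projection M = K [I | 0] with made-up intrinsics
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
M = K @ np.hstack([np.eye(3), np.zeros((3, 1))])

P = np.array([0.5, 0.25, 2.0, 1.0])  # homogeneous 3D point in the camera frame
p = M @ P                            # homogeneous 2D point [x', y', w]
print(p[:2] / p[2])                  # divide by w -> [520. 340.]
```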

Camera Matrix Model

As described above, the camera matrix contains parameters for focal length ($\alpha$ and $\beta$) and offsets ($c_x$ and $c_y$) between the camera coordinate system and the image coordinate system. These parameters depend on the camera itself and are therefore called intrinsic parameters. Aside from these, there are two additional parameters, skewness and distortion. Although most cameras have zero skew, manufacturing errors can cause the camera coordinate system to skew slightly. The following is the camera matrix that includes skewness.

$$K = \begin{bmatrix} \alpha & -\alpha \cot(\theta) & c_x \\ 0 & \frac{\beta}{\sin(\theta)} & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

Here, $\theta$ is the angle between $i$ and $j$ in the camera coordinate system (the derivation is outside the scope of this article). We often ignore distortion effects, arriving at a camera matrix with 5 intrinsic parameters. Although it may seem that the camera matrix entirely captures how a camera projects a 3D object to a 2D image plane, this is only the case when $P$ is expressed with respect to the camera coordinate system, which is often not so. $P$ is often expressed in an arbitrary world reference system, which requires an additional rotation and translation to align it with the camera coordinate system.

$$P' = MP = K[R~~~T]P$$

Here, $R$ is the rotation matrix (which could be the result of multiplying rotation matrices corresponding to rotations around the x, y, and z axes), and $T$ is the translation vector. The rotation involves 3 parameters (rotation angles around the 3 axes), and the translation also has 3 parameters (offsets along the 3 axes), resulting in 6 extrinsic parameters, which are independent of the camera itself (how the camera is manufactured). Hence, the full projection matrix $M$ has 11 degrees of freedom, consisting of 5 intrinsic parameters and 6 extrinsic parameters.
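As a rough end-to-end sketch, the following assembles $M = K[R~~~T]$ from hypothetical intrinsic and extrinsic values (with a rotation around the z-axis only, for brevity) and projects a world point to pixel coordinates:

```python
import numpy as np

def camera_matrix(alpha, beta, cx, cy, theta=np.pi / 2):
    """Camera matrix K from the 5 intrinsic parameters; theta is the
    angle between i and j, so theta = pi/2 means (numerically) zero skew."""
    return np.array([[alpha, -alpha / np.tan(theta), cx],
                     [0.0,    beta / np.sin(theta),  cy],
                     [0.0,    0.0,                   1.0]])

def rotation_z(psi):
    """Rotation around the z-axis (one of the 3 rotation parameters)."""
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

# Hypothetical intrinsics and extrinsics
K = camera_matrix(alpha=800.0, beta=800.0, cx=320.0, cy=240.0)
R = rotation_z(0.1)                        # extrinsic rotation
T = np.array([[0.2], [0.0], [1.0]])        # extrinsic translation

M = K @ np.hstack([R, T])                  # full 3x4 projection matrix, 11 DOF

P_world = np.array([0.5, 0.25, 2.0, 1.0])  # homogeneous world point
p = M @ P_world
print(p[:2] / p[2])                        # pixel coordinates
```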

Camera Calibration

In many cases, the intrinsic and extrinsic camera parameters are unknown, making it difficult to deduce information about $P$ from $P'$. However, these parameters can be inferred when the camera and enough corresponding pairs of $P$ and $P'$ are available, and this problem of estimating the camera parameters is called camera calibration. By collecting various points in the world reference system and their corresponding points on the image, we can arrive at a system of equations that can be solved for the entries $m$ of $M$, which can then be used to estimate the camera parameters.

Due to inevitable measurement errors and $m = 0$ always being a trivial solution, we constrain the solution with $\|m\|^2 = 1$ and minimize $\|Pm\|^2$ using singular value decomposition (SVD), where $P$ here denotes the data matrix built from the point correspondences. Although this calibration tends to work well, there are degenerate configurations where estimation is not possible, such as when all the points $P$ lie on the same plane. There is another way of performing camera calibration using a single image, which we will cover in the next article.
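As a sketch of this least-squares step, the standard direct linear transform (DLT) formulation can be written as follows; the array names and shapes are assumptions for illustration:

```python
import numpy as np

def calibrate_dlt(world_pts, image_pts):
    """Estimate the 3x4 projection matrix M from n >= 6 correspondences.
    world_pts: (n, 3) points in the world reference system;
    image_pts: (n, 2) corresponding pixel coordinates."""
    rows = []
    for (X, Y, Z), (u, v) in zip(world_pts, image_pts):
        Ph = np.array([X, Y, Z, 1.0])
        # Each correspondence contributes two linear equations in m
        rows.append(np.concatenate([Ph, np.zeros(4), -u * Ph]))
        rows.append(np.concatenate([np.zeros(4), Ph, -v * Ph]))
    A = np.array(rows)
    # The minimizer of ||A m||^2 subject to ||m||^2 = 1 is the right
    # singular vector associated with the smallest singular value
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)
```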

Conclusion

In this article, we covered how films and sensors work on a surface level, how we can model cameras, how we can express those models concisely, and a brief introduction to the camera calibration problem. Although it may not seem directly relevant to computer vision, understanding camera models allows us to deduce information about $P$ and the 3D objects captured in one or more images, and to understand approaches to 3D computer vision. In the next article, we will dive deeper into the properties of homogeneous coordinates and their applications in camera models.

Resources