Road to ML Engineer #59 - Camera Models

Last Edited: 5/29/2025

This blog post introduces camera models and homogeneous coordinates in computer vision.

ML

So far, we have been discussing how we classify, detect, and segment objects, and generate images. However, we haven't covered how cameras work or how they can be modeled mathematically. Now that we have covered the basics of 2D computer vision (on images) and are transitioning to 3D computer vision, we will discuss camera models and homogeneous coordinates, which are fundamental concepts in computer vision.

Films & Sensors

Fundamentally, visual perception is the ability to detect light and form an image based on it. When we see an object, we are sensing and interpreting the light of varying intensity and wavelength bouncing off the object and its surroundings. Therefore, if we can somehow capture the visible light (photo-, from the Greek for light) reaching an apparatus at a point in time, we can record or draw (-graph) the visual of a scene as an image: a photograph. Black and white film achieved this by capturing the intensity of visible light using silver halide crystals on a plastic base, which darken as they are exposed to light. By exposing the film for a short period of time, it creates a latent image of a scene, which can be chemically developed into a negative and printed to arrive at the final image.

As research progressed, color film was invented, with each of its layers reactive only to light in the spectrum corresponding to red, green, or blue. Furthermore, light-dependent resistors (LDRs), whose resistance changes with the intensity of light they are exposed to, and other light sensors were invented, leading to the birth of digital cameras. The digital camera uses color filter arrays, mainly the Bayer filter (RGGB), which arranges one red, one blue, and two green filters (two greens because human eyes are more sensitive to green) in a mosaic pattern, and performs demosaicing to produce color images.

Pinhole Camera Model

Though films and sensors are the most crucial components of cameras, simply exposing them to light cannot capture an object clearly, since all the light emitted from any point on the object could reach every point on the film (or sensor), even over a short exposure, creating a bright but blurry image. Hence, we can limit the light rays traveling from each point on the object to the film to one or a few by setting up a barrier with a small aperture (pinhole) between the object and the camera, arriving at a sharp image. This simple camera model is called the pinhole camera model, which is visualized below.

[Figure: Pinhole Camera Model]

The film or sensor is called the image (retinal) plane, the aperture at the center of the camera is called the pinhole $O$, and the distance between the image plane and the pinhole is the focal length $f$. Here, we define a camera coordinate system $[i, j, k]$, where $k$ is perpendicular to the image plane and points towards it. The camera maps or projects a point of a 3D object $P = [x, y, z]^T$ to a point $P' = [x', y']^T$ on the 2D image plane $\Pi'$. Using similar triangles, we find that the pinhole camera projects $P$ onto $\Pi'$ as $P' = [x', y']^T = [f\frac{x}{z}, f\frac{y}{z}]^T$.
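To make the projection concrete, here is a minimal NumPy sketch of this mapping; the point and focal length below are made-up values for illustration:

```python
import numpy as np

def pinhole_project(points, f):
    """Project 3D points (N x 3, in the camera frame) onto the image plane
    of a pinhole camera with focal length f: [x, y, z] -> [fx/z, fy/z]."""
    points = np.asarray(points, dtype=float)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.stack([f * x / z, f * y / z], axis=1)

# A point 2 m in front of the pinhole, 0.5 m to the right and 0.25 m up,
# projected with a 35 mm focal length
print(pinhole_project([[0.5, 0.25, 2.0]], f=0.035))  # [[0.00875  0.004375]]
```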

Paraxial Refraction Model

Though pinhole cameras can capture a scene sharply with a small enough pinhole, a small pinhole limits too much light and leads to a dimmer image. Hence, instead of using a simple pinhole (with its tradeoff between sharpness and intensity), we can use lenses to uniquely map all the light from $P$ to $P'$, maintaining the intensity while achieving sharpness. The illustration below shows the camera model that uses a thin lens, which is called the paraxial refraction model. In this model, parallel light rays refract and converge at a focal point, and the distance between the center of the lens and the focal point is called the focal length $f$.

[Figure: Paraxial Refraction Model]

The distance from the focal point to the image plane is $z_0$, and we can define the distance between the center of the lens and the image plane as $z' = f + z_0$. Analogously to the pinhole camera model, the paraxial refraction model maps $P$ to $P'$ as $P' = [x', y']^T = [z'\frac{x}{z}, z'\frac{y}{z}]^T$. In a digital camera, the coordinate system of the image typically differs from the camera coordinate system, which requires a translation by $[c_x, c_y]^T$. Also, we need to convert the physical measurements (cm) of width and height to pixels using the scale factors $k$ and $l$ (usually in $\frac{\text{pixels}}{\text{cm}}$), respectively. Hence, the projection of $P$ onto $\Pi'$ by a digital camera can be expressed as $P' = [x', y']^T = [\alpha\frac{x}{z} + c_x, \beta\frac{y}{z} + c_y]^T$, where $\alpha = fk$ and $\beta = fl$.
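In code, the pixel-space mapping is a small extension of the pinhole sketch above; the intrinsic values here are again made up for illustration:

```python
import numpy as np

def project_to_pixels(points, alpha, beta, cx, cy):
    """Map camera-frame 3D points (N x 3) to pixel coordinates, where
    alpha = f*k and beta = f*l fold the focal length and the
    physical-to-pixel scale factors together (no skew, no distortion)."""
    points = np.asarray(points, dtype=float)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.stack([alpha * x / z + cx, beta * y / z + cy], axis=1)

# Made-up intrinsics: 800-pixel focal lengths, principal point at (320, 240)
print(project_to_pixels([[0.5, 0.25, 2.0]], 800.0, 800.0, 320.0, 240.0))
# -> [[520. 340.]]
```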

Homogeneous Coordinates

From here onwards, we will be dealing with projections and applying various operations like rotation, translation, scaling, reflection, and shearing. However, among these, translation is the only operation that cannot be expressed as a linear transformation (a matrix multiplication) in the standard coordinate system and instead requires vector addition. This prevents us from expressing transformations as a chain of matrix multiplications, which are easier and faster to compute and manipulate.

Hence, instead of the standard coordinate system, we use homogeneous coordinates, which add an extra dimension to the standard coordinates, turning $[x, y]^T$ into $[x, y, 1]^T$ and $[x, y, z]^T$ into $[x, y, z, 1]^T$. A point $[x, y, w]^T$ in the homogeneous coordinate system can be converted back to $[\frac{x}{w}, \frac{y}{w}]^T$ in the standard coordinate system. The additional dimension enables us to express translation of a vector with matrix multiplication. Hence, the projection of a digital camera can be expressed in homogeneous coordinates as follows.

$$P' = \begin{bmatrix} \alpha x + c_x z \\ \beta y + c_y z \\ z \end{bmatrix} = \begin{bmatrix} \alpha & 0 & c_x & 0 \\ 0 & \beta & c_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = MP = K[I~~~\vec{0}]P$$

As you can see, the expression is more concise, as the translation by $[c_x, c_y]^T$ is now included in the matrix multiplication. The matrix that transforms $P$ to $P'$ in homogeneous coordinates is $M$, and $K$ is the $3 \times 3$ matrix obtained by decomposing $M$; it contains the important camera parameters and is called the camera matrix. Another benefit of homogeneous coordinates is that, unlike standard coordinate systems, they can express points, lines, and planes at infinity (with infinite distance and area) while preserving direction, by setting the extra dimension $w = 0$. Although this may seem irrelevant for now, it will be crucial later.
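To see both benefits concretely, here is a minimal NumPy sketch (with made-up parameter values) of a 2D translation performed as a single matrix multiplication, followed by the projection $M = K[I~~~\vec{0}]$ and the division by $w$:

```python
import numpy as np

# Translation as matrix multiplication, thanks to the extra dimension:
# shift the 2D point [2, 1] by [3, 5]
t = np.array([[1.0, 0.0, 3.0],
              [0.0, 1.0, 5.0],
              [0.0, 0.0, 1.0]])
print(t @ np.array([2.0, 1.0, 1.0]))  # [5. 6. 1.]

# Projection M = K [I | 0] with made-up intrinsics
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
M = K @ np.hstack([np.eye(3), np.zeros((3, 1))])

P = np.array([0.5, 0.25, 2.0, 1.0])  # homogeneous 3D point in the camera frame
p = M @ P                            # homogeneous 2D point [x', y', w]
print(p[:2] / p[2])                  # divide by w -> [520. 340.]
```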

Camera Matrix Model

As described above, the camera matrix contains parameters for focal length ($\alpha$ and $\beta$) and offsets ($c_x$ and $c_y$) between the camera coordinate system and the image coordinate system. These parameters depend on the camera itself and are therefore called intrinsic parameters. Aside from these, there are two additional parameters, skewness and distortion. Although most cameras have zero skew, manufacturing errors can cause the camera coordinate system to skew slightly. The following is the camera matrix that includes skewness.

$$K = \begin{bmatrix} \alpha & -\alpha \cot(\theta) & c_x \\ 0 & \frac{\beta}{\sin(\theta)} & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

Here, $\theta$ is the angle between $i$ and $j$ in the camera coordinate system (the derivation is outside the scope of this article). We often ignore distortion effects, arriving at a camera matrix with 5 intrinsic parameters. Although it may seem that the camera matrix entirely captures how a camera projects a 3D object to a 2D image plane, this is only the case when $P$ is expressed with respect to the camera coordinate system, which is often not so. $P$ is often expressed in an arbitrary world reference system, which requires an additional rotation and translation to align it with the camera coordinate system.

$$P' = MP = K[R~~~T]P$$

Here, $R$ is the rotation matrix (which could be the result of multiplying rotation matrices corresponding to rotations around the x, y, and z axes), and $T$ is the translation vector. The rotation involves 3 parameters (rotation angles around the 3 axes), and the translation also has 3 parameters (offsets along the 3 axes), resulting in 6 extrinsic parameters, which are independent of the camera itself (how the camera is manufactured). Hence, the full projection matrix $M$ has 11 degrees of freedom, consisting of 5 intrinsic parameters and 6 extrinsic parameters.
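As a rough end-to-end sketch, the following assembles $M = K[R~~~T]$ from hypothetical intrinsic and extrinsic values (with a rotation around the z-axis only, for brevity) and projects a world point to pixel coordinates:

```python
import numpy as np

def camera_matrix(alpha, beta, cx, cy, theta=np.pi / 2):
    """Camera matrix K from the 5 intrinsic parameters; theta is the
    angle between i and j, so theta = pi/2 means (numerically) zero skew."""
    return np.array([[alpha, -alpha / np.tan(theta), cx],
                     [0.0,    beta / np.sin(theta),  cy],
                     [0.0,    0.0,                   1.0]])

def rotation_z(psi):
    """Rotation around the z-axis (one of the 3 rotation parameters)."""
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

# Hypothetical intrinsics and extrinsics
K = camera_matrix(alpha=800.0, beta=800.0, cx=320.0, cy=240.0)
R = rotation_z(0.1)                        # extrinsic rotation
T = np.array([[0.2], [0.0], [1.0]])        # extrinsic translation

M = K @ np.hstack([R, T])                  # full 3x4 projection matrix, 11 DOF

P_world = np.array([0.5, 0.25, 2.0, 1.0])  # homogeneous world point
p = M @ P_world
print(p[:2] / p[2])                        # pixel coordinates
```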

Camera Calibration

In many cases, the intrinsic and extrinsic camera parameters are unknown, making it difficult to deduce information about $P$ from $P'$. However, these parameters can be inferred when the camera and enough corresponding pairs of $P$ and $P'$ are available, and this problem of estimating the camera parameters is called camera calibration. By collecting various points in the world reference system and their corresponding points on the image, we can arrive at a system of equations that can be solved for the entries $m$ of $M$, which can then be used to estimate the camera parameters.

Due to inevitable measurement errors and $m = 0$ always being a trivial solution, we constrain the solution with $\|m\|^2 = 1$ and minimize $\|Pm\|^2$ using singular value decomposition (SVD), where $P$ here denotes the data matrix built from the point correspondences. Although this calibration tends to work well, there are degenerate configurations where estimation is not possible, such as when all the points $P$ lie on the same plane. There is another way of performing camera calibration using a single image, which we will cover in the next article.
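As a sketch of this least-squares step, the standard direct linear transform (DLT) formulation can be written as follows; the array names and shapes are assumptions for illustration:

```python
import numpy as np

def calibrate_dlt(world_pts, image_pts):
    """Estimate the 3x4 projection matrix M from n >= 6 correspondences.
    world_pts: (n, 3) points in the world reference system;
    image_pts: (n, 2) corresponding pixel coordinates."""
    rows = []
    for (X, Y, Z), (u, v) in zip(world_pts, image_pts):
        Ph = np.array([X, Y, Z, 1.0])
        # Each correspondence contributes two linear equations in m
        rows.append(np.concatenate([Ph, np.zeros(4), -u * Ph]))
        rows.append(np.concatenate([np.zeros(4), Ph, -v * Ph]))
    A = np.array(rows)
    # The minimizer of ||A m||^2 subject to ||m||^2 = 1 is the right
    # singular vector associated with the smallest singular value
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)
```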

Conclusion

In this article, we covered how films and sensors work on a surface level, how we can model cameras, how we can express those models concisely, and a brief introduction to the camera calibration problem. Although it may not seem directly relevant to computer vision, understanding camera models allows us to deduce information about $P$ and the 3D objects captured in one or more images, and to understand approaches to 3D computer vision. In the next article, we will dive deeper into the properties of homogeneous coordinates and their applications in camera models.

Resources