Road to ML Engineer #68 - Neural Radiance Fields

Last Edited: 10/11/2025

This blog post introduces Neural Radiance Fields (NeRF) for 3D reconstruction and novel view synthesis in computer vision.

ML

So far, we have covered 3D computer vision tasks like MDE, stereo matching, feature tracking, and structure and motion, corresponding to single-view metrology, epipolar geometry, and structure from motion. However, as has been consistently mentioned in previous articles, these tasks still face difficulties and limitations related to the correspondence problem. Hence, in this article, we introduce Neural Radiance Fields (NeRF), an implicit deep learning counterpart to volumetric stereo that takes a different approach to avoid the correspondence problem entirely when tackling 3D reconstruction.

Radiance Fields

Volumetric stereo interprets 3D objects as collections of voxels on a 3D voxel grid, each with color and transparency ($r, g, b, \alpha$), and infers the voxel properties at each coordinate to match the images capturing the objects. Similarly, we can interpret 3D objects as collections of particles or points (rather than cubes) with color and density (which governs transparency), producing a point cloud consistent with the images. Here, color can be made dependent not only on the coordinates ($x, y, z$) but also on the viewing angles ($\psi, \phi$) to effectively model reflective, transparent, and other material surfaces (including smoke and other diffuse effects).

Radiance fields are vector fields, based on this modeling of 3D objects as point clouds, that map coordinates and angles to color and density, $L: X, Y, Z, \psi, \phi \to r, g, b, \sigma$. (We can further break $L$ into subcomponents like $L^c: X, Y, Z, \psi, \phi \to r, g, b$ and $L^{\sigma}: X, Y, Z \to \sigma$ for color and density to reflect their dependencies.) When modeled successfully, the parametrization of a radiance field becomes a compact alternative 3D representation, constructed without explicitly solving the correspondence problem, that is useful for 3D reconstruction and view morphing.
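
To make the mapping concrete, here is a minimal hand-written radiance field for a toy scene, a single opaque red sphere at the origin (a purely illustrative Python/NumPy sketch; the function name and the density value are my own assumptions, not part of any NeRF formulation):

```python
import numpy as np

def toy_radiance_field(xyz, view_dir):
    """Hypothetical radiance field L: (x, y, z, psi, phi) -> (r, g, b, sigma)
    for a red sphere of radius 1 centered at the origin. The viewing
    direction is accepted but ignored, since this toy surface is not
    view-dependent; a reflective material would vary the color with it."""
    inside = np.linalg.norm(xyz) < 1.0
    sigma = 50.0 if inside else 0.0        # dense inside the sphere, empty space outside
    rgb = np.array([1.0, 0.0, 0.0])        # constant red color
    return rgb, sigma
```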

Volume Rendering Equation

Once we have radiance fields, we can render images using volume rendering, which works by integrating radiance values along the camera ray. Specifically, this involves summing the colors of the particles along the ray, weighted by the density of those particles. The following is the volume rendering equation that produces a color $\ell(r)$ corresponding to the ray $r$ from $L^{\sigma}$ and $L^c$.

$$
\ell(r) = \int_{t_n}^{t_f} \alpha(t)\, L^{\sigma}(r(t))\, L^c(r(t), D)\, dt \\
\alpha(t) = \exp\left(-\int_{t_n}^{t} L^{\sigma}(r(s))\, ds\right)
$$

Here, $r(t)$ represents the coordinate along the ray $r$ at parameter $t$, $D$ is the unit direction vector of the ray, and $t_n$ and $t_f$ correspond to the near and far cutoff points of the integral. Hence, $\alpha(t)$ represents the transparency (transmittance) accumulated between the starting point and the point $r(t)$. If all particles up to that point have zero density, $\alpha(t)$ is 1 and does not attenuate the integrand. Conversely, when the density of a solid object is set to infinity, $\alpha(t)$ drops to zero immediately after the ray intersects the object's surface. (This model, consisting of empty space with zero density and solid objects with infinite density, is called ray casting and forms the basis of many renderers.)

$$
\ell(r) \approx \sum_{i=1}^{T} \alpha_i \left(1 - e^{-L^{\sigma}(R_i)\delta_i}\right) L^c(R_i, D) \\
\alpha_i = \exp\left(-\sum_{j=1}^{i-1} L^{\sigma}(R_j)\delta_j\right), \qquad \delta_i = t_{i+1} - t_i
$$

Since the volume rendering integral can be arbitrarily complex and infeasible to evaluate exactly, we can approximate it with a sum over discrete samples, as above. When reconstructing an image, we first pick a pixel $(n, m)$, which can be translated to the corresponding camera ray direction $D$ using the camera center $O_{\text{cam}}$ and camera matrix $K_{\text{cam}}$ obtained through camera calibration ($O = O_{\text{cam}}$, $\tilde{D} = K_{\text{cam}}^{-1} [n, m, 1]^T - O_{\text{cam}}$, and $D = \tilde{D} / \|\tilde{D}\|_2$). Then, we can sample $R_1, R_2, ..., R_T$ along the corresponding camera ray ($R = O + tD$) and use the above approximation of the integral to obtain $\ell$ for the ray corresponding to the pixel. An important observation here is that the entire rendering process can be made differentiable, suggesting the possibility of fitting a parameterized function $L_{\theta}$ to $L$ using gradient descent.
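
The following is a minimal sketch of this discrete quadrature (assuming NumPy, and assuming the ray origin $O$ and unit direction $D$ have already been computed from the pixel as described above; the function and argument names are illustrative):

```python
import numpy as np

def render_ray(field, O, D, t_near, t_far, T=64):
    """Approximate ell(r) with T uniform samples along the ray R = O + t * D.
    `field(point, direction)` returns (rgb, sigma), playing the role of
    L^c and L^sigma together. A minimal sketch, not NeRF's exact sampler."""
    t = np.linspace(t_near, t_far, T)                     # sample depths t_1..t_T
    R = O[None, :] + t[:, None] * D[None, :]              # sample points R_i
    rgb, sigma = zip(*[field(p, D) for p in R])           # L^c(R_i, D), L^sigma(R_i)
    rgb, sigma = np.array(rgb), np.array(sigma)
    delta = np.append(np.diff(t), t[1] - t[0])            # spacings delta_i = t_{i+1} - t_i
    accum = np.concatenate(([0.0], np.cumsum(sigma * delta)[:-1]))
    alpha = np.exp(-accum)                                # transmittance alpha_i up to R_i
    weights = alpha * (1.0 - np.exp(-sigma * delta))      # per-sample contribution
    return (weights[:, None] * rgb).sum(axis=0)           # approximated ell(r)
```

With the toy sphere field from earlier, `render_ray(toy_radiance_field, np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]), 0.1, 6.0)` returns a color very close to pure red, since the transmittance collapses to nearly zero once the ray enters the dense region.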

Neural Radiance Fields

Here, we can use gradient descent to train a neural network, which can approximate fairly complex functions and runs efficiently on GPUs that have long been used for computer graphics. Specifically, we can train an MLP as $L_{\theta}$ to approximate $L$ with the loss $\sum \|\ell(n, m) - \hat{\ell}(n, m)\|_2^2$, where $\ell$ and $\hat{\ell}$ are the ground-truth color from an image and the color predicted using $L_{\theta}$ and the volume rendering equation. This approach of fitting a neural network to the radiance field of an object or a scene is called Neural Radiance Fields (NeRF), and it has achieved tremendous success in producing compact implicit 3D representations for reconstruction and novel view synthesis.
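
As a sketch of what $L_{\theta}$ could look like as an MLP (assuming PyTorch; the depth and width here are illustrative and do not match the architecture in the NeRF paper, which splits the density and color heads and injects the viewing direction later in the network):

```python
import torch
import torch.nn as nn

class RadianceFieldMLP(nn.Module):
    """A minimal L_theta: maps (x, y, z, psi, phi) to (r, g, b, sigma)."""
    def __init__(self, in_dim=5, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, x):
        out = self.net(x)
        rgb = torch.sigmoid(out[..., :3])    # colors constrained to [0, 1]
        sigma = torch.relu(out[..., 3:])     # non-negative density
        return rgb, sigma

# Training sketch: render pixel colors with the differentiable quadrature from
# the previous section (render_pixels is a hypothetical helper) and minimize the
# squared photometric error against the ground-truth pixel colors.
#   loss = ((render_pixels(model, rays) - gt_colors) ** 2).sum()
#   loss.backward(); optimizer.step()
```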

However, NeRF does not implement the naive method described above, due to challenges regarding the low dimensionality of the input and the sampling strategy. To increase the dimensionality and separability of the 5D input and to sample coordinates intelligently, NeRF introduces sinusoidal positional encoding and hierarchical volume sampling. The positional encoding is almost identical to the one used in Transformers and is confirmed by ablation studies to contribute to capturing fine-grained details. Hierarchical volume sampling is performed by optimizing two networks, a coarse network that samples coordinates uniformly and a fine network that samples around regions the coarse network predicts to have high density, using the loss $\sum \|\ell(n, m) - \hat{\ell}_c(n, m)\|_2^2 + \|\ell(n, m) - \hat{\ell}_f(n, m)\|_2^2$, which is also confirmed by ablation studies to improve performance.
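
For reference, here is a sketch of the sinusoidal positional encoding $\gamma$ (assuming PyTorch; the NeRF paper uses 10 frequencies for coordinates and 4 for viewing directions, and this sketch omits the optional identity term):

```python
import math
import torch

def positional_encoding(x, num_freqs=10):
    """Map each input component p to (sin(2^0 pi p), cos(2^0 pi p), ...,
    sin(2^(L-1) pi p), cos(2^(L-1) pi p)), lifting the low-dimensional
    input into a higher-dimensional space."""
    freqs = (2.0 ** torch.arange(num_freqs)) * math.pi   # 2^k * pi for k = 0..L-1
    scaled = x.unsqueeze(-1) * freqs                     # (..., dim, num_freqs)
    enc = torch.cat([torch.sin(scaled), torch.cos(scaled)], dim=-1)
    return enc.flatten(-2)                               # (..., dim * 2 * num_freqs)

# positional_encoding(torch.tensor([0.5, -0.2, 0.1])).shape  # -> torch.Size([60])
```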

After training NeRF on a DeepVoxels dataset containing 4 Lambertian objects with simple geometry ($512 \times 512$ pixels, viewpoints sampled on the upper hemisphere) and a custom dataset containing 8 realistic, non-Lambertian objects with complex geometry ($800 \times 800$ pixels), it was compared with other approaches to neural 3D representations and novel view synthesis (Neural Volumes (NV), Scene Representation Networks (SRN), and Local Light Field Fusion (LLFF)) using several evaluation metrics. The results showed that it outperformed these other approaches on almost all metrics. The metrics used were Peak Signal-to-Noise Ratio (PSNR) for comparing the predicted image with the ground truth, Structural Similarity Index Measure (SSIM) for comparing luminance, contrast, and structure, and Learned Perceptual Image Patch Similarity (LPIPS) for comparing perceptual similarity and realism.
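
For reference, PSNR is just a logarithmic transform of the mean squared error between the predicted and ground-truth images; a quick sketch (assuming NumPy and images scaled to [0, 1]):

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak Signal-to-Noise Ratio in decibels; higher means the rendered
    image is closer to the ground truth."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```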

NeRF also requires only about 5 MB for the network weights to represent a scene, which is even smaller than the input images alone. (LLFF, which produces a voxel grid, requires over 15 GB for realistic scenes.) However, NeRF inevitably comes with some challenges. It typically takes approximately 1 to 2 days to train on a single scene on a GPU, and it may not be efficient enough for real-time inference, depending on the sample count and hardware. Also, volume rendering has an inherent limitation in that it does not model the physics of light reflections, so it only paints shadows and reflections as observed (although this limitation is not specific to NeRF, and NeRF is not designed to account for it in the first place).

Despite these challenges, NeRF has made a significant contribution to the field with its relatively simple approach and impressive performance, leading to the emergence of many research efforts in neural 3D representations (such as LERF, which combines CLIP to provide a language interface, and Instruct-NeRF2NeRF, which uses InstructPix2Pix to iteratively edit NeRF scenes with text instructions). The release of Nerfstudio, a simple API for creating, training, and testing NeRFs and related techniques, has also contributed immensely to NeRF's accessibility and impact.

Conclusion

In this article, we introduced radiance fields, volume rendering, and NeRF, which, despite some challenges, have made a significant contribution to research on neural 3D representations and novel view synthesis. I would highly recommend checking out the resources cited below and other relevant materials for further information and for experimenting with NeRFs.

Resources