This blog post introduces 3D Gaussian splatting in computer vision.

In the previous article, we introduced NeRFs, which use the volume rendering equation and approximate the radiance field of an object or scene with a neural network to achieve high-quality 3D reconstruction and novel view synthesis. However, NeRFs come with some challenges, most notably their lengthy rendering time. In this article, we discuss 3D Gaussian splatting, which takes a vastly different approach to 3D representation and rendering that not only made rendering fast enough for real-time applications but also improved the quality of novel view synthesis.
Rasterization
Volume rendering as used by NeRFs is comparable to ray tracing, the traditional rendering method in computer graphics that casts rays corresponding to the pixels into a 3D scene (often expressed as 3D meshes) and computes each pixel's color given the light sources. Ray tracing can produce high-quality views that account for light sources and light reflection physics at the expense of slow rendering, which is why it has been used in non-real-time scenarios like 3D graphics for movies. Volume rendering in NeRFs largely shares these characteristics, but NeRFs inherently cannot model light reflection physics and do not permit changes in light sources. Hence, we can argue that NeRF rendering does not inherit the largest benefit of ray tracing, physically grounded high-quality view synthesis, while carrying its drawback of slow rendering time.

The alternative rendering method, traditionally used for real-time applications like game engines, is rasterization, which transforms the visible parts of 3D meshes onto the image plane and adjusts colors and edges with highly optimized shading and antialiasing algorithms. It also handles depth and occlusion with a Z-buffer. We will not delve too deeply into the specifics of rasterization, but it essentially transforms the 3D representation directly onto the image and uses heuristics to account for lights and depths, achieving rendering speed suitable for real-time use with moderately high quality; the sketch below illustrates the core idea. If we are to improve the slow rendering speed that NeRFs suffer from, it is sensible to look for a differentiable and efficient rasterization method and a 3D representation appropriate for it. (3D triangular meshes with hard vertices and edges are hard to optimize, so an alternative representation is desired.)
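To make the contrast with ray tracing concrete, here is a minimal, illustrative sketch in Python of the core rasterization idea, projecting 3D points onto the image plane through the camera intrinsics and resolving occlusions with a Z-buffer. This is not how any production engine is written, and the function and variable names are my own.

```python
import numpy as np

def rasterize_points(points, colors, K, image_size):
    """Minimal point rasterizer: project 3D points (in camera coordinates)
    through the intrinsics K and resolve occlusions with a Z-buffer."""
    H, W = image_size
    image = np.zeros((H, W, 3))
    zbuffer = np.full((H, W), np.inf)  # nearest depth seen so far per pixel

    for p, c in zip(points, colors):
        x, y, z = p
        if z <= 0:  # behind the camera: clipped
            continue
        # Perspective projection onto the image plane.
        u = int(K[0, 0] * x / z + K[0, 2])
        v = int(K[1, 1] * y / z + K[1, 2])
        if 0 <= u < W and 0 <= v < H and z < zbuffer[v, u]:
            zbuffer[v, u] = z  # closer point wins the pixel
            image[v, u] = c
    return image

# Example: a red point 2 units in front of a 100x100 camera, focal length 50.
K = np.array([[50.0, 0.0, 50.0], [0.0, 50.0, 50.0], [0.0, 0.0, 1.0]])
img = rasterize_points(np.array([[0.0, 0.0, 2.0]]),
                       np.array([[1.0, 0.0, 0.0]]), K, (100, 100))
```

Real rasterizers operate on triangles with edge functions and hardware-optimized shading rather than loose points, but the project-then-depth-test structure is the same.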
3D Gaussian Splatting
For the 3D representation, we can use 3D Gaussians with means $\mu$ (positions), covariance matrices $\Sigma$ (shape and size of the Gaussians), and opacities $\alpha$. To represent the view-dependent colors of the Gaussians, we can use spherical harmonics. (These are essentially a compact parametrization of functions on the sphere, analogous to the Fourier transform for signals: the more spherical harmonics coefficients, the more complex the color patterns on the sphere surface that can be expressed.) The benefits of using 3D Gaussians are their higher expressivity than perfect spheres or points, their soft positions and edges that are easier to work with and optimize, and, most importantly, the fact that their projections are 2D Gaussians. Given the viewing transformation $W$ and the Jacobian $J$ of the affine approximation of the projective transformation, the covariance matrix in camera coordinates can be determined by $\Sigma' = J W \Sigma W^T J^T$. (The first two rows and columns of $\Sigma'$ form the 2D covariance matrix, which represents the shape and size of the Gaussian after projection.)
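As a concrete illustration of this projection, here is a small NumPy sketch of $\Sigma' = J W \Sigma W^T J^T$, assuming $W$ is the rotation part of the world-to-camera transform and folding the focal length into $J$ for simplicity; the names are illustrative rather than taken from the paper's code.

```python
import numpy as np

def project_covariance(Sigma, W, mean_cam):
    """Project a 3D Gaussian covariance to 2D: Sigma' = J W Sigma W^T J^T.

    Sigma:    3x3 world-space covariance of the Gaussian
    W:        3x3 rotation part of the world-to-camera (viewing) transform
    mean_cam: Gaussian mean in camera coordinates, (x, y, z)
    """
    x, y, z = mean_cam
    # Jacobian of the affine approximation of perspective projection,
    # evaluated at the Gaussian's mean (focal length set to 1 here).
    J = np.array([
        [1.0 / z, 0.0,     -x / z**2],
        [0.0,     1.0 / z, -y / z**2],
        [0.0,     0.0,      0.0],
    ])
    Sigma_cam = J @ W @ Sigma @ W.T @ J.T
    # The top-left 2x2 block is the covariance of the projected 2D Gaussian.
    return Sigma_cam[:2, :2]
```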
Here, it would be intuitive to directly optimize $\Sigma$ using the images, but $\Sigma$ has to be positive semidefinite, which is a hard constraint to maintain. Hence, we can use an alternative expression of $\Sigma$ as the configuration of an ellipsoid (a stretched or compressed sphere), which consists of a scaling matrix $S$ and a rotation matrix $R$, encoding stretching factors along 3 axes and rotations around 3 axes. We can thus model $\Sigma = R S S^T R^T$ and optimize for $S$ and $R$, along with the other parameters of the 3D Gaussians. The rendering method devised for the Gaussians is tile-based rasterization, which divides the image into 16×16 tiles, obtains for each tile the 3D Gaussians intersecting the view frustum with 99% confidence, performs a fast GPU Radix sort based on depth, and colors each tile with $\alpha$-blending (essentially $C = \sum_i c_i \alpha_i \prod_{j<i} (1 - \alpha_j)$, where we accumulate the weighted sum of colors front to back until the cumulative opacity reaches 1). This tile-based rasterization is highly parallelizable and efficient for real-time rendering, and it is shown to be differentiable, making it possible to perform gradient descent on the parameters of the 3D Gaussians using ground-truth and rendered images. (Please refer to the original paper for the details of the rasterizer and how the gradients flow to the parameters of the 3D Gaussians.)
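Both formulas in this paragraph translate directly into code. The sketch below, assuming a quaternion-parametrized rotation (as in the paper's implementation), builds $\Sigma = R S S^T R^T$ and performs front-to-back $\alpha$-blending for a single pixel; it captures the math only, not the paper's parallel CUDA rasterizer.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def covariance_from_scaling_rotation(scales, quat):
    """Sigma = R S S^T R^T: positive semidefinite by construction."""
    S = np.diag(scales)        # 3x3 scaling matrix (stretch along 3 axes)
    R = quat_to_rotmat(quat)   # 3x3 rotation matrix
    return R @ S @ S.T @ R.T

def alpha_blend(colors, alphas):
    """Front-to-back blending: C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j).
    Gaussians must already be sorted by depth (nearest first)."""
    C = np.zeros(3)
    transmittance = 1.0
    for c, a in zip(colors, alphas):
        C += transmittance * a * np.asarray(c)
        transmittance *= (1.0 - a)
        if transmittance < 1e-4:  # early termination: pixel is saturated
            break
    return C
```

The early-termination check mirrors the "until cumulative opacity reaches 1" condition: once almost no light passes through, later Gaussians cannot change the pixel.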

The initial set of 3D Gaussians can be isotropic (perfect spheres) and can be obtained randomly or from the point clouds produced by SfM during camera calibration. Then the parameters of the Gaussians can be optimized with gradient descent. However, the initial set of Gaussians is almost always too sparse, requiring densification, i.e., the addition of Gaussians. There are two scenarios where densification is needed: under-reconstruction (the Gaussian is too small to cover the volume) and over-reconstruction (the Gaussian is too large for the detail required), and it is experimentally observed that both come with large view-space positional gradients. Hence, every 100 iterations, when the gradient of a Gaussian is above a predefined threshold ($\tau_{\text{pos}}$ in the paper), we densify by splitting a large Gaussian with high variance into two smaller Gaussians, or by cloning a small Gaussian and moving the clone in the direction of the gradient. To moderately increase the number of Gaussians while avoiding floaters (nearly transparent blobs near the input cameras, which have been a quality-degrading problem for NeRFs too), we set all $\alpha$ values close to 0 every 3000 iterations, let the optimization increase the $\alpha$ values as necessary, and remove Gaussians with $\alpha$ smaller than a predefined threshold ($\epsilon_\alpha$) every 100 iterations.
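To summarize this adaptive density control loop, here is a schematic NumPy sketch. The size cutoff and pruning epsilon below are assumed values for illustration (the paper specifies $\tau_{\text{pos}} = 0.0002$ for the gradient threshold), and the real implementation operates on optimizer state on the GPU and also duplicates rotations and spherical harmonics coefficients.

```python
import numpy as np

TAU_POS = 0.0002     # densification gradient threshold (from the paper)
SIZE_CUTOFF = 0.01   # separates "small" (clone) from "large" (split); scene dependent
EPS_ALPHA = 0.005    # assumed opacity floor below which Gaussians are pruned

def adaptive_density_control(means, scales, opacities, grad_norms):
    """One schematic densification + pruning step over per-Gaussian arrays.

    means: (N, 3), scales: (N, 3), opacities: (N,), grad_norms: (N,).
    """
    hot = grad_norms > TAU_POS
    big = scales.max(axis=1) > SIZE_CUTOFF

    clone = hot & ~big  # under-reconstruction: duplicate a small Gaussian
                        # (the clone is then moved along the gradient direction)
    split = hot & big   # over-reconstruction: replace a large Gaussian with
                        # smaller copies (scale divided by ~1.6 in the paper)

    scales = np.where(split[:, None], scales / 1.6, scales)  # shrink split originals
    means = np.concatenate([means, means[clone], means[split]])
    scales = np.concatenate([scales, scales[clone], scales[split]])
    opacities = np.concatenate([opacities, opacities[clone], opacities[split]])

    keep = opacities >= EPS_ALPHA  # prune near-transparent Gaussians (floaters)
    return means[keep], scales[keep], opacities[keep]
```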
This technique, which combines the clever 3D representation using 3D Gaussians (whose covariance matrices are expressed as ellipsoid configurations), an efficient and differentiable rasterizer, and a careful training scheme with the above adaptive density control, is called 3D Gaussian splatting, and it not only achieved real-time rendering of highly complex scenes in high resolution (1080p) but also outperformed NeRFs' rendering quality without a complex and uninterpretable MLP. It received such recognition for its superior performance over NeRFs that it was implemented in Nerfstudio and has been licensed for many industrial applications today. Despite this success, the inherent limitation regarding light reflection physics remains to be solved. There are efforts to create datasets with different illuminations for relighting and deep learning approaches to shading the synthesized views, but these do not tackle the fundamental problem. One might suggest using more general rendering equations (like the one shown in the article by Torralba, Isola, & Freeman, n.d.) that model light-bouncing effects, though designing suitable 3D representations and efficient, differentiable renderers for them is much easier said than done.
Conclusion
In this article, we introduced rasterization and the basic concepts of 3D Gaussian splatting, which can be considered the current state of the art for novel view synthesis. If you are interested in the details of the rasterization process and the gradient computations, I recommend checking out the resources cited below, especially the original paper and its GitHub repository. If you are interested in pursuing research on this topic, I highly recommend diving much deeper into computer graphics and the mathematics of the concepts we have covered.
Resources
- Caulfield, B. 2018. Difference between ray tracing and rasterization. NVIDIA.
- Kerbl, B. et al. 2023. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics.
- Tancik, M. et al. 2023. Nerfstudio: A Modular Framework for Neural Radiance Field Development. arXiv.
- Niessner, M. 2025. TUM AI Lecture Series - The 3D Gaussian Splatting Adventure: Past, Present, Future (George Drettakis). YouTube.
- NVIDIA. n.d. Ray Tracing. NVIDIA Developer.
- Savarese, S. & Bohg, J. 2025. Lecture 17 Gaussian Splatting for Novel View Synthesis. Stanford.
- Torralba, A., Isola, P., & Freeman, W. n.d. 45 Radiance Fields. Foundations of Computer Vision.
- Yang, L. 2024. 驚くほどキレイな三次元シーン復元、「3D Gaussian Splatting」を徹底的に解説する [A Thorough Explanation of "3D Gaussian Splatting," an Astonishingly Beautiful 3D Scene Reconstruction Method]. Qiita.