This blog post introduces denoising diffusion probabilistic models in computer vision.

In the previous articles in this series, we discussed GANs and VAEs for image generation and then moved on to segmentation and object detection, which are presumably more complex tasks. However, all the generative models we have covered so far leave room for improvement, especially when it comes to generating photorealistic, high-resolution images. Hence, in this article, we will discuss the improvements in image generation models that have contributed to the impressive performance of recent systems.
Inception Score & FID
Before diving into generative models, however, we need to discuss metrics for evaluating generated images beyond human observation, which we have not yet covered. (This is because we had not covered the prerequisites in the articles on VAEs and GANs, and for the DVAE we focused more on obtaining discrete latent representations than on reconstruction quality.) There are two main metrics in use: Inception Score (IS) and Fréchet Inception Distance (FID).
Inception Score is computed from the KL divergence between the conditional probability distribution of labels predicted by Inception-v3 (a CNN pretrained on ImageNet) given the generated images and the marginal probability distribution of labels: $\text{IS} = \exp\big(\mathbb{E}_{x \sim p_g}\big[D_{KL}\big(p(y \mid x) \,\|\, p(y)\big)\big]\big)$. We want the generator to generate images that can be clearly classified by Inception-v3, resulting in a skewed (low-entropy) conditional distribution $p(y \mid x)$. Simultaneously, we want the generator to produce images covering diverse labels, resulting in a uniform (high-entropy) marginal distribution $p(y)$. Thus, it is sensible to use the KL divergence between them, which increases as the conditional probability becomes more skewed and the marginal probability becomes more uniform.
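As a concrete illustration, here is a minimal sketch (in Python with NumPy) of how IS could be computed from a matrix of classifier softmax outputs. The `probs` matrix and the Dirichlet samples below are stand-ins for real Inception-v3 outputs, and practical implementations usually also average the score over several splits.

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """Compute IS from softmax outputs of a classifier on generated images.

    probs: array of shape [N, num_classes], each row is p(y|x) for one image.
    """
    # Marginal label distribution p(y), averaged over all generated images.
    p_y = probs.mean(axis=0, keepdims=True)
    # KL(p(y|x) || p(y)) for each image, then average over images.
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Example: 5000 fake "softmax outputs" drawn from a Dirichlet distribution.
rng = np.random.default_rng(0)
fake_probs = rng.dirichlet(alpha=np.ones(1000), size=5000)
print(inception_score(fake_probs))
```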
Though IS is widely used in the domain, it has some drawbacks we need to be aware of. The most obvious and significant is its reliance on Inception-v3, which is not the best image classification model, is pretrained only on ImageNet (1,000 classes), and has a limited input size ($299 \times 299$). It assesses quality using the skewness of the conditional probability distribution instead of comparing against real images, so it might not capture the quality of high-resolution, photorealistic images with a broader set of classes and multiple objects. A generative model can also game the metric by producing essentially the same image for each label, closely resembling the images in ImageNet, to achieve apparent diversity and a high IS.
To alleviate the quality issue, we also use FID, which compares the distributions (assumed to be Gaussian) of the features extracted by Inception-v3 from real and generated images: $\text{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big)$, where the subscripts $r$ and $g$ denote real and generated images, $\mu$ and $\Sigma$ are the mean and covariance of the features, and $\mathrm{Tr}$ is the trace function, which takes the sum of all the diagonal entries. Though FID lets us use real images to incorporate how realistic the generations are into the quality evaluation, we are still relying on Inception-v3 pretrained on ImageNet. We also need large samples to keep the Gaussian assumption plausible, which makes FID more computationally expensive, and yet it only uses the mean and covariance, which might not capture finer nuances. Hence, it is important to keep exploring other potential metrics and to use human evaluation, just as in other generative tasks like text and audio generation.
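Similarly, here is a minimal sketch of the FID computation from precomputed feature matrices; the random arrays below merely stand in for real Inception-v3 activations.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Compute FID between two sets of feature vectors of shape [N, D]."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the product of the covariances.
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from numerics
        covmean = covmean.real
    return float(((mu_r - mu_g) ** 2).sum() + np.trace(sigma_r + sigma_g - 2 * covmean))

# Example with random "features" standing in for Inception-v3 activations.
rng = np.random.default_rng(0)
print(fid(rng.normal(size=(2048, 64)), rng.normal(loc=0.1, size=(2048, 64))))
```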
Hierarchical VAEs
VAEs are interesting from both a Bayesian perspective (which we intentionally avoid discussing here and might cover in the future) and a deep learning perspective, as they offer both data compression and image generation by mapping inputs to a distributional latent space. However, VAEs face a unique challenge: a sufficiently large decoder can memorize all the images and disregard the latents, preventing the encoder from learning to capture useful latents. Also, the latents might be too small to capture the various features, resulting in significant information loss.
The intuitive solution to this problem is to introduce intermediate latents and gradually capture features before reaching the final latent (which is usually a standard Gaussian), thereby introducing inductive bias and, at least in theory, making learning easier. However, this hierarchical VAE approach suffers from posterior collapse, where the models quickly learn to produce a standard Gaussian distribution whose samples even a small decoder learns to ignore as useless Gaussian noise. Due to posterior collapse, hierarchical VAEs perform worse than plain VAEs in practice, unless tricks like residual connections are used to encourage the models to use the last latent for reconstruction.
Auto-Regressors
While VAEs struggled with image generation quality, auto-regressors achieved better results. By leveraging correlations between pixels, they sample one pixel at a time conditioned on the previously generated ones, producing coherent images through ancestral sampling. The flip side is that they learn no latent representation and cannot generate many neighboring, correlated pixels per iteration, since they depend on those conditional probabilities, or correlations, to narrow down the possible pixel values and avoid generating the average of all possibilities. This results in a large number of iterations, on the order of the number of pixels, to generate a single image.
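To make the procedure concrete, here is a sketch of ancestral sampling with a hypothetical autoregressive pixel model; `model` and its interface (per-pixel categorical logits, raster-scan conditioning) are assumptions in the spirit of PixelCNN-style models, not a specific library API.

```python
import torch

@torch.no_grad()
def ancestral_sample(model, height: int, width: int, num_values: int = 256):
    """Generate one grayscale image pixel by pixel, in raster-scan order."""
    img = torch.zeros(1, 1, height, width)  # start from an empty canvas
    for y in range(height):
        for x in range(width):
            logits = model(img)                       # [1, num_values, H, W]
            probs = logits[0, :, y, x].softmax(dim=0)
            # Sample this pixel conditioned on all previously sampled pixels.
            img[0, 0, y, x] = torch.multinomial(probs, 1).item() / (num_values - 1)
    return img
```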
We can interpret auto-regressors through the lens of a hierarchical VAE to find the differences that led to better performance. They utilize a fixed encoder, implicitly removing some uncorrelated pixels to produce intermediate latents (with the same size as the input) and a final latent (where all pixels are removed). The decoder then iteratively processes these latents to gradually recover the original image, focusing solely on learning the conditional probabilities, or correlations, between latents. This suggests a potential issue with hierarchical VAEs: the heavily parameterized encoder might be unnecessary complexity. We can also speculate that making the model process as many uncorrelated pixel values as possible per iteration could significantly improve performance and speed.
Denoising Diffusion Probabilistic Models
Here, we can envision a fixed encoder that adds Gaussian noise until the latent becomes pure standard Gaussian noise, and a parameterized decoder that gradually denoises the latent to recover the input. This way, all the noise added to the pixel values is independently sampled and uncorrelated, allowing the model to focus on capturing as much correlation between latents as possible for auto-regressive image generation with far fewer iterations (on the order of 1,000). This approach is referred to as denoising diffusion probabilistic models (DDPMs) and is arguably the most common approach among diffusion probabilistic models (DPMs). (There are other types, like score-based generative models and flow-based diffusion models, which we might cover in the future.)
The resulting training objective (the negative ELBO) decomposes into three terms:

$$\mathcal{L} = \mathbb{E}_q\Big[\underbrace{D_{KL}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{\text{prior matching}} + \sum_{t=2}^{T} \underbrace{D_{KL}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{\text{denoising matching}} \underbrace{-\log p_\theta(x_0 \mid x_1)}_{\text{reconstruction}}\Big]$$

Here, $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I\big)$ is the encoder, which adds noise with a variance schedule like $\beta_1 < \beta_2 < \cdots < \beta_T$, and $p_\theta(x_{t-1} \mid x_t)$ is the parameterized decoder. The first term enforces the last latent $x_T$ to follow the standard Gaussian $\mathcal{N}(0, I)$ when the noise schedule in the encoder is trainable (with a fixed schedule, it is a constant). The second term trains the decoder to denoise each intermediate latent into the next, and the final term is the reconstruction loss, which can be simplified to the MSE between $x_0$ and the decoder output. Using the notations $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, we can express $q(x_t \mid x_0)$ as $\mathcal{N}\big(x_t; \sqrt{\bar{\alpha}_t}\,x_0, (1-\bar{\alpha}_t) I\big)$, which can be reparameterized as $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$.
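A minimal sketch of this fixed forward (noising) process; the linear $\beta$ schedule and the image sizes below are assumptions for illustration.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # assumed linear variance schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) via x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps."""
    abar = alpha_bars[t].view(-1, 1, 1, 1)   # broadcast over [B, C, H, W]
    return abar.sqrt() * x0 + (1 - abar).sqrt() * noise

# Example: noise a batch of 8 images at random timesteps.
x0 = torch.rand(8, 3, 32, 32) * 2 - 1        # images scaled to [-1, 1]
t = torch.randint(0, T, (8,))
xt = q_sample(x0, t, torch.randn_like(x0))
```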

Instead of making the decoder learn to produce those intermediate latents (noisy images) directly, we can make it learn to predict the noise factor $\epsilon$, which can then be subtracted from the intermediate latent (computed in previous iterations) to recover the original image. This makes the task simpler for the decoder (since it can focus on recovering the original image at every step during training, instead of learning to gradually move towards it), and the loss can be computed simply as the MSE between the predicted noise $\epsilon_\theta(x_t, t)$ and the actual noise $\epsilon$. (This holds when the schedule is kept constant. If it is made learnable, we need to consider additional terms, which you can check in the original paper cited at the bottom of the article.)
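Under these simplifications, one training step could look like the following sketch, reusing `T` and `q_sample` from the previous snippet; `model` stands for any noise-prediction network taking $(x_t, t)$.

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(model, x0: torch.Tensor, optimizer) -> float:
    """One gradient step of the simplified DDPM objective: MSE on the noise."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # random timesteps
    noise = torch.randn_like(x0)
    xt = q_sample(x0, t, noise)        # forward-diffuse the clean images
    pred_noise = model(xt, t)          # decoder predicts the added noise
    loss = F.mse_loss(pred_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```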
We choose to denoise indirectly, by predicting and subtracting the noise rather than directly inferring the denoised image, also because we start from pure standard Gaussian noise. If we trained the decoder to predict the denoised image directly from standard Gaussian noise, it would end up predicting the average of all the training images during inference (there is no useful correlation to exploit), which would be an invalid, blurry mess. Meanwhile, the average of noise samples is still valid noise, which can be removed autoregressively to arrive at a valid denoised image. By fixing the encoder so the model can focus on image generation, maximizing the number of uncorrelated pixel values adjusted per iteration by adding and denoising Gaussian noise, and making the decoder predict the noise factor, DDPMs improved upon hierarchical VAEs and auto-regressors in both performance and speed.
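For completeness, here is a sketch of the resulting denoising (sampling) loop, again reusing the schedule tensors defined earlier; the variance choice $\sigma_t^2 = \beta_t$ follows the simplest option in the DDPM paper.

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape=(1, 3, 32, 32)) -> torch.Tensor:
    """Start from pure Gaussian noise and iteratively subtract predicted noise."""
    x = torch.randn(shape)                                    # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = model(x, torch.full((shape[0],), t))            # predicted noise
        coef = (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)  # add sampling noise
        else:
            x = mean                                          # final step: no noise
    return x
```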
U-Net & LDM & DiT
For DDPMs, we train the decoder to denoise latents with the same size as the input, meaning we want the denoising network to have identical input and output sizes. We have actually seen such an architecture before, in the article on semantic segmentation and U-Nets. It turns out that a U-Net architecture, with skip connections for stable gradients, demonstrates strong performance on simple denoising tasks and also as the decoder in DDPMs.
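To illustrate the shape-preserving property, here is a toy two-level U-Net sketch with a single skip connection; the channel counts and layer choices are arbitrary, far smaller than what is used in practice, and it omits the time conditioning discussed next.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy U-Net: same spatial input/output size, with one skip connection."""
    def __init__(self, ch: int = 3, hidden: int = 64):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(ch, hidden, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(nn.Conv2d(hidden, hidden * 2, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.Sequential(nn.ConvTranspose2d(hidden * 2, hidden, 4, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Conv2d(hidden * 2, ch, 3, padding=1)  # takes concatenated skip

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.enc(x)                             # full-resolution features
        bottleneck = self.down(h)                   # downsample by 2
        u = self.up(bottleneck)                     # back to full resolution
        return self.dec(torch.cat([u, h], dim=1))   # skip connection, then project

x = torch.randn(2, 3, 32, 32)
print(TinyUNet()(x).shape)                          # torch.Size([2, 3, 32, 32])
```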
The U-Net architecture is not limited to convolutional layers and can be built with self-attention layers and even shifted-window attention layers. In fact, the original Stable Diffusion, which received recognition for its impressive image generation quality, utilized a time-conditional U-Net (which attaches sinusoidal time embeddings on top of positional embeddings) with an attention mechanism as the denoising network. Another contribution from Stable Diffusion is the introduction of the latent diffusion model (LDM). Since we work with latents the same size as the input, the model tends to become gigantic, making it difficult to train and slow at inference. Hence, LDMs introduce a pretrained VAE that produces smaller latent representations to which we apply the denoising, making the model smaller while maintaining performance.
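As an illustration of the time conditioning, here is a sketch of sinusoidal timestep embeddings, the same construction as transformer positional encodings but indexed by the diffusion step; the embedding dimension is an arbitrary choice.

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int = 128) -> torch.Tensor:
    """Map integer timesteps of shape [B] to sinusoidal embeddings of shape [B, dim]."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]               # [B, half]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

emb = timestep_embedding(torch.tensor([0, 10, 999]))
print(emb.shape)  # torch.Size([3, 128])
```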
Aside from the U-Net, we have been working extensively with another architecture that preserves input and output size: the transformer. As we did for object detection, we can use a ViT, which, unlike the U-Net, has no downsampling or bottleneck, and make it output the means and variances of the Gaussian distributions for the noise; this is called a diffusion transformer (DiT). If we don't care about the large number of parameters (which large institutions often don't, given their computational resources), it makes sense to use a DiT for DDPM. Even if we do, we can use it in an LDM to reduce the latent size and get away with fewer parameters, just like Stable Diffusion 3.
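A toy sketch of the DiT idea (patchify, transformer blocks, unpatchify) is shown below; it only predicts the noise, whereas real DiTs also predict the covariance and condition on the timestep via adaLN rather than the crude additive embedding used here.

```python
import torch
import torch.nn as nn

class TinyDiT(nn.Module):
    """Toy DiT-style noise predictor: patchify, transformer blocks, unpatchify."""
    def __init__(self, img: int = 32, patch: int = 4, ch: int = 3, dim: int = 128):
        super().__init__()
        self.p = patch
        n = (img // patch) ** 2
        self.embed = nn.Linear(ch * patch * patch, dim)
        self.pos = nn.Parameter(torch.zeros(1, n, dim))     # learned positional embeddings
        self.time = nn.Linear(1, dim)                       # crude stand-in for adaLN conditioning
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(dim, ch * patch * patch)

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        p = self.p
        # Patchify: [B, C, H, W] -> [B, num_patches, C*p*p]
        h = x.unfold(2, p, p).unfold(3, p, p).permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        h = self.embed(h) + self.pos + self.time(t.float().view(B, 1)).unsqueeze(1)
        h = self.out(self.blocks(h))
        # Unpatchify back to image shape to predict per-pixel noise.
        h = h.reshape(B, H // p, W // p, C, p, p).permute(0, 3, 1, 4, 2, 5)
        return h.reshape(B, C, H, W)

print(TinyDiT()(torch.randn(2, 3, 32, 32), torch.tensor([10, 500])).shape)
```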
Conclusion
In this article, we covered IS and FID, the metrics commonly used for evaluating generated images, and the natural progression from hierarchical VAEs and auto-regressors to denoising diffusion probabilistic models from a deep learning perspective (mainly mechanical, rather than Bayesian or information-theoretic). We also briefly discussed LDMs and found U-Nets and DiTs to be the obvious architecture choices for the denoising network. In the next article, we will discuss conditional image generation, which allows for more control over the generation and opens up practical applications.
Resources
- Algorithmic Simplicity. 2024. Why Does Diffusion Work Better than Auto-Regression?. YouTube.
- Ho, J. et al. 2020. Denoising Diffusion Probabilistic Models. ArXiv.
- mm_0824. 2021. Understanding Frechet Inception Distance (FID). 楽しみながら学ぶ機械学習 / 自然言語処理入門.
- mm_0824. 2021. Understanding Inception Score. 楽しみながら学ぶ機械学習 / 自然言語処理入門.
- Rombach, R. et al. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. ArXiv.
- Tomczak, J. M. n.d. Diffusion-based Models. jmtomczak.
- Tomczak, J. M. n.d. Hierarchical VAEs. jmtomczak.