Road to ML Engineer #55 - Conditional Generation

Last Edited: 4/28/2025

The blog post introduces methods to achieve conditional generation (CFG, CADS, etc.) in computer vision.


In the previous article, we covered progress in image generation using denoising diffusion probabilistic models (DDPMs), potentially built with U-Nets and Transformers. However, we only discussed unconditional image generation, which does not allow us to guide the generation toward the kinds of images we want and is therefore impractical. Hence, in this article, we will discuss ways to perform conditional image generation.

Classifier Guidance

Dhariwal, P. et al. (2021), inspired by GANs, proposed classifier guidance, a method that uses a pretrained image classifier during inference to guide the unconditional model towards the desired class. The method is based on the observation that the conditional generation $p_{\phi, \theta}(x_t | x_{t+1}, y)$ can be expressed as $Z p_{\theta}(x_t|x_{t+1}) p_{\phi}(y|x_t)$, where $Z$ is the normalizing constant, $p_{\theta}(x_t|x_{t+1})$ is the unconditional generation, and $p_{\phi}(y|x_t)$ is the classification, implying that conditional generation can be achieved by the unconditional generator $\theta$ and classifier $\phi$.

After mathematical derivations (available in the cited paper, which are out of scope for this article), they found that subtracting $\sqrt{1-\bar{\alpha}_t}\nabla_{x_t} \log p_{\phi}(y | x_t)$, the classifier gradient representing the changes needed to classify $x_t$ as $y$, from the noise $\epsilon_{\theta}(x_t)$ can nudge the model towards denoising the latents in the direction of images of the provided class. By scaling the gradient by a hyperparameter $s$, we can control how much the classifier gradient nudges the denoising.

The model behaves like the unconditional generator when $s=0$ and interpolates between unconditional and standard conditional generation when $0 < s < 1$. Though classifier guidance allows us to perform varying degrees of conditional generation during inference without modifying DDPMs, the signals from the classifier on the initial noisy images might be unhelpful or even adversarial. It also adds complexity to training, as the classifier needs to be trained on those noisy images.
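As a minimal sketch (not the authors' implementation), the guided noise update described above might look like the following, where `unet`, `classifier`, and `alpha_bar` are hypothetical stand-ins for a pretrained unconditional DDPM, a classifier trained on noisy images, and the cumulative noise schedule.

```python
import torch

def classifier_guided_eps(unet, classifier, x_t, t, y, alpha_bar, s=1.0):
    """Classifier guidance: nudge the predicted noise toward class y.

    unet(x_t, t) -> predicted noise epsilon_theta(x_t)     (hypothetical API)
    classifier(x_t, t) -> class logits for the noisy image (hypothetical API)
    alpha_bar: 1-D tensor of cumulative products of (1 - beta_t)
    t: integer timestep shared across the batch
    s: guidance scale; s = 0 recovers unconditional sampling
    """
    eps = unet(x_t, t)

    # Gradient of log p_phi(y | x_t) with respect to x_t.
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
        selected = log_probs[torch.arange(x_t.shape[0]), y].sum()
        grad = torch.autograd.grad(selected, x_in)[0]

    # Subtract the scaled classifier gradient from the predicted noise.
    return eps - s * torch.sqrt(1.0 - alpha_bar[t]) * grad
```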

Classifier Free Guidance

In addition to the drawbacks mentioned previously, classifier guidance might only be contributing to "deceiving" Inception-v3 into favorable IS and FID scores by using the gradients from a classifier pretrained on ImageNet, rather than generating genuinely realistic images. To consider alternatives, we can analyze the mathematics behind classifier guidance.

$$
p(x_t | x_{t+1}, y) = Z\,p(x_t|x_{t+1})\,p(y|x_t) \\
\nabla_x \log p(x_t | x_{t+1}, y) = \nabla_x \log p(x_t|x_{t+1}) + \nabla_x \log p(y|x_t)
$$

In classifier guidance, we parameterized $\nabla_x \log p_{\phi}(y|x_t)$ with a classifier $\phi$. However, we can derive it differently to apply guidance without a classifier, which is classifier-free guidance (CFG). Using Bayes' theorem, we can express $p(y|x_t)$ as $\frac{p(x_t|x_{t+1}, y)p(y)}{p(x_t | x_{t+1})}$. Taking the derivative of the log with respect to $x$, we can substitute $\nabla_x \log p_{\phi}(y|x_t)$ with $\nabla_x \log p(x_t|x_{t+1}, y) - \nabla_x \log p(x_t| x_{t+1})$ (the $\nabla_x \log p(y)$ term vanishes because $p(y)$ does not depend on $x$). With the hyperparameter $s$, we can rewrite the guidance as follows.

$$
\nabla_x \log p(x_t | x_{t+1}, y) = \nabla_x \log p(x_t|x_{t+1}) + s\left(\nabla_x \log p(x_t|x_{t+1}, y) - \nabla_x \log p(x_t|x_{t+1})\right) \\
= (1-s)\nabla_x \log p(x_t|x_{t+1}) + s\,\nabla_x \log p(x_t|x_{t+1}, y)
$$

This approach results in a gradient $\nabla_x \log p(x_t | x_{t+1}, y)$ comprising only the conditional generator, the unconditional generator, and the guidance weight, which acts as a hyperparameter controlling the implicit classifier gradient, or strength of conditioning. After mathematical derivations, we arrive at the new noise sampler $\hat{\epsilon}_{\theta}(x_t, y) = (1+s)\epsilon_{\theta}(x_t, y) - s\,\epsilon_{\theta}(x_t)$. Though we need to train a conditional model on top of the unconditional model, unlike the previous formulation, we can avoid the classifier, which was a major source of the drawbacks. Furthermore, creating a conditional model and training it jointly with the unconditional model is not overly complicated, as we can simply attach a class embedding to the input and occasionally drop it out (typically 10–20% of the time) during training.
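A minimal sketch of both sides of this recipe, assuming a hypothetical `model(x_t, t, y)` that predicts noise and a hypothetical `NULL_CLASS` token reserved for the unconditional case, could look like this.

```python
import torch

NULL_CLASS = 1000  # hypothetical "no condition" token appended to the label set

def cfg_eps(model, x_t, t, y, s=3.0):
    """Classifier-free guidance: eps_hat = (1 + s) * eps_cond - s * eps_uncond."""
    null_y = torch.full_like(y, NULL_CLASS)
    eps_cond = model(x_t, t, y)         # conditional noise prediction
    eps_uncond = model(x_t, t, null_y)  # unconditional noise prediction
    return (1.0 + s) * eps_cond - s * eps_uncond

def drop_labels(y, p_drop=0.1):
    """Training-time label dropout so one model learns both p(x|y) and p(x)."""
    mask = torch.rand(y.shape, device=y.device) < p_drop
    return torch.where(mask, torch.full_like(y, NULL_CLASS), y)
```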

We want strong control over the class of the generated images, so we generally prefer sampling with high guidance weights. (We also want strong conditioning for models pretrained on large datasets to achieve coherent image generation.) However, it has been empirically found that high guidance weights often result in less diversity and highly saturated images, caused by excessive guidance pushing values beyond the normalized pixel range of $[-1, 1]$. For the latter, we typically resort to dynamic thresholding, where a threshold $s$ is set to a chosen percentile (typically 99.5%) of the absolute pixel values of the prediction at each sampling step, and the prediction is clipped to $[-s, s]$ and rescaled by $s$.
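As a rough sketch of this idea (under the assumption that `x0_pred` is the model's clean-image prediction at the current sampling step), dynamic thresholding could be implemented roughly as follows.

```python
import torch

def dynamic_threshold(x0_pred, percentile=0.995):
    """Clip the predicted clean image to [-s, s] and rescale by s, where s is
    a per-sample percentile of absolute pixel values (but never below 1)."""
    flat = x0_pred.reshape(x0_pred.shape[0], -1).abs()
    s = torch.quantile(flat, percentile, dim=1)
    s = torch.clamp(s, min=1.0).view(-1, 1, 1, 1)
    return torch.clamp(x0_pred, -s, s) / s
```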

CADS

Though dynamic thresholding is effective in addressing high saturation to some degree, the original problem of diversity still needs to be addressed. Analyzing the problem, we suspect that the diversity issue could be caused by excessively high guidance weights at the earlier stages. While we want to maintain high guidance weights for less noisy images (later denoising steps) to ensure high quality and validity, applying high guidance weights to heavily noised images (earlier steps) can overly constrain the generation, leading it to produce the easiest-to-classify image and hurting diversity.

Hence, condition-annealed diffusion sampling (CADS) addresses this by adding scheduled Gaussian noise to the class embedding, going from strong noise early in sampling to weaker noise later. This allows the model to be less class-dependent initially, for diverse generation, and then gradually recover class dependence for high generation quality.

$$
\tilde{y} = \sqrt{\gamma(t)}\,y + s\sqrt{1-\gamma(t)}\,\epsilon, \quad \text{where}~\epsilon \sim N(0, I) \\
\gamma(t) = \begin{cases} 1 &\text{if } t \leq \tau_1 \\ \frac{\tau_2 - t}{\tau_2 - \tau_1} \in (0, 1) &\text{if } \tau_1 < t < \tau_2 \\ 0 &\text{if } \tau_2 \leq t \end{cases}
$$

Here, $\tau_1$ and $\tau_2$ are user-defined thresholds. Since $t$ goes backward during denoising, this approach applies more noise at the early stages and then gradually decreases the noise level. However, adding Gaussian noise gives $\tilde{y}$ a different mean and standard deviation than the original $y$ that the model was trained to see, potentially impacting performance. To mitigate this, we can rescale it back towards $y$ and control the extent of rescaling to preserve some level of diversity in image generation, using the following formula.

$$
\hat{y}_{\text{rescaled}} = \frac{\tilde{y} - \text{mean}(\tilde{y})}{\text{std}(\tilde{y})}\,\text{std}(y) + \text{mean}(y) \\
\hat{y} = \psi\,\hat{y}_{\text{rescaled}} + (1-\psi)\,\tilde{y}
$$

Here, $\psi$ is a hyperparameter controlling the extent of rescaling, and $\hat{y}$ is passed as the noised class embedding to the conditional model. With this approach and hyperparameter tuning (which you can check in the original paper cited at the bottom of the article), CADS managed to improve the diversity of image generation for virtually all the pretrained conditional models tested while maintaining image quality and FID. The paper also compared CADS with dynamic CFG, which simply anneals the weight on the conditional noise during inference, and confirmed the superiority of CADS regarding both diversity and quality. This superiority could be because the added noise changes the direction of the class embedding, which scalar weights cannot achieve.
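Putting the two formulas together, a minimal sketch of CADS-style condition annealing might look like this, with `tau1`, `tau2`, `s`, and `psi` as user-chosen hyperparameters, `t` assumed to be the normalized timestep in $[0, 1]$, and the embedding statistics computed over the whole embedding for simplicity.

```python
import torch

def cads_condition(y_emb, t, tau1, tau2, s=0.1, psi=1.0):
    """Anneal the class embedding y_emb at normalized timestep t in [0, 1].

    Early in sampling (large t) the condition is mostly noise, which encourages
    diversity; late in sampling (small t) the clean condition is restored.
    """
    # Piecewise-linear annealing schedule gamma(t).
    if t <= tau1:
        gamma = 1.0
    elif t >= tau2:
        gamma = 0.0
    else:
        gamma = (tau2 - t) / (tau2 - tau1)

    # Add scheduled Gaussian noise to the condition embedding.
    noise = torch.randn_like(y_emb)
    y_noisy = (gamma ** 0.5) * y_emb + s * ((1.0 - gamma) ** 0.5) * noise

    # Rescale back toward the original mean/std, mixed in by psi.
    y_rescaled = (y_noisy - y_noisy.mean()) / (y_noisy.std() + 1e-8)
    y_rescaled = y_rescaled * y_emb.std() + y_emb.mean()
    return psi * y_rescaled + (1.0 - psi) * y_noisy
```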

Conclusion

In this article, we discussed classifier guidance, CFG, and CADS, inference-time noise sampling methods (using a pretrained classifier or a conditional DDPM on top of an unconditional DDPM) that allow us to control the strength of conditioning and achieve diverse, high-quality conditional image generation based on class labels. Though we assumed that the condition $y$ is a class embedding throughout the article, these three methods are not constrained to class labels. For example, we can use various conditions, including image embeddings of rough doodles and text embeddings of more detailed captions, for more controlled and fine-grained generation. Therefore, in the next article, we will discuss how we can generate text embeddings that best tie together text and visual information for more practical and performant conditional image generation.

Resources