This blog post introduces AdaIN (and other normalization techniques) used in computer vision.

In the previous article on conditional DDPM, it was stated that creating a conditional model only involves attaching the condition embedding, but this is not always what is done in practice. In DiT, the condition embedding is instead used for adaptive layer normalization (AdaLN), a technique based on adaptive instance normalization (AdaIN), which in turn stemmed from instance normalization. Hence, in this article, we will cover several normalization techniques that we have not discussed before, which are used in various computer vision tasks and which led to AdaLN in DiT.
Group & Instance Normalization
Layer normalization solved several problems of batch normalization, including poor performance with small batch sizes, an inability to handle batches of variable-length inputs, and the mismatch between its training-time and inference-time operations. However, layer normalization did not perform as well on images as batch normalization, likely because batch normalization normalizes each channel while layer normalization normalizes each image irrespective of the channels, potentially disrupting important signals. As attempts to improve on layer normalization, group normalization normalizes a group of channels for each image, and instance normalization normalizes each channel of each image.

The above visualization depicts batch, layer, group, and instance normalization on an image batch, with each color representing a channel. By respecting the channel structure while normalizing each image, group normalization achieved performance competitive with batch normalization for large image batches and, unlike batch normalization, maintained that performance for smaller batches (approximately 16 images per batch), outperforming batch normalization in the small-batch regime. Instance normalization, however, performed relatively poorly, potentially because it cannot capture the correlations between channels and normalizes over fewer values, leading to weaker normalization.
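To make the difference concrete, here is a minimal PyTorch-style sketch (assuming an (N, C, H, W) image batch; the variable names and sizes are illustrative, not from any of the papers). The only thing that changes between the four techniques is the set of dimensions over which the mean and variance are computed; the learnable scale and shift applied afterwards are omitted here.

```python
import torch

x = torch.randn(8, 64, 32, 32)  # (N images, C channels, H, W)
eps = 1e-5

def normalize(x, dims):
    # Standardize over the given dimensions: zero mean, unit variance.
    mean = x.mean(dim=dims, keepdim=True)
    var = x.var(dim=dims, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

# Batch norm: one mean/variance per channel, computed over the batch and spatial dims.
bn = normalize(x, dims=(0, 2, 3))

# Layer norm: one mean/variance per image, computed over all channels and spatial dims.
ln = normalize(x, dims=(1, 2, 3))

# Instance norm: one mean/variance per image and per channel (spatial dims only).
inorm = normalize(x, dims=(2, 3))

# Group norm: split the channels into groups, then normalize each group per image.
groups = 8
gn = normalize(x.reshape(8, groups, -1, 32, 32), dims=(2, 3, 4)).reshape(x.shape)
```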
CIN & AdaIN
Though instance normalization did not work as well as batch normalization for image classification tasks, it proved more effective in style transfer, where the model aims to transfer the style of one image to another (e.g., transforming a photo-realistic image into the style of "The Starry Night"). This is likely because natural style transfer requires different adjustments to channel distributions for different images, and instance normalization provides the flexibility to perform these adjustments.
For more control over style transfer, conditional instance normalization (CIN) utilizes a different set of learnable parameters for each style ($\gamma_s$ and $\beta_s$), where the style $s$ is chosen randomly during training and manually during inference. While CIN performs well to some extent, it requires more learnable parameters and can only select from preset styles. Therefore, adaptive instance normalization (AdaIN) substitutes the learnable parameters, $\gamma$ and $\beta$, with $\sigma(y)$ and $\mu(y)$, where $y$ is the embedding of the image whose style is to be transferred. Due to these adjustments, AdaIN achieved high performance, robustness, and flexibility for style transfer with fewer parameters.
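As a rough sketch of the AdaIN operation itself (assuming (N, C, H, W) feature maps produced by some encoder such as VGG; the function name is ours), the content features are instance-normalized and then rescaled and shifted with the per-channel statistics of the style features:

```python
import torch

def adain(content_feat, style_feat, eps=1e-5):
    # content_feat, style_feat: (N, C, H, W) feature maps.
    # Instance-normalize the content, then shift and scale it with the
    # per-channel statistics of the style; no learnable per-style
    # parameters are needed, unlike CIN.
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content_feat - c_mean) / c_std + s_mean

content = torch.randn(1, 512, 32, 32)  # e.g. encoder features of the content image
style = torch.randn(1, 512, 32, 32)    # e.g. encoder features of the style image
stylized = adain(content, style)
```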
AdaLN
Instead of attaching CLIP embeddings (or any other condition embeddings) to the input patch embeddings, DiT uses adaptive layer normalization (AdaLN), where the scale and shift, $\gamma$ and $\beta$, are regressed from the condition embedding and used in place of the learnable parameters in standard layer normalization, so that the distributions of the predicted noise are influenced by the distributions of the conditions. This method avoids having repetitive condition signals on top of varying positional embeddings (DiT sums the time and class/CLIP embeddings to form the condition embedding for AdaLN) and naturally shifts the distributions, similar to style transfer with AdaIN, for high-quality and robust conditional image generation.
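Below is a minimal sketch of an AdaLN-style block (the module and parameter names are ours, not DiT's): a layer norm with no learnable affine parameters of its own, whose scale and shift are produced by a small MLP from the condition embedding. The `1 + gamma` formulation is a common implementation choice so that the modulation starts near the identity.

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    # Layer norm whose scale and shift are regressed from the condition embedding.
    def __init__(self, dim, cond_dim):
        super().__init__()
        # Plain layer norm with no learnable gamma/beta of its own.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # MLP that maps the condition embedding to per-dimension gamma and beta.
        self.to_scale_shift = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 2 * dim))

    def forward(self, x, cond):
        # x: (batch, tokens, dim) patch embeddings; cond: (batch, cond_dim).
        gamma, beta = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

adaln = AdaLN(dim=768, cond_dim=768)
x = torch.randn(4, 256, 768)   # patch embeddings
cond = torch.randn(4, 768)     # time embedding + class/CLIP embedding
out = adaln(x, cond)
```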
NOTE: AdaLN is not a requirement for performant conditional image generation. DiffiT (Diffusion Vision Transformer), which achieved a better FID score than LDM and DiT, simply takes the weighted sum of the conditional embeddings (class/CLIP embeddings + time embeddings) and the input image patches and uses relative positional bias (as seen in Swin Transformers) for self-attention in the latent space.
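As a loose illustration of that alternative (not DiffiT's actual TMSA block; all dimensions and names here are hypothetical), the condition can simply be projected and added to every patch token before ordinary transformer blocks take over:

```python
import torch
import torch.nn as nn

dim, cond_dim = 768, 768
to_token_space = nn.Linear(cond_dim, dim)  # learned weighting of the condition

patches = torch.randn(4, 256, dim)  # input image patch embeddings
cond = torch.randn(4, cond_dim)     # class/CLIP embedding + time embedding

# Fuse the condition into every patch token by (weighted) summation;
# the result is then fed to ordinary self-attention blocks.
conditioned = patches + to_token_space(cond).unsqueeze(1)
```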
Conclusion
In this article, we introduced new normalization techniques that build on batch and layer normalization, namely group and instance normalization, CIN, AdaIN, and AdaLN, in the context of computer vision. Each has strengths and weaknesses, and it's important to choose the technique that is most appropriate for the use case.
Resources
- Hatamizadeh, A. et al. 2024. DiffiT: Diffusion Vision Transformers for Image Generation. arXiv.
- Huang, X. & Belongie, S. 2017. Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization. arXiv.
- Ulyanov, D. et al. 2017. Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv.
- Wu, Y. & He, Y. 2018. Group Normalization. arXiv.