Road to ML Engineer #56 - Language-Image Pre-Training

Last Edited: 5/1/2025

This blog post introduces CLIP (and SigLIP) for computer vision.

ML

In the previous article, we discussed how class labels can be used to train conditional image generators and to guide their inference with CFG and CADS toward diverse, high-quality images. However, as mentioned, we would like greater and more granular control over generation than class labels alone can provide, and text prompts offer that capability. Hence, in this article, we will discuss CLIP and SigLIP, which give us useful text embeddings for image generation and other computer vision tasks, and which have various other use cases as well.

Zero-Shot CLIP

Inspired by the success of task-agnostic, web-scale pre-training in NLP, OpenAI explored adopting the same approach in computer vision and introduced CLIP. CLIP (Contrastive Language-Image Pre-training) consists of an image encoder and a text encoder, trained on image-caption pairs scraped from the web to map matching pairs to similar embeddings (and non-matching combinations to dissimilar embeddings) in a shared latent space, so that the embeddings capture useful visual and textual semantics.
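To make the training objective concrete, here is a minimal PyTorch-style sketch of the symmetric contrastive loss, loosely following the pseudocode in the CLIP paper. The encoders and projection heads that produce `image_features` and `text_features` are assumed to exist elsewhere; the function name and the `log_temperature` parameter are my own naming choices.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, log_temperature):
    """Symmetric contrastive loss over a batch of matched image-text pairs.

    image_features, text_features: [B, D] outputs of the two encoders
    log_temperature: learnable scalar (the temperature is its exponential)
    """
    # L2-normalize so that dot products become cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [B, B] matrix of scaled pairwise cosine similarities.
    logits = image_features @ text_features.t() * log_temperature.exp()

    # The i-th image matches the i-th caption, so the targets are the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image->text and text->image), averaged.
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2
```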

CLIP

One of the primary use cases of these embeddings is zero-shot classification, where a model performs classification on a test set without being trained on that dataset. By learning to convert images and texts into embeddings in the same latent space, we can perform zero-shot transfer to any classification task by utilizing the vector similarities between the embeddings of the image and texts describing candidate classes (e.g., "A photo of {label}").

The above visualizes the mechanism of CLIP. (The similarity scores are cosine similarities scaled by a learnable temperature; for more details on the approach, I recommend checking out the original paper cited at the bottom of the article.) We can choose arbitrary image and text encoders, such as a ResNet-50 or a ViT (and their variants) for images and a Transformer for text. Despite its simplicity, with prompt engineering that adds context to disambiguate class labels (e.g., "boxer" as a type of athlete versus "boxer" as a breed of dog), zero-shot CLIP achieved exceptional performance on some image classification tasks (generic, simple, and concrete ones), exceeding that of some fully supervised image classifiers.
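As an illustration, below is a rough sketch of how zero-shot classification might be wired up with a pretrained CLIP-style model; `encode_image`, `encode_text`, and `tokenize` stand in for the model's actual components, and the prompt template is just one example of the prompt engineering described above.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, class_names, encode_image, encode_text, tokenize):
    """Pick the class whose prompt embedding is most similar to the image embedding."""
    # Prompt engineering: wrap each label in a natural-language template
    # to disambiguate it and better match the distribution of web captions.
    prompts = [f"A photo of a {name}." for name in class_names]

    image_emb = F.normalize(encode_image(image), dim=-1)            # [1, D]
    text_emb = F.normalize(encode_text(tokenize(prompts)), dim=-1)  # [C, D]

    # Cosine similarity against every candidate class; no task-specific training.
    similarities = image_emb @ text_emb.t()                         # [1, C]
    return class_names[similarities.argmax(dim=-1).item()]
```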

Representation Learning

It also achieves impressive performance in representation learning, where the embeddings are utilized for various downstream computer vision tasks. The original paper compared linear models fit on frozen features from CLIP and from other pretrained models (a linear probe) across several tasks, including geo-localization, optical character recognition, and facial expression recognition, and observed superior performance of CLIP features on many of them. It also observed that CLIP models are more robust to natural distribution shifts than models pretrained on ImageNet, which can be attributed to the wide variety of images and captions acting as a regularizer that encourages the models to learn general semantics (though CLIP still generalizes poorly to truly out-of-distribution data).
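As a rough sketch of the linear-probe protocol, assuming features have already been extracted with a frozen encoder (the regularization strength below is arbitrary, not the swept value used in the paper):

```python
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(train_features, train_labels, test_features, test_labels):
    """Fit an L2-regularized logistic regression on frozen image features
    and report top-1 accuracy, as in a standard linear-probe evaluation."""
    classifier = LogisticRegression(C=1.0, max_iter=1000)  # C chosen arbitrarily here
    classifier.fit(train_features, train_labels)
    return classifier.score(test_features, test_labels)
```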

The original paper studied CLIP extensively and identified several limitations, such as a tradeoff between effective robustness and task-specific performance, poor sample efficiency in few-shot learning, the infeasible amount of compute that scaling laws suggest would be needed for state-of-the-art performance on some tasks, and the biases that come with web-scale pretraining. CLIP also cannot generate captions, and it is not trained for more granular tasks like object detection (I highly recommend checking out the paper for more details on the analysis). However, CLIP presented an interesting research direction toward achieving high performance and robustness on many computer vision tasks (including text-image retrieval) without requiring task- and domain-specific knowledge and skills.

The discussion above mainly concerned the use of the image encoder's embeddings for downstream tasks, but the text encoder's embeddings have various use cases as well. Here, we can refer back to the context of conditional image generation: the CLIP text encoder's embeddings, which carry rich visual semantics, are widely used as the condition for conditional image generation with DDPMs (and sampling techniques like CADS) in place of class embeddings, which provides useful signals to the generator and gives us greater control over the generated images. This reveals the significance of CLIP in the recent advancement of computer vision and multimodal learning.
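As a purely hypothetical sketch (not the architecture of any particular generator), swapping class conditioning for text conditioning can be as simple as projecting a frozen CLIP text embedding into the denoiser's conditioning space:

```python
import torch.nn as nn

class TextConditionedDenoiser(nn.Module):
    """Hypothetical wrapper: a frozen CLIP text embedding replaces the class
    embedding as the conditioning signal of a diffusion denoiser."""

    def __init__(self, denoiser, text_embed_dim, cond_dim):
        super().__init__()
        self.denoiser = denoiser  # e.g., a conditional UNet expecting a condition vector
        self.project = nn.Linear(text_embed_dim, cond_dim)

    def forward(self, noisy_image, timestep, text_embedding):
        # The projected text embedding goes wherever the class embedding used to
        # go (e.g., added to the timestep embedding or fed to cross-attention).
        condition = self.project(text_embedding)
        return self.denoiser(noisy_image, timestep, condition)
```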

SigLIP

One of the drawbacks of CLIP is the memory required during training, which stems from the need to materialize the entire matrix of pairwise similarities between image and text embeddings before applying the softmax (categorical cross-entropy) loss. Compounding this inefficiency, the quality of CLIP models depends on a sufficiently large batch size, as a larger batch provides more contrastive pairs to optimize against (though excessively large batch sizes worsen the imbalance between positive and negative pairs and eventually degrade performance).
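A quick back-of-the-envelope calculation illustrates the issue: the pairwise logit matrix alone (before gradients, activations, or the softmax's intermediate buffers) grows quadratically with the batch size.

```python
# Memory for the B x B logit matrix in float32, ignoring everything else.
for batch_size in (8_192, 32_768, 98_304):
    gib = batch_size ** 2 * 4 / 2 ** 30
    print(f"batch {batch_size:>6}: {gib:6.2f} GiB of logits")
```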

To make the loss computation for each pair independent and parallelizable, avoid having to materialize the full similarity matrix at once, and potentially improve model performance at smaller batch sizes, SigLIP reframes the problem as binary classification over every image-text pair, with labels in {-1, 1}, and uses a pairwise log-sigmoid loss (mathematically equivalent to a binary cross-entropy on each pair) for efficient computation. The loss is formulated as follows.

$$L_{i,j} = -\log \sigma\!\left(z_{i,j}\,(t\, x_i \cdot y_j + b)\right) = -\log\frac{1}{1 + e^{-z_{i,j}\,(t\, x_i \cdot y_j + b)}}$$

$$L = \frac{1}{|B|} \sum_{i=1}^{|B|} \sum_{j=1}^{|B|} L_{i,j}$$

Here, $x_i$ and $y_j$ are the (normalized) image and text embeddings, $z_{i,j}$ is the label (1 for the diagonal, i.e., matching pairs, and -1 for the rest), $\sigma$ is the sigmoid function, and $t$ and $b$ are a learnable temperature and bias. When the label is positive, the loss behaves as in ordinary logistic regression: as $x_i \cdot y_j$ increases, the sigmoid's output approaches 1 and the loss shrinks. When the label is negative, the argument is flipped, so the loss instead grows as $x_i \cdot y_j$ increases and shrinks as it decreases. Since negative pairs vastly outnumber positive ones within a batch, $t$ and $b$ are initialized to $10$ (by parameterizing $t = e^{t'}$ and setting $t' = \log(10)$) and $-10$, respectively, which keeps the loss on the dominant negative pairs small at the start and eases training.
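Below is a minimal PyTorch-style sketch of this loss, following the pseudocode in the SigLIP paper; as before, the encoders producing the features are assumed to exist elsewhere.

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_features, text_features, log_temperature, bias):
    """Pairwise sigmoid loss with labels +1 on the diagonal and -1 elsewhere.

    log_temperature is initialized to log(10) and bias to -10, so the many
    negative pairs start out with a near-zero loss.
    """
    x = F.normalize(image_features, dim=-1)
    y = F.normalize(text_features, dim=-1)

    logits = x @ y.t() * log_temperature.exp() + bias      # [B, B]

    # +1 for matching (diagonal) pairs, -1 for all other combinations.
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1

    # Each pair is an independent binary problem: no softmax over the full
    # row or column is needed, so the loss decomposes elementwise.
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```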

With this formulation, SigLIP outperformed CLIP on ImageNet zero-shot transfer for smaller batch sizes, and SigLIP achieved peak performance at a 32K batch size, slightly exceeding that of CLIP at 98K batch size. (Further increases in batch size diminished the performance gain of SigLIP and also degraded the performance of both models.) It was also observed that SigLIP is potentially more robust to noise than CLIP, though its robustness to natural distribution shift has not been studied.

Conclusion

In this article, we covered CLIP and SigLIP, which can be utilized for zero-shot transfer and representation learning for downstream tasks, including conditional image generation. Though SigLIP has reported slightly higher performance, better robustness to noise, and lower memory requirements during training, CLIP remains a popular choice because it is well studied (in terms of robustness to natural distribution shift, biases, and limitations) and has stood the test of time, and the cost of pre-training is irrelevant for most researchers working on downstream tasks. In fact, many recent conditional image generators still condition on CLIP text embeddings (often combined with CADS sampling) rather than SigLIP, though using SigLIP remains a viable choice.

Resources