This blog post introduces the Segment Anything models in computer vision.

Before discussing image generation, our focus has been on object detection models and their components, which can also be used for instance segmentation (or panoptic segmentation). However, data collection for segmentation tasks has always been difficult, limiting their effectiveness and accessibility. In response to these limitations, researchers at Meta developed the Segment Anything Model (SAM), trained to perform (zero-shot) promptable visual segmentation. Much like CLIP does for zero-shot classification, SAM achieves competitive performance compared to fully supervised segmentation models and even outperforms them on some tasks. Hence, in this article, we will cover the basics of SAM and its improvements.
Data Engine
To train a large zero-shot segmentation model with state-of-the-art performance and strong generalization, the researchers needed massive, high-quality, and diverse data beyond existing datasets, yet manually annotating every pixel is hard and time-consuming. Therefore, they created a data engine, a model-in-the-loop dataset annotation system with three stages. In the first stage, assisted-manual, humans use an interactive tool to annotate objects, assisted by a SAM pretrained on existing public datasets. In the second stage, semi-automatic, SAM proposes high-confidence masks so that humans can focus on validating them and annotating the remaining objects to increase diversity.
In the final stage, fully automatic, a grid of points was passed to SAM, only confident masks (high predicted IoU) that were stable across probability thresholds were kept, and Non-Maximum Suppression (NMS) was applied to remove duplicates. This data engine produced SA-1B, which contains 11 million images and 1.1 billion masks (covering both objects and object parts), far exceeding existing segmentation datasets such as COCO, which is on the order of 100,000 images and a million masks. Further analysis reveals that SA-1B contains more masks per image and a more even spatial distribution of masks within images compared to other datasets. They also found that SA-1B contains a higher proportion of images from middle-income countries than other major datasets, although it still underrepresents low-income countries, similar to other datasets.
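For reference, this fully automatic stage closely mirrors the automatic mask generation utility in the official segment-anything package. Below is a minimal sketch assuming that package; the checkpoint path is a placeholder and the threshold values are illustrative, not necessarily those used to build SA-1B.

```python
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a pretrained SAM (checkpoint path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")

# Prompt SAM with a grid of points, keep confident and stable masks,
# and remove duplicates with NMS -- mirroring the fully automatic stage.
mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,            # grid of point prompts per image
    pred_iou_thresh=0.88,          # keep masks with high predicted IoU (confident)
    stability_score_thresh=0.95,   # keep masks stable across threshold perturbations
    box_nms_thresh=0.7,            # NMS to drop duplicate masks
)

image = np.zeros((1024, 1024, 3), dtype=np.uint8)  # stand-in for a real RGB image
masks = mask_generator.generate(image)  # list of dicts with "segmentation", "predicted_iou", ...
```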
Segment Anything
SAM comprises an image encoder (a ViT), a prompt encoder (a Transformer), and a mask decoder that applies cross-attention between the image and prompt embeddings. The prompt encoder processes prompt embeddings (positional embeddings from coordinate points and bounding boxes, or CLIP embeddings from text prompts), learnable embeddings corresponding to the mask predictions (which I call mask queries, inspired by object queries), and a learnable IoU token. The cross-attention produces decoded image and prompt embeddings, and we take the dot product between the decoded mask queries and the decoded image embedding, upscaled to the default image size, to arrive at the mask outputs.
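To make that last step concrete, here is a minimal sketch of the dot-product decoding, with made-up tensor shapes rather than SAM's exact configuration:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes for illustration only.
B, C, H, W = 1, 256, 64, 64      # decoded image embedding (after cross-attention)
num_queries = 4                  # learnable mask queries decoded alongside the prompts

decoded_image = torch.randn(B, C, H, W)
decoded_queries = torch.randn(B, num_queries, C)

# Dot product between each query and every spatial location -> one logit map per query.
mask_logits = torch.einsum("bqc,bchw->bqhw", decoded_queries, decoded_image)

# Upscale to the model's input resolution and threshold to obtain binary masks.
masks = F.interpolate(mask_logits, size=(1024, 1024), mode="bilinear", align_corners=False) > 0
```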

The decoded IoU token is passed to a head to obtain a predicted IoU score for each mask, which can be used for thresholding. The above describes the architecture of SAM. When image embeddings can be computed beforehand, such as in dataset annotation, the prompt encoder and mask decoder can run in real time in a browser (although the original paper acknowledges that the overall model is not real time when the heavy image encoder must be run, which is one of SAM's limitations). After training on SA-1B, SAM achieved competitive performance compared to simple extensions of object detectors on zero-shot instance segmentation (where bounding boxes from transformer-based object detectors are used as prompts) and even outperformed them in human evaluations of mask quality.
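In practice, promptable prediction with a precomputed image embedding looks roughly like the sketch below, assuming the official segment-anything package; the checkpoint path, point coordinates, and IoU threshold are placeholders.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # placeholder path
predictor = SamPredictor(sam)

image = np.zeros((768, 1024, 3), dtype=np.uint8)  # stand-in for a real RGB image
predictor.set_image(image)  # runs the heavy image encoder once; the embedding is cached

# A single foreground point prompt (label 1 = foreground, 0 = background).
masks, iou_scores, _ = predictor.predict(
    point_coords=np.array([[512, 384]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return several candidate masks for an ambiguous prompt
)

# Use the predicted IoU scores to keep only confident masks.
confident_masks = masks[iou_scores > 0.9]
```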
Segment Anything 2
While SAM works relatively well on images, many segmentation tasks involve processing video frames, and SAM does not capture temporal relationships, so it performs relatively poorly there compared to fully supervised models. Hence, Segment Anything 2 (SAM2) introduces a memory bank and memory attention to condition predictions on information from previous frames. The memory encoder processes the mask outputs with convolutional layers, adds them elementwise to the image embeddings (straight from the image encoder), and fuses them with lightweight convolutional layers to obtain a spatial map. The memory bank then stores the spatial maps of the N most recent frames and up to M prompted frames, along with previous decoded mask queries (called object pointers in the original paper).
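A rough way to picture the memory bank is as a small FIFO store of per-frame memories plus a separate store for prompted frames and object pointers. The sketch below is a simplification with hypothetical names and sizes, not SAM2's actual implementation.

```python
from collections import deque

import torch


class MemoryBank:
    """Toy memory bank: spatial maps of the N most recent frames, memories of up to
    M prompted frames, and the decoded object pointers from previous frames."""

    def __init__(self, num_recent: int = 6, num_prompted: int = 1):
        self.recent = deque(maxlen=num_recent)            # spatial maps from the memory encoder
        self.prompted = deque(maxlen=num_prompted)        # memories of frames that received prompts
        self.object_pointers = deque(maxlen=num_recent)   # decoded mask queries per frame

    def add(self, spatial_map: torch.Tensor, object_pointer: torch.Tensor, is_prompted: bool):
        if is_prompted:
            self.prompted.append(spatial_map)
        else:
            self.recent.append(spatial_map)
        self.object_pointers.append(object_pointer)

    def memories(self):
        # Everything the memory attention cross-attends to for the current frame.
        return list(self.prompted) + list(self.recent), list(self.object_pointers)
```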

The stored embeddings are used in the memory attention block, where self-attention is performed over the current frame's image embeddings, followed by cross-attention to those stored memories and object pointers. The memory attention uses rotary positional encoding (RoPE), which rotates vectors by position-dependent angles so that relative positions are captured without complicated additional computations. The resulting image embeddings are called memory-conditioned frame encodings, which are passed to the mask decoder instead of the unconditioned frame encodings (the image embeddings from the image encoder). The above describes the architecture of SAM2. As we can observe, the architecture remains largely the same as the previous version, except for the memory attention, memory encoder, and memory bank just mentioned.
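For intuition about the rotary encoding, here is a minimal sketch of RoPE applied to a single vector. This is a simplified 1D version for illustration; SAM2 applies a 2D variant over the spatial axes.

```python
import torch


def rope_1d(x: torch.Tensor, position: int, base: float = 10000.0) -> torch.Tensor:
    """Rotate consecutive feature pairs of x by angles that grow with `position`.
    Dot products between rotated queries and keys then depend only on the
    difference of their positions, i.e., their relative position."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)  # one frequency per pair
    angles = position * freqs
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


q = torch.randn(64)
q_rotated = rope_1d(q, position=10)  # applied before the attention dot product
```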
However, there are some minor differences in the image encoder, prompt encoder, and mask decoder as well. The image encoder uses a simple hierarchical ViT (Hiera), pretrained as a Masked Autoencoder (MAE), i.e., trained to reconstruct the original image from a partially masked one, which lets the encoder learn spatial biases without complicated specialized modules such as the shifted-window approach. The prompt encoder processes an additional learnable token, an occlusion token, which the mask decoder further processes to predict whether the object of interest is occluded, addressing a challenge unique to video: object occlusion.
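As a rough illustration (with hypothetical dimensions and layer sizes, not SAM2's actual head), the decoded occlusion token can be mapped to a single visibility score by a small head:

```python
import torch
import torch.nn as nn

embed_dim = 256  # hypothetical token dimension

# Small head that turns the decoded occlusion token into one score per object.
occlusion_head = nn.Sequential(
    nn.Linear(embed_dim, embed_dim),
    nn.ReLU(),
    nn.Linear(embed_dim, 1),
)

decoded_occlusion_token = torch.randn(1, embed_dim)  # output of the mask decoder
object_visible = torch.sigmoid(occlusion_head(decoded_occlusion_token)) > 0.5
```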
Aside from the model architecture, SAM2 also comes with a data engine for model-in-the-loop video dataset annotation, which goes through three phases, similar to SAM. It starts with manually annotating frames at 6 FPS with assistance from SAM, moves on to utilizing an early SAM2 to propagate annotations to the in-between frames, and ends with using SAM2 with only minor refinements by the annotators. The data engine resulted in SA-V, containing 50.9K videos and 642.6K masklets, which is substantially larger than existing video segmentation datasets. Trained jointly on SA-V and image data, SAM2 achieved state-of-the-art performance in promptable video segmentation and even outperformed SAM on image segmentation in both quality and speed (likely thanks to the efficient and performant image encoding by Hiera and the high-quality dataset).
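For completeness, promptable video segmentation with the released sam2 package looks roughly like the sketch below, based on the example usage in the sam2 repository; exact function names and arguments may differ between versions, and the config, checkpoint, and video paths are placeholders.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Placeholder config and checkpoint paths from the sam2 repository.
predictor = build_sam2_video_predictor("configs/sam2.1/sam2.1_hiera_l.yaml", "sam2.1_hiera_large.pt")

with torch.inference_mode():
    state = predictor.init_state(video_path="notebooks/videos/bedroom")  # directory of frames

    # Prompt the object on the first frame with a single foreground point.
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Propagate the mask through the rest of the video using the memory bank.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()
```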
Conclusion
In this article, we covered the basics of SAM and SAM2. They reconfirmed the importance of data quantity and quality and demonstrated that leveraging prompts for high generalizability and zero-shot performance is a viable strategy for segmentation tasks as well. However, SAM2 still struggles with small objects, fast-moving objects, multiple similar-looking objects, objects in crowded scenes, and objects occluded for extended periods (essentially the cases that are challenging even for humans). Also, these models may not yet perform as well as fully supervised models on some tasks and datasets in practice, so it is important to always evaluate model performance fairly for the problem at hand.
Resources
- Gaiduk, M. 2024. SAM - Segment Anything model for promptable pixel segmentation. YouTube.
- Kirillov, A. et al. 2023. Segment Anything. ArXiv.
- Ravi, N. et al. 2024. SAM 2: Segment Anything in Images and Videos. ArXiv.
- Ryali, C. et al. 2023. Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles. ArXiv.
- ymgc3. 2024. Rotary Positional Embeddings (RoPE) とは (What are Rotary Positional Embeddings (RoPE)?). Qiita.