Segment Anything Model and Dataset (SAM and SA-1B)

To train SAM, a massive and diverse dataset was needed. The SA-1B dataset was collected using the model itself; annotators used SAM to annotate images interactively, and the newly annotated data was then used to update SAM. This process was repeated multiple times to iteratively improve both the model and the [[dataset]].


A data engine was built for creating the SA-1B dataset; it operates in three gears:
#model-assisted annotation,
#a mix of fully automatic and assisted annotation, and
#fully automatic mask creation (see the sketch below).
The final dataset includes more than 1.1 billion segmentation masks collected on about 11 million licensed and privacy-preserving images.
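The fully automatic gear corresponds roughly to the automatic mask generator released with the model's source code: SAM is prompted with a regular grid of points over the image, and the resulting masks are filtered and deduplicated. A minimal sketch, assuming the public <code>segment_anything</code> package and an illustrative checkpoint path:
<syntaxhighlight lang="python">
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a SAM checkpoint (the file name below is illustrative; use a downloaded checkpoint).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_checkpoint.pth").to("cuda")

# The generator prompts SAM with a grid of points over the image, then filters
# and deduplicates the predicted masks.
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)
# Each entry is a dict containing the binary mask ("segmentation") plus metadata
# such as area, bounding box, and a predicted IoU score; no class labels are produced.
print(len(masks))
</syntaxhighlight>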


Looking ahead, tighter coupling between understanding images at the pixel level and higher-level semantic understanding of visual content could lead to even more powerful AI systems. The Segment Anything project is a significant step forward in this direction, opening up possibilities for new applications and advancements in computer vision and AI research.
==Model Structure and Implementation==
SAM's structure consists of three components:
#A ViT-H image encoder that runs once per image and outputs an image embedding.
#A prompt encoder that embeds input prompts such as clicks or boxes.
#A lightweight transformer-based mask decoder that predicts object masks from the image embedding and prompt embeddings.
The image encoder is implemented in PyTorch and requires a GPU for efficient inference. The prompt encoder and mask decoder can either run directly with PyTorch or be converted to ONNX and run efficiently on CPU or GPU across various platforms that support ONNX runtime.
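In the released PyTorch code, this division of labor is exposed through the <code>SamPredictor</code> class: <code>set_image</code> runs the heavy image encoder once and caches the embedding, after which each <code>predict</code> call runs only the prompt encoder and mask decoder. A minimal sketch, assuming the public <code>segment_anything</code> package and an illustrative checkpoint path:
<syntaxhighlight lang="python">
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load the ViT-H variant (checkpoint file name is illustrative).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_checkpoint.pth").to("cuda")
predictor = SamPredictor(sam)

# The image encoder runs once here; the resulting embedding is cached.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Each predict() call reuses the cached embedding, so only the lightweight
# prompt encoder and mask decoder run per prompt.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),  # one click at pixel (x, y)
    point_labels=np.array([1]),           # 1 = foreground, 0 = background
    multimask_output=True,                # return several candidate masks
)
</syntaxhighlight>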
The image encoder has 632 million parameters, while the prompt encoder and mask decoder together have 4 million parameters.
The image encoder takes approximately 0.15 seconds on an NVIDIA A100 GPU, while the prompt encoder and mask decoder take about 50 milliseconds on a CPU in a browser using multithreaded SIMD execution.
SAM was trained for 3-5 days on 256 A100 GPUs.
==FAQs and Additional Information==
SAM predicts object masks only and does not generate labels. It currently supports images or individual frames extracted from videos, but not videos directly.
The source code for SAM is available on GitHub for users interested in exploring and using the model.


==References==
<references />