The Segment Anything Model (SAM) is a foundation model for image segmentation developed by Meta AI Research. Introduced by Alexander Kirillov, Eric Mintun, Nikhila Ravi, and colleagues in April 2023, SAM is designed to perform "promptable segmentation," meaning it can generate segmentation masks for any object in an image given a user-provided prompt such as a point, bounding box, or text description. The model was trained on the SA-1B dataset, which contains over 1.1 billion masks across 11 million images, making it the largest segmentation dataset ever created at the time of release. SAM was presented at the IEEE/CVF International Conference on Computer Vision (ICCV) 2023 and has since become one of the most cited papers in computer vision, with thousands of citations within its first two years.
SAM represents a shift in how segmentation models are built and used. Rather than training task-specific models for each segmentation problem (semantic, instance, panoptic), SAM provides a single general-purpose model that can handle a wide range of segmentation tasks through prompting, similar to how large language models handle diverse text tasks through prompt engineering.
Imagine you have a magic coloring tool for photos. You can point at anything in a picture (a dog, a tree, a cup) and the tool instantly draws a perfect outline around it, separating it from everything else. That is basically what SAM does. You show it a photo and tell it what you are interested in (by clicking on it, drawing a box around it, or even describing it in words), and SAM figures out exactly where that thing is in the picture and draws a boundary around it. The really cool part is that SAM can do this for almost anything in almost any photo, even things it has never seen before, because it was trained on over a billion examples of objects being outlined in millions of photos.
Before SAM, image segmentation models were typically trained for specific tasks or datasets. A model trained to segment medical images would not generalize well to satellite imagery, and a model trained for autonomous driving scenes would struggle with indoor photographs. Each new application domain required collecting labeled data, designing task-specific architectures, and training from scratch.
The SAM project drew inspiration from the success of foundation models in natural language processing (NLP). Models like GPT and BERT demonstrated that training a single large model on massive amounts of data could produce systems capable of zero-shot and few-shot generalization to a wide range of tasks. Kirillov et al. aimed to build an analogous foundation model for segmentation by addressing three interconnected components: a promptable segmentation task, a model architecture capable of handling that task, and a data engine to produce training data at sufficient scale.
SAM's architecture consists of three main components: an image encoder, a prompt encoder, and a mask decoder. The design separates the computationally expensive image encoding from the lightweight prompt processing and mask prediction, allowing a single image embedding to be reused across multiple prompts.
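This decoupling is exposed directly in the released `segment_anything` Python package: the image is embedded once, and each subsequent prompt invokes only the lightweight prompt encoder and mask decoder. A minimal usage sketch (the checkpoint filename and the click coordinates are placeholders):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load SAM with the ViT-H encoder (checkpoint path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.zeros((768, 1024, 3), dtype=np.uint8)  # stand-in for an RGB photo
predictor.set_image(image)  # runs the expensive ViT-H encoder exactly once

# Each prompt below reuses the cached image embedding; only the light
# prompt encoder and mask decoder run per click.
for x, y in [(100, 200), (400, 300), (700, 500)]:
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]),
        point_labels=np.array([1]),  # 1 = foreground click, 0 = background
        multimask_output=True,       # return SAM's three candidate masks
    )
```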
The image encoder is a Vision Transformer (ViT) pre-trained using Masked Autoencoders (MAE). The default configuration uses ViT-H/16 (ViT-Huge with a patch size of 16), which contains approximately 632 million parameters. The encoder processes input images at a resolution of 1024 x 1024 pixels and produces a 64 x 64 spatial embedding (a 16x downscaled representation). The ViT-H variant uses 14 x 14 windowed attention with four equally spaced global attention blocks. After the transformer, a 1 x 1 convolution reduces the channel dimension to 256, followed by a 3 x 3 convolution with layer normalization.
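The shape arithmetic can be checked with a small sketch. This is illustrative rather than SAM's actual code: the transformer output is a stand-in tensor at ViT-H's hidden size of 1280, and `GroupNorm(1, C)` stands in for the channel-wise layer normalization used in the release:

```python
import torch
import torch.nn as nn

patch_size = 16
image = torch.randn(1, 3, 1024, 1024)
grid = image.shape[-1] // patch_size              # 1024 / 16 = 64

# Stand-in for the ViT-H transformer output (hidden size 1280).
vit_features = torch.randn(1, 1280, grid, grid)

neck = nn.Sequential(
    nn.Conv2d(1280, 256, kernel_size=1, bias=False),  # 1 x 1 channel reduction
    nn.GroupNorm(1, 256),                             # stands in for LayerNorm2d
    nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False),
    nn.GroupNorm(1, 256),
)
embedding = neck(vit_features)
print(embedding.shape)  # torch.Size([1, 256, 64, 64])
```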
Ablation studies in the original paper showed that ViT-H improved substantially over ViT-B (ViT-Base) but offered only marginal gains over ViT-L (ViT-Large), suggesting diminishing returns from further scaling the encoder.
The prompt encoder handles two types of prompts: sparse prompts and dense prompts.
Sparse prompts include points, bounding boxes, and text:
| Prompt type | Encoding method |
|---|---|
| Points | Positional encoding of the point's location summed with a learned embedding indicating foreground or background |
| Bounding boxes | Embedding pair representing the top-left and bottom-right corners, each using positional encoding plus a learned corner embedding |
| Text | Text encoder from CLIP converts free-form text into text embeddings |
Dense prompts consist of input masks (for example, from a previous prediction round). Mask prompts are input at 4x lower resolution than the image and processed through two 2 x 2 stride-2 convolutions, followed by a final 1 x 1 convolution that maps to the 256-channel embedding dimension, before being added element-wise to the image embedding.
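A minimal sketch of this dense-prompt path, with normalization details simplified and the paper's channel widths (4, 16, then 256):

```python
import torch
import torch.nn as nn

# 1024 x 1024 image -> mask prompt at 4x lower resolution = 256 x 256.
mask_prompt = torch.randn(1, 1, 256, 256)

downscale = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=2, stride=2),    # 256 -> 128
    nn.GELU(),
    nn.Conv2d(4, 16, kernel_size=2, stride=2),   # 128 -> 64
    nn.GELU(),
    nn.Conv2d(16, 256, kernel_size=1),           # match 256 embedding channels
)

image_embedding = torch.randn(1, 256, 64, 64)
conditioned = image_embedding + downscale(mask_prompt)  # element-wise addition
```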
The mask decoder is a lightweight module that takes the image embedding and prompt embeddings and produces segmentation masks. It uses a modified Transformer decoder with two layers. Each decoder layer performs four operations in sequence:
1. Self-attention on the tokens (the prompt embeddings plus learned output tokens).
2. Cross-attention from the tokens (as queries) to the image embedding.
3. A point-wise MLP that updates each token.
4. Cross-attention from the image embedding (as queries) back to the tokens.
The embedding dimension throughout the decoder is 256. After the two decoder layers, the model upsamples the image embedding and uses MLP heads to map each output token to a dynamic linear classifier that generates a mask at the desired spatial resolution.
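The dynamic-classifier step can be sketched as follows. The shapes match the description above, but the code is simplified: SAM's actual decoder uses three-layer hypernetwork MLPs and additional normalization:

```python
import torch
import torch.nn as nn

B, C, H, W = 1, 256, 64, 64
num_masks = 3

# Output tokens after the two decoder layers (one per candidate mask).
mask_tokens = torch.randn(B, num_masks, C)

# Upsample the image embedding 4x (64 -> 256) with transposed convolutions.
upscale = nn.Sequential(
    nn.ConvTranspose2d(C, C // 4, kernel_size=2, stride=2),
    nn.GELU(),
    nn.ConvTranspose2d(C // 4, C // 8, kernel_size=2, stride=2),
)
feats = upscale(torch.randn(B, C, H, W))        # (1, 32, 256, 256)

# An MLP maps each output token to the weights of a linear classifier.
to_classifier = nn.Sequential(nn.Linear(C, C), nn.GELU(), nn.Linear(C, C // 8))
weights = to_classifier(mask_tokens)            # (1, 3, 32)

# Dotting the classifier weights with every spatial location yields logits.
logits = torch.einsum("bnc,bchw->bnhw", weights, feats)  # (1, 3, 256, 256)
```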
Ambiguity-aware output: A single prompt (for example, a single point click) can be ambiguous because it might refer to different objects at different scales. To handle this, SAM predicts multiple output masks (three by default) for each prompt, along with a confidence score (estimated IoU) for each mask. During training, only the mask with the minimum loss is backpropagated, allowing the model to learn to produce diverse, plausible interpretations of ambiguous prompts.
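At inference time, the predicted IoU scores give a natural way to pick one of the candidates; a minimal sketch with illustrative values:

```python
import torch

mask_logits = torch.randn(3, 256, 256)            # three candidate masks
iou_predictions = torch.tensor([0.62, 0.91, 0.74])

best = iou_predictions.argmax()                   # most confident candidate
final_mask = mask_logits[best] > 0.0              # threshold logits to binary
```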
SAM is trained using a combination of focal loss and dice loss. The training procedure simulates an interactive segmentation setup by randomly sampling prompts across 11 rounds per mask, mimicking how a human annotator would iteratively refine selections. The mask decoder processes prompts in approximately 50 milliseconds in a web browser, enabling real-time interactive use.
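A sketch of this training loss: per-candidate focal plus dice loss, combined in the 20:1 focal-to-dice ratio reported in the paper's appendix, with only the minimum-loss candidate backpropagated:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0):
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * target + (1 - p) * (1 - target)    # probability of the true class
    return ((1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, target, eps=1.0):
    p = torch.sigmoid(logits).flatten()
    t = target.flatten()
    return 1 - (2 * (p * t).sum() + eps) / (p.sum() + t.sum() + eps)

def sam_loss(pred_logits, target):
    # pred_logits: (3, H, W) candidate masks; target: (H, W) float mask.
    losses = torch.stack([
        20.0 * focal_loss(m, target) + dice_loss(m, target)
        for m in pred_logits
    ])
    return losses.min()      # only the best candidate receives gradients

pred = torch.randn(3, 256, 256, requires_grad=True)
gt = (torch.rand(256, 256) > 0.5).float()
sam_loss(pred, gt).backward()
```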
The SA-1B (Segment Anything 1 Billion) dataset is the training dataset created specifically for SAM. At the time of release, it was the largest segmentation dataset ever assembled.
| Property | Value |
|---|---|
| Total images | 11 million |
| Total masks | 1.1 billion |
| Average masks per image | ~100 |
| Image source | Licensed, privacy-protecting photographs |
| Comparison to prior datasets | 400x more masks than any existing segmentation dataset |
The dataset was created through a three-stage "data engine" that progressively reduced the need for human annotation.
The data engine is one of SAM's most significant contributions. It is a bootstrapping loop in which the model and the dataset improve iteratively: a better model annotates data faster, and the additional data trains a better model.
Stage 1: Assisted-manual annotation. Professional annotators used a browser-based interactive segmentation tool powered by an early version of SAM to label masks by clicking foreground and background points. As the model improved over the course of this stage, the average annotation time per mask decreased from 34 seconds to 14 seconds. This stage produced 4.3 million masks from 120,000 images.
Stage 2: Semi-automatic annotation. SAM was used to automatically generate masks for a subset of objects in each image by prompting it with a grid of candidate points. Human annotators then focused on annotating the remaining objects that the model missed, increasing mask diversity. This stage yielded 5.9 million masks from 180,000 images. The average number of masks per image increased from 44 to 72 compared to Stage 1.
Stage 3: Fully automatic annotation. The fully trained model generated masks autonomously using a 32 x 32 regular grid of foreground points as prompts. Confidence-based filtering removed low-quality predictions, and non-maximum suppression (NMS) eliminated duplicate masks. This stage produced the bulk of the dataset, yielding an average of approximately 100 high-quality masks per image across the full 11 million images.
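The released package exposes this pipeline as `SamAutomaticMaskGenerator`. A usage sketch follows; the threshold values shown are the package's defaults, not figures from the paper:

```python
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,           # 32 x 32 grid of foreground point prompts
    pred_iou_thresh=0.88,         # confidence filter on predicted IoU
    stability_score_thresh=0.95,  # drop masks unstable under rethresholding
    box_nms_thresh=0.7,           # NMS removes duplicate masks
)
image = np.zeros((768, 1024, 3), dtype=np.uint8)  # stand-in RGB image
masks = generator.generate(image)
# Each entry is a dict with "segmentation", "predicted_iou", "area", ...
```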
The SA-1B images were sourced from a licensed image provider and underwent face-blurring to protect privacy. Analysis of the dataset showed that automatically generated masks had comparable quality to professionally annotated masks, with 94% of mask pairs having IoU greater than 90% when compared to professional re-annotations. The dataset covers a diverse range of geographic regions and image content.
One of SAM's defining features is its ability to generalize to new tasks and domains without additional training. The original paper evaluated SAM on several zero-shot transfer benchmarks.
On a benchmark of 23 diverse datasets, SAM outperformed the prior state-of-the-art interactive segmentation method RITM on 16 of 23 datasets in terms of mIoU. When using an "oracle" selector that picks the best of SAM's three predicted masks, SAM surpassed RITM on all 23 datasets. Human evaluators consistently rated SAM's mask quality higher than RITM's, with average ratings between 7 and 9 on a 10-point scale, even on datasets where SAM trailed RITM by automatic metrics.
SAM was evaluated on the BSDS500 edge detection benchmark by generating 768 masks from a 16 x 16 prompt grid and extracting edges. Despite not being trained for edge detection, SAM produced reasonable edge maps and achieved an ODS (Optimal Dataset Scale) F-measure of 0.768 and an AP (Average Precision) of 0.794. These scores exceeded classical methods but fell short of state-of-the-art supervised edge detectors.
On the LVIS dataset, SAM achieved a mask AR@1000 (Average Recall at 1000 proposals) of 59.3, outperforming ViTDet-based detector baselines on medium and large objects, as well as rare and common object categories.
When composed with a ViTDet object detector, SAM achieved 46.5 AP on COCO and 44.7 AP on LVIS for instance segmentation. Human studies found that SAM produced qualitatively superior mask boundaries compared to the detector's built-in segmentation head, even when automatic metrics were slightly lower.
SAM 2 was introduced by Nikhila Ravi, Valentin Gabeur, and colleagues from Meta AI in July 2024. It extends the original SAM to handle video segmentation while also improving image segmentation performance.
SAM 2 replaces the original ViT encoder with a Hiera image encoder, a hierarchical vision transformer pre-trained with MAE. The Hiera encoder enables multiscale feature extraction and processes each video frame independently.
The key architectural addition for video support is a memory attention module. This component uses stacked transformer blocks to condition the current frame's features on information from previous frames. It performs self-attention followed by cross-attention to memories stored in a memory bank.
| Component | Function |
|---|---|
| Image encoder (Hiera) | Extracts per-frame visual features using a hierarchical ViT |
| Memory attention | Conditions current frame features on past frames via cross-attention |
| Memory encoder | Fuses mask predictions with frame embeddings through convolutions |
| Memory bank | FIFO queue storing up to N recent frame memories, M prompted frame memories, and object pointer vectors |
| Prompt encoder | Same as SAM; accepts clicks, boxes, or masks |
| Mask decoder | Same as SAM with an added occlusion prediction head |
The memory bank stores spatial feature maps from recent frames in a first-in, first-out (FIFO) queue, along with lightweight "object pointer" vectors that carry semantic information about tracked objects. An occlusion prediction head was added to the mask decoder to determine whether the target object is visible in the current frame, allowing SAM 2 to handle temporary disappearances.
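A toy sketch of the memory-bank bookkeeping described above, with illustrative capacities and no claim to match Meta's implementation:

```python
from collections import deque
import torch

N_RECENT, M_PROMPTED = 6, 2                 # illustrative capacities

recent = deque(maxlen=N_RECENT)             # FIFO: oldest frame evicted first
prompted = {}                               # prompted-frame memories are kept
pointers = deque(maxlen=N_RECENT)           # lightweight object pointer vectors

def store(frame_idx, features, pointer, was_prompted):
    if was_prompted and len(prompted) < M_PROMPTED:
        prompted[frame_idx] = features
    else:
        recent.append(features)             # deque drops the oldest entry
    pointers.append(pointer)

def memory_context():
    # Everything the memory-attention module cross-attends to.
    feats = list(prompted.values()) + list(recent)
    return torch.stack(feats), list(pointers)

for t in range(8):
    store(t, torch.randn(256, 64, 64), torch.randn(256), was_prompted=(t == 0))
feats, ptrs = memory_context()              # feats: (7, 256, 64, 64)
```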
SAM 2 was trained on the SA-V (Segment Anything Video) dataset, which was built through a three-phase data engine similar to SAM's approach.
| Property | Value |
|---|---|
| Total videos | 50,900 |
| Average video duration | 14 seconds |
| Manual masklets | 190,900 |
| Automatic masklets | 451,700 |
| Total masklets | 642,600 |
| Total masks | 35.5 million |
| Scale vs. prior VOS datasets | ~53x larger |
| Geographic coverage | 47 countries |
The three-phase data engine improved annotation efficiency progressively: Phase 1 (frame-by-frame SAM assistance) took 37.8 seconds per frame, Phase 2 (SAM 2 mask propagation) reduced this to 7.4 seconds per frame (a 5.1x speedup), and Phase 3 (full SAM 2 with interactive refinement) further reduced it to 4.5 seconds per frame (an 8.4x speedup over Phase 1).
SAM 2 achieved better accuracy on video segmentation benchmarks while requiring approximately 3x fewer user interactions than prior approaches. In semi-supervised video object segmentation (VOS) evaluation across 17 datasets, SAM 2 outperformed state-of-the-art methods including XMem, Cutie, and DEVA by 8 to 13 percentage points in the J&F metric (the mean of region similarity J and contour accuracy F).
For image segmentation, SAM 2 achieved higher accuracy than SAM while running approximately 6x faster. Its 1-click mIoU on the 23-dataset SAM benchmark improved from 58.9% (SAM) to 61.4% (SAM 2) when trained on combined video and image data.
Meta released SAM 3 in November 2025, extending the model family with open-vocabulary concept segmentation. SAM 3 accepts text prompts in the form of open-vocabulary short noun phrases and image exemplar prompts, enabling it to find and segment all instances of a concept in an image or video without being limited to a fixed label set.
SAM 3 was trained using an automated data engine that annotated over 4 million unique concepts, creating the largest open-vocabulary segmentation dataset to date. On Meta's SA-CO benchmark (containing 270,000 unique concepts), SAM 3 achieved 75 to 80 percent of human performance and doubled the accuracy of prior systems in both image and video promptable concept segmentation.
Alongside SAM 3, Meta released SAM 3D, a suite of models for 3D object and human reconstruction from single images.
The original SAM's ViT-H image encoder contains approximately 632 million parameters, making it computationally expensive for mobile and edge deployment. Several lightweight variants have been developed to address this.
| Model | Authors | Year | Image encoder | Parameters | Speed (per image) | Approach |
|---|---|---|---|---|---|---|
| SAM (original) | Kirillov et al. | 2023 | ViT-H | ~632M | ~500ms (GPU) | MAE-pretrained ViT |
| FastSAM | Zhao et al. | 2023 | YOLOv8 | ~68M | ~40ms | CNN-based; YOLOv8 + YOLACT |
| MobileSAM | Zhang et al. | 2023 | TinyViT | ~9.66M | ~10ms | Knowledge distillation from ViT-H |
| EfficientSAM | Xiong et al. | 2024 | ViT-Ti / ViT-S | ~10M / ~25M | Varies | SAMI pretraining (masked image distillation) |
FastSAM (Zhao et al., 2023) takes a fundamentally different approach from SAM by replacing the entire encoder-decoder pipeline with a CNN-based architecture. It decomposes the segment anything task into two stages: (1) all-instance segmentation using a YOLOv8-seg model trained on only 2% of the SA-1B dataset, and (2) prompt-guided selection of the relevant mask. FastSAM achieves approximately 50x speedup over SAM. However, it tends to produce lower-quality masks, particularly for small objects and fine boundaries.
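The prompt-guided selection stage can be sketched over precomputed all-instance masks (a stand-in for the YOLOv8-seg output). Preferring the smallest containing mask for a point prompt is one reasonable tie-break, not necessarily FastSAM's exact rule:

```python
import numpy as np

def select_by_point(masks, x, y):
    # masks: (num_instances, H, W) boolean array of all-instance masks.
    hits = [i for i, m in enumerate(masks) if m[y, x]]
    if not hits:
        return None
    return min(hits, key=lambda i: masks[i].sum())   # smallest containing mask

def select_by_box(masks, x0, y0, x1, y1):
    box = np.zeros_like(masks[0])
    box[y0:y1, x0:x1] = True
    ious = [(m & box).sum() / (m | box).sum() for m in masks]
    return int(np.argmax(ious))

masks = np.zeros((2, 100, 100), dtype=bool)
masks[0, 10:60, 10:60] = True                        # large object
masks[1, 20:40, 20:40] = True                        # small object inside it
print(select_by_point(masks, 30, 30))                # -> 1 (the smaller mask)
```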
MobileSAM (Zhang et al., 2023) keeps SAM's original prompt encoder and mask decoder but replaces the ViT-H image encoder with TinyViT, a compact vision transformer whose encoder has approximately 5.78 million parameters (the full MobileSAM model totals roughly 9.66 million, as in the table above). The key contribution is a "decoupled distillation" strategy that transfers knowledge from SAM's ViT-H encoder to TinyViT. This approach uses less than 1% of the computational resources required by coupled distillation while achieving superior performance (mIoU of 0.75 versus 0.72 for coupled distillation). MobileSAM is approximately 60x smaller than SAM and around 5x faster than FastSAM.
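A schematic of decoupled, encoder-only distillation using stand-in modules with SAM-shaped outputs; the real setup regresses TinyViT against frozen ViT-H features, with the shared mask decoder excluded from the loss:

```python
import torch
import torch.nn.functional as F

# Stand-in encoders with SAM-shaped outputs (B, 256, 64, 64).
teacher = torch.nn.Conv2d(3, 256, kernel_size=16, stride=16)  # frozen "ViT-H"
student = torch.nn.Conv2d(3, 256, kernel_size=16, stride=16)  # trainable "TinyViT"
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

def distillation_step(images):
    with torch.no_grad():
        target = teacher(images)        # teacher embedding as regression target
    loss = F.mse_loss(student(images), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

distillation_step(torch.randn(2, 3, 1024, 1024))
```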
EfficientSAM (Xiong et al., 2024) retains SAM's encoder-decoder architecture but uses lightweight encoders (ViT-Tiny or ViT-Small) trained with a novel SAMI (SAM-based Masked Image) pretraining strategy. In SAMI pretraining, the SAM ViT-H encoder generates feature embeddings that serve as reconstruction targets for the lightweight encoder, effectively distilling knowledge through masked image modeling. EfficientSAM was published at CVPR 2024.
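A compact, heavily simplified sketch of the SAMI idea: mask most patches, encode only the visible ones with the lightweight student, and regress the frozen teacher's features at the masked positions. SAMI's actual cross-attention decoder is replaced here by a crude mean-pooled context plus a shared learnable query:

```python
import torch
import torch.nn.functional as F

B, N, D, F_DIM = 2, 196, 128, 256
teacher = torch.nn.Linear(D, F_DIM)       # frozen SAM-encoder stand-in
student = torch.nn.Linear(D, F_DIM)       # lightweight-encoder stand-in
query = torch.zeros(F_DIM, requires_grad=True)  # learnable mask-token stand-in
mask_ratio = 0.75

patches = torch.randn(B, N, D)            # patch embeddings of an image
with torch.no_grad():
    target = teacher(patches)             # teacher features for every patch

keep = int(N * (1 - mask_ratio))
idx = torch.rand(B, N).argsort(dim=1)     # random patch permutation per image
vis_idx, masked_idx = idx[:, :keep], idx[:, keep:]

visible = torch.gather(patches, 1, vis_idx[..., None].expand(-1, -1, D))
encoded = student(visible)                # encode only the visible patches

# Predict features at masked positions from the pooled visible context.
pred = query + encoded.mean(dim=1, keepdim=True)          # (B, 1, F_DIM)
masked_target = torch.gather(target, 1, masked_idx[..., None].expand(-1, -1, F_DIM))
loss = F.mse_loss(pred.expand_as(masked_target), masked_target)
loss.backward()
```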
SAM's zero-shot generalization has led to its adoption across many domains.
SAM and its successors have been widely evaluated for medical image segmentation tasks, including organ segmentation in CT scans, tumor detection in MRI, retinal vessel segmentation, and surgical tool segmentation. SAM 2's video tracking capabilities have been adapted for automated 3D medical image segmentation by treating CT/MRI slice sequences as video frames. Specialized adaptations such as MedSAM and MedicoSAM have been developed to improve performance on medical data through domain-specific fine-tuning. Results are mixed: SAM performs reasonably well on structures with clear boundaries (such as benign tumors) but struggles with objects that have unclear boundaries (such as malignant tumors that infiltrate surrounding tissue).
In remote sensing, SAM has been applied to land cover mapping, urban monitoring, cropland parcel extraction, and solar panel segmentation. The SAMRS dataset, created using SAM, contains 105,090 remote sensing images with 1,668,241 instances. SAM works well for large, well-defined structures but has difficulty with slender objects like roads and farm parcel boundaries.
SAM has been integrated into robotic perception pipelines for object grasping, scene understanding, and manipulation tasks. In autonomous driving, SAM has been used for generating segmentation datasets from sensor data and for fusing segmentation with LiDAR and radar modalities. Its real-time interactive capabilities make it suitable for human-robot collaborative scenarios.
SAM has also been applied to video editing (object removal and replacement), augmented reality (real-time object masking), agriculture (crop and weed segmentation), environmental monitoring (wildlife tracking, deforestation mapping), and content creation (automated background removal and image compositing).
Despite its broad capabilities, SAM has several well-documented limitations.
Fine-grained segmentation. SAM struggles with objects that have intricate structures, fine details, or sharp boundaries. Masks for objects with thin protrusions, branching structures (such as retinal blood vessels or tree branches), or complex textures tend to be overly smooth or incomplete.
Low-contrast and camouflaged objects. Objects that blend into their backgrounds or have unclear boundaries pose problems for SAM. This includes camouflaged animals, shadow regions, and medical structures like malignant tumors that lack well-defined edges.
Small objects. SAM's performance degrades on very small objects, particularly when using automatic mask generation with point grid prompts. The 16x spatial downsampling in the image encoder limits the spatial resolution available for detecting fine-scale objects.
Domain-specific performance gaps. While SAM generalizes well across many domains, its zero-shot performance on specialized datasets (medical imaging, satellite imagery, microscopy) often falls short of domain-specific models that have been trained or fine-tuned on relevant data.
Semantic understanding. SAM produces class-agnostic masks; it segments objects without identifying what they are. This limits its usefulness for tasks that require both segmentation and classification. SAM 3 addresses this limitation partially through open-vocabulary concept segmentation.
Adversarial robustness. Studies have shown that SAM is moderately resilient against FGSM (Fast Gradient Sign Method) adversarial attacks but vulnerable to PGD (Projected Gradient Descent) attacks, even with very small perturbation magnitudes.
Computational cost. The original SAM with ViT-H requires substantial GPU memory and compute for the image encoder pass, making it impractical for edge and mobile deployment without using a lightweight variant.
SAM has had a broad impact on the computer vision community. The SA-1B dataset has been described as the "ImageNet of segmentation," analogous to how the ImageNet dataset catalyzed progress in image classification a decade earlier. The release of SAM's weights and code under the Apache 2.0 license has enabled extensive follow-up research, with thousands of papers building on or evaluating SAM within two years of its release.
The model demonstrated that the foundation model paradigm, previously validated mainly in NLP and multimodal learning, could be extended to dense prediction tasks like segmentation. This has inspired similar efforts in other areas of computer vision, including depth estimation, object detection, and 3D reconstruction.