The Segment Anything Model (SAM) is a promptable image segmentation foundation model developed by Meta AI Research (FAIR). It was introduced in the paper "Segment Anything" by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick. The paper was published at the IEEE/CVF International Conference on Computer Vision (ICCV) in October 2023, where it received the Best Paper Honorable Mention award. SAM was publicly released alongside its code and model weights in April 2023.
SAM represents a paradigm shift in segmentation. Traditional segmentation models are trained for specific categories or tasks, such as segmenting cars, people, or organs. SAM, by contrast, is a general-purpose segmentation system that can segment any object in any image when given an appropriate prompt. This "promptable" design draws on the success of foundation models in natural language processing, where models like GPT and BERT demonstrated that large-scale pretraining on broad data produces systems that generalize to many downstream tasks without task-specific fine-tuning.
The project introduced three tightly connected contributions: a new promptable segmentation task, the SAM model architecture, and the SA-1B dataset containing over 1.1 billion segmentation masks across 11 million images. SAM demonstrated strong zero-shot transfer performance, often matching or exceeding fully supervised models that were trained on specific benchmarks.
The "promptable segmentation" task is the conceptual foundation of the SAM project. In this formulation, the model must return a valid segmentation mask for any prompt given at inference time. A prompt can take several forms: one or more foreground or background points, a rough bounding box, a coarse input mask, or free-form text describing the target.
The task is intentionally designed to be general. Rather than predicting a fixed set of semantic categories, the model outputs a binary mask for whatever the user indicates through their prompt. This makes SAM applicable to a wide range of scenarios without retraining.
A key design consideration is ambiguity. A single point on an image can be ambiguous: clicking on a person's shirt could refer to the shirt, the person, or the entire group of people. SAM addresses this by predicting multiple valid masks at different levels of granularity for each prompt and ranking them by predicted quality. The model outputs three candidate masks (whole, part, and subpart) along with confidence scores, allowing downstream applications to select the most appropriate segmentation.
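The selection step described above can be sketched in a few lines. This is a hypothetical illustration of how a downstream application might consume SAM's ambiguity-aware output; `select_best_mask` and the toy masks are not part of the real API.

```python
# Hypothetical sketch: SAM returns three candidate masks (whole / part /
# subpart) with predicted quality scores; a downstream consumer keeps the
# highest-scoring one. Mask contents here are toy 2x2 binary grids.

def select_best_mask(masks, scores):
    """Return the (mask, score) pair with the highest predicted quality."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return masks[best], scores[best]

candidates = [
    [[1, 1], [1, 1]],  # whole object
    [[1, 1], [0, 0]],  # part
    [[1, 0], [0, 0]],  # subpart
]
scores = [0.91, 0.84, 0.63]  # predicted confidence per candidate

mask, score = select_best_mask(candidates, scores)
```

Applications that need a specific granularity (for example, always the largest region) can instead rank the candidates by their own criterion rather than the model's score.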
SAM's architecture is divided into three components: an image encoder, a prompt encoder, and a mask decoder. The design is deliberately factored so that the computationally heavy image encoder runs once per image, while the lightweight prompt encoder and mask decoder can run many times with different prompts at interactive speeds.
The image encoder is a Vision Transformer (ViT) pretrained using Masked Autoencoders (MAE). In its default and largest configuration, SAM uses a ViT-H/16 backbone, meaning the transformer processes image patches of 16x16 pixels.
The encoder accepts input images at a resolution of 1024x1024 pixels. It processes the image through the ViT backbone, which uses 14x14 windowed attention with four equally spaced global attention blocks, and outputs a dense feature embedding of size 64x64x256. This 256-channel feature map is a 16x downscaled representation of the original image. The image encoder is the most computationally expensive component of SAM, accounting for the vast majority of the model's parameters and inference time. Once computed, the resulting embedding can be cached and reused across many different prompts.
The prompt encoder translates user-provided prompts into embeddings that the mask decoder can process. SAM handles two categories of prompts: sparse prompts (points, boxes, and text) and dense prompts (masks).
For sparse prompts, points and boxes are represented using positional encodings summed with learned type embeddings that distinguish foreground points, background points, and box corners. The positional encoding is based on random Fourier features, where normalized (x, y) coordinates are multiplied by a random matrix of spatial frequencies. Text prompts, when used, are encoded through a text encoder from CLIP.
For dense prompts, input masks are downscaled to a lower resolution and processed through a small series of convolutional layers. The resulting dense embedding is then summed element-wise with the image embedding before being passed to the mask decoder.
The mask decoder is a lightweight module that combines the image embedding with prompt embeddings to produce segmentation masks. It is built on a modified two-layer transformer decoder that uses two-way cross-attention: prompt tokens attend to the image embedding, and the image embedding attends back to the prompt tokens.
The decoder operates as follows:

1. Learned output tokens (three mask tokens plus an IoU prediction token) are concatenated with the prompt embeddings.
2. Each of the two decoder layers runs self-attention over the tokens, cross-attention from tokens to the image embedding, and cross-attention from the image embedding back to the tokens.
3. The updated image embedding is upsampled 4x by two transposed convolution layers.
4. Each mask token is passed through a small MLP and combined with the upsampled embedding via a spatially pointwise product to produce mask logits, while the IoU token is mapped through a separate MLP head to predict each mask's quality.
This lightweight design yields approximately 50 milliseconds of prompt-to-mask latency when the image embedding is already cached, enabling real-time interactive segmentation in a web browser.
SAM is available in three sizes based on different ViT backbone configurations. All three variants share the same prompt encoder and mask decoder; they differ only in the image encoder.
| Variant | Image Encoder | Total Parameters | Checkpoint Size | Relative Performance |
|---|---|---|---|---|
| SAM ViT-B | ViT-Base | ~91 million | 375 MB | Good baseline performance |
| SAM ViT-L | ViT-Large | ~308 million | 1.25 GB | Strong performance, close to ViT-H |
| SAM ViT-H | ViT-Huge (default) | ~636 million | 2.56 GB | Highest accuracy |
The image encoder dominates the parameter count. In the ViT-H variant, the image encoder alone contains roughly 632 million of the total 636 million parameters. The prompt encoder and mask decoder together add only about 4 million parameters.
ViT-H provides the highest accuracy across benchmarks. ViT-L delivers performance very close to ViT-H with fewer parameters and faster inference. ViT-B is the most compact option and is suitable for resource-constrained environments, though it shows notably lower accuracy on challenging segmentation scenarios.
SAM was trained on SA-1B (Segment Anything 1 Billion), the largest segmentation dataset ever created at the time of its release. SA-1B contains 1.1 billion segmentation masks across 11 million high-resolution, licensed, and privacy-respecting images.
| Statistic | Value |
|---|---|
| Total images | 11,000,000 |
| Total masks | 1,100,000,000 |
| Average masks per image | ~100 |
| Image resolution | Shortest side downsampled to 1,500 pixels |
| Masks generated automatically | 99.1% |
| Geographic coverage | Images from over 200 countries |
| Dataset size compared to OpenImages v5 | 6x larger |
The SA-1B dataset was built through a novel three-stage "data engine" that used SAM itself in an iterative model-in-the-loop annotation process.
Stage 1: Assisted-Manual. Professional annotators labeled masks by clicking foreground and background points using a browser-based interactive segmentation tool powered by an early version of SAM. Annotators could refine masks with pixel-precise brush and eraser tools. This stage produced approximately 4.3 million masks from 120,000 images.
Stage 2: Semi-Automatic. SAM was improved using Stage 1 data. The updated model automatically generated masks for a subset of objects in each image, and annotators focused on labeling the remaining objects that the model missed. This process increased mask diversity by encouraging annotators to address less obvious objects. This stage added another 5.9 million masks from 180,000 images.
Stage 3: Fully Automatic. Using the further-improved model, SAM was prompted with a dense regular grid of 32x32 foreground points per image. Ambiguous points generated multiple masks, which were filtered and deduplicated using non-maximum suppression. This fully automatic pipeline was applied to all 11 million images, generating the bulk of the 1.1 billion masks.
All images in SA-1B were licensed and sourced with privacy in mind. Faces and license plates were automatically blurred using the RetinaFace detection model. The dataset contains no captions, photographer names, or other personally identifying metadata. A user removal request mechanism was provided, and the images span a geographically diverse distribution across more than 200 countries.
SAM's model weights and code were released under the Apache 2.0 license. The SA-1B dataset was released for research purposes.
SAM was trained using a combination of the SA-1B data engine stages. The training process used the AdamW optimizer. The image encoder was initialized from MAE-pretrained ViT weights. Training simulated an interactive segmentation scenario where prompts were sampled from ground-truth masks: the model received iterative point prompts, with each new point placed at the largest error region of the previous prediction.
The training loss combined focal loss and dice loss for the mask prediction and mean squared error for the IoU prediction head. The model was trained to produce three output masks per prompt to handle ambiguity, and only the mask with the lowest loss was used for backpropagation during training.
SAM was evaluated across a diverse zero-shot benchmark suite spanning 23 segmentation datasets. The evaluation used both mean Intersection over Union (mIoU) and human quality ratings.
Using a single foreground point prompt, SAM achieved competitive results across the 23-dataset evaluation suite. SAM outperformed the strong RITM interactive segmentation baseline on 16 of the 23 datasets in terms of mIoU. On the remaining datasets, the performance gap was often small. Human raters consistently rated SAM's masks highly, even in cases where mIoU was lower, because the ambiguity-aware multi-mask output often produced valid segmentations that simply differed from the ground truth annotation.
When evaluated on the BSDS500 edge detection benchmark, SAM achieved a recall of 0.928 at 50% precision without any edge-specific training. This result matched early deep learning methods that were explicitly trained for edge detection, demonstrating the richness of SAM's learned representations.
SAM's automatic mask generation capability was evaluated as an object proposal generator on the LVIS dataset. SAM generated high-quality proposals that outperformed the ViTDet-H baseline, particularly at higher IoU thresholds where mask quality matters most.
The following table compares SAM with notable prior segmentation approaches across several dimensions.
| Model | Year | Segmentation Type | Training Data | Promptable? | Zero-Shot Transfer? | Key Strength |
|---|---|---|---|---|---|---|
| U-Net | 2015 | Semantic | Task-specific (small datasets) | No | No | Biomedical segmentation with limited data |
| Mask R-CNN | 2017 | Instance | COCO (~118K images) | No | No | Two-stage instance segmentation |
| DeepLabv3+ | 2018 | Semantic | Task-specific | No | No | Atrous convolution, multi-scale features |
| RITM | 2022 | Interactive | COCO + LVIS | Yes (clicks) | Limited | Strong interactive baseline |
| Mask2Former | 2022 | Universal | Task-specific | No | No | Unified semantic, instance, and panoptic |
| SAM (ViT-H) | 2023 | Promptable (any object) | SA-1B (11M images, 1.1B masks) | Yes (points, boxes, masks, text) | Yes | General-purpose, zero-shot, real-time |
In July 2024, Meta AI released SAM 2 (Segment Anything Model 2), extending the Segment Anything concept from images to videos. The paper, authored by Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Radle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollar, and Christoph Feichtenhofer, received the Best Paper Honorable Mention award at ICLR 2025, one of only six papers recognized out of more than 11,000 submissions.
SAM 2 is the first unified model for promptable segmentation in both images and videos. When applied to a single image (with the memory module empty), SAM 2 functions identically to an image segmentation model. When applied to video, it processes frames sequentially in a streaming fashion, maintaining a memory of past frames and predictions.
SAM 2 builds on the original SAM design but introduces several new components for temporal reasoning.
Image Encoder. SAM 2 replaces the MAE-pretrained ViT from SAM with a Hiera (Hierarchical) image encoder, also pretrained with MAE. Hiera is a hierarchical vision transformer that produces multi-scale features, with stride-16 and stride-32 features from its later stages fused via a feature pyramid network. Stride-4 and stride-8 features from earlier stages are used during mask decoder upsampling for finer spatial detail.
Prompt Encoder and Mask Decoder. These components follow the same design as SAM. The mask decoder uses two-way transformer blocks that update both prompt and frame embeddings. SAM 2 adds a new occlusion prediction head to the mask decoder, which predicts whether the target object is visible or occluded in the current frame.
Memory Encoder. After processing a frame, the memory encoder fuses the downsampled output mask with the image encoder embeddings through convolutional layers and projects them to a compact 64-dimensional representation for storage in the memory bank.
Memory Bank. The memory bank is a FIFO (first-in, first-out) queue that retains spatial feature maps from the N most recent frames (N=6 by default) and up to M prompted frames. It also stores lightweight 256-dimensional object pointer vectors (split into 4 x 64-dimensional tokens) extracted from the mask decoder output. Temporal position embeddings are added to recent frame memories to encode their relative time position.
Memory Attention. Memory attention is the mechanism that allows SAM 2 to condition the current frame's processing on information from past frames. It is implemented as a stack of L=4 transformer blocks. Each block performs self-attention on the current frame's features, followed by cross-attention to the memories and object pointers stored in the memory bank. The cross-attention layers use 2D Rotary Position Embeddings (RoPE) for spatial encoding.
SAM 2 processes video frames one at a time in a streaming fashion, which makes it possible to segment arbitrarily long videos in real time without loading the entire video into memory. Users can provide prompts (points, boxes, or masks) on any frame, and the model propagates the segmentation both forward and backward through the video. This design supports interactive refinement: a user can correct the model's prediction on any frame, and the correction propagates to update segmentation on surrounding frames.
SAM 2 was trained on the SA-V (Segment Anything Video) dataset, the largest video segmentation dataset at the time of release.
| Statistic | Value |
|---|---|
| Total videos | 50,900 |
| Total hours of video | 196 |
| Total frames | 4.2 million |
| Total masklets (spatiotemporal masks) | 642,600 |
| Manual masklets | 190,900 |
| Automatic masklets (verified by annotators) | 451,700 |
| Total annotated masks across all frames | 35.5 million |
| Indoor vs. outdoor | 54% indoor, 46% outdoor |
| Average video duration | 14 seconds |
| Geographic coverage | 47 countries |
| Disappearance rate (manual annotations) | 42.5% |
| Video resolution range | 240p to 4K |
| Average resolution | 1,401 x 1,037 pixels |
The SA-V dataset is approximately 53 times larger than existing video object segmentation (VOS) datasets in terms of total annotated masks. It was released under the CC BY 4.0 license for research use.
SAM 2 is available in four sizes based on different Hiera backbone configurations.
| Variant | Parameters | FPS (A100, uncompiled) | FPS (A100, compiled) | SA-V Test (J&F) | MOSE Val (J&F) | LVOS v2 (J&F) |
|---|---|---|---|---|---|---|
| SAM 2 Tiny | 38.9M | 47.2 | 91.2 | 75.0 | 70.9 | 75.3 |
| SAM 2 Small | 46.0M | 43.3 | 84.8 | 74.9 | 71.6 | 76.4 |
| SAM 2 Base+ | 80.8M | 34.8 | 64.1 | 74.7 | 72.8 | 75.5 |
| SAM 2 Large | 224.4M | 24.2 | 39.5 | 76.0 | 74.6 | 79.2 |
All variants run at real-time or near-real-time speeds on an NVIDIA A100 GPU. With torch.compile enabled, even the Tiny variant exceeds 90 FPS. The Large variant provides the highest accuracy across all benchmarks.
SAM 2 was evaluated on established video object segmentation (VOS) benchmarks using the J&F metric (the average of Jaccard index J and contour accuracy F). In the semi-supervised setting with a first-frame 3-click prompt, SAM 2 achieved strong results.
| Method | DAVIS 2017 (J&F) | MOSE Val (J&F) | LVOS Val (J&F) |
|---|---|---|---|
| XMem | 86.0 | 59.6 | N/A |
| DeAOT | 86.2 | 59.9 | N/A |
| DEVA | 87.0 | 66.0 | 55.9 |
| Cutie-base+ | 88.1 | 71.7 | N/A |
| SAM 2 (Base+) | 90.2 | 76.6 | 78.0 |
| SAM 2 (Large) | 90.7 | 77.9 | 78.0 |
SAM 2 outperformed all prior methods by substantial margins on the challenging MOSE and LVOS benchmarks, which feature heavy occlusions, complex object interactions, and long-duration videos. On the classic DAVIS 2017 benchmark, SAM 2 also achieved the highest scores.
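The J&F metric used in the tables above can be sketched as follows. The official benchmark matches boundaries with a distance tolerance via bipartite matching; exact boundary overlap stands in for that matching here, so this is a simplified illustration:

```python
import numpy as np

# Simplified sketch of J&F: J is the region Jaccard index (IoU), F is a
# boundary F-measure, and the score is their average.

def jaccard(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def boundary(mask):
    """Mask pixels with at least one background 4-neighbor."""
    padded = np.pad(mask, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    return mask & ~interior

def j_and_f(pred, gt):
    bp, bg = boundary(pred), boundary(gt)
    tp = np.logical_and(bp, bg).sum()
    precision = tp / bp.sum() if bp.sum() else 1.0
    recall = tp / bg.sum() if bg.sum() else 1.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return (jaccard(pred, gt) + f) / 2
```

For video, J&F is computed per frame and per object, then averaged over the sequence.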
In interactive video segmentation evaluations, SAM 2 achieved better segmentation accuracy while requiring 3 times fewer user interactions compared to prior interactive approaches.
For image segmentation, SAM 2 was also evaluated on the SA-23 zero-shot benchmark suite (the same 23-dataset suite used for SAM). SAM 2 achieved 61.9 mIoU with a single click when trained on a mix of SA-1B and SA-V data, compared to 58.9 mIoU for SAM trained on SA-1B alone. SAM 2 also runs at 130.1 FPS for images on an A100, which is 6 times faster than the original SAM (21.7 FPS).
SAM 2 training involved multiple stages:

1. Pre-training for promptable segmentation on static images from SA-1B.
2. Full training on a mixture of image and video data, combining SA-1B with SA-V and additional licensed video data, with interactive prompts simulated across sampled frame sequences.
Data augmentation included horizontal flips, affine transforms, color jittering, grayscale conversion, and mosaic 2x2 composition (at 10% probability).
In September 2024, Meta released SAM 2.1, an improved version of SAM 2 with enhanced segmentation performance across diverse scenarios.
SAM 2.1 addresses several limitations of SAM 2 through targeted training improvements:

- Additional data augmentation that simulates visually similar objects and small objects, which SAM 2 often confused or missed.
- Improved occlusion handling, achieved by training on longer frame sequences and by small adjustments to the positional encoding of spatial memories and object pointers.
SAM 2.1 showed improvements over SAM 2 across all benchmarks.
| Variant | SA-V Test (J&F) | MOSE Val (J&F) | LVOS v2 (J&F) |
|---|---|---|---|
| SAM 2.1 Tiny | 76.5 | 71.8 | 77.3 |
| SAM 2.1 Small | 76.6 | 73.5 | 78.3 |
| SAM 2.1 Base+ | 78.2 | 73.7 | 78.2 |
| SAM 2.1 Large | 79.5 | 74.6 | 80.6 |
Compared to SAM 2, the Large variant improved from 76.0 to 79.5 on SA-V Test (a gain of 3.5 points) and from 79.2 to 80.6 on LVOS v2.
In December 2024, Meta released further engineering improvements:

- Full-model compilation with `torch.compile`, enabled in the video predictor via `vos_optimized=True`, delivering a major speedup for video object segmentation.

SAM 2.1 is available through multiple platforms including the official GitHub repository, Hugging Face, and Amazon SageMaker JumpStart.
SAM and SAM 2 have been adopted across a wide range of domains, both as standalone tools and as components within larger systems.
SAM has attracted significant attention in the medical imaging community. Researchers have explored its use for segmenting organs in CT and MRI scans, detecting tumors, counting cells in microscopy, and delineating surgical tools in endoscopic video. Specialized adaptations include MedSAM, which fine-tunes SAM on large collections of medical images, and LiteMedSAM, a lightweight variant that runs 10 times faster than MedSAM. Medical SAM 2 extends the video-based approach to process 3D medical volumes (such as CT scans) by treating slices as video frames.
However, studies have shown that SAM's out-of-the-box performance on medical images is significantly lower than on natural images. Evaluations across 12 medical image segmentation datasets found that SAM's Dice scores were between 0.1 and 0.7 lower (on the 0-to-1 Dice scale) than those of specialized medical segmentation algorithms. Fine-tuning or adapter-based approaches are typically necessary for competitive performance in medical imaging.
SAM has been applied to satellite and aerial image analysis for tasks including land use classification, building footprint extraction, flood mapping, and agricultural monitoring. Its ability to segment objects without class-specific training makes it useful for tasks where labeled data is scarce. Research on landslide detection in satellite imagery has combined cascade R-CNN with SAM 2 for improved accuracy. SAM delivers strong performance when given box prompts derived from other detection models, though its point-prompt accuracy on remote sensing data can be inconsistent due to the domain gap between natural images and satellite imagery.
SAM 2's video segmentation capabilities enable applications in video editing, including object selection and tracking, background removal, rotoscoping, and visual effects compositing. Users can click on an object in a single frame, and SAM 2 tracks and segments that object throughout the entire video. This dramatically reduces the manual effort required for tasks that traditionally demanded frame-by-frame annotation.
In robotics, SAM provides open-world object segmentation that helps robots understand and interact with unstructured environments. Robotic manipulation tasks, such as grasping specific items from cluttered shelves, benefit from SAM's ability to segment arbitrary objects without predefined class lists. Multi-modal extensions that combine SAM with language models enable robots to segment objects described in natural language.
SAM and its variants have been explored for autonomous driving perception pipelines, where comprehensive segmentation of roads, vehicles, pedestrians, and obstacles is critical. While task-specific models like panoptic segmentation networks remain dominant in production systems, SAM's zero-shot capabilities are useful for handling unusual objects or edge cases not covered by the training distribution of specialized models.
SAM's masks serve as inputs to 3D reconstruction pipelines, providing object-level decomposition of scenes. Combining SAM with depth estimation models or NeRF enables object-aware 3D scene reconstruction. SAM 2's temporal consistency in video also supports multi-view 3D reconstruction workflows.
The computational cost of SAM's ViT-H image encoder has motivated research into more efficient alternatives. Several notable variants have been developed.
| Variant | Developer | Key Approach | Speed Improvement |
|---|---|---|---|
| FastSAM | Zhao et al. (2023) | Replaces ViT encoder with YOLOv8 backbone | 50x faster than SAM |
| MobileSAM | Zhang et al. (2023) | Distills ViT-H encoder into a lightweight TinyViT | 60x smaller image encoder |
| EfficientSAM | Meta (2023) | Uses SAMI-pretrained lightweight ViT encoders | Significant speedup with minimal accuracy loss |
| EfficientViT-SAM | MIT (2024) | Replaces ViT-H with EfficientViT backbone | Hardware-efficient, no accuracy loss claimed |
FastSAM, for example, uses a YOLO-based architecture (23.7 MB, 11.8M parameters) that runs at 55.9 ms per image on CPU, compared to SAM ViT-B at 49,401 ms per image on CPU. MobileSAM (40.7 MB, 10.1M parameters) achieves a similar speedup. These efficient variants trade some segmentation quality for dramatically faster inference, making SAM-like capabilities accessible on edge devices and mobile platforms.
Despite its strong generalization, SAM has several known limitations:

- It can miss fine structures and sometimes hallucinates small disconnected components.
- Its mask boundaries are often less crisp than those produced by methods designed for high-resolution, zoomed-in interactive segmentation.
- The ViT-H image encoder is too computationally heavy for real-time use without precomputed embeddings or a lighter variant.
- Its text-prompt capability was exploratory and was not included in the public release.
- It produces class-agnostic masks and does not assign semantic labels to the objects it segments.
The release of SAM marked a turning point for computer vision research and applications. Several aspects of the project have had lasting influence:
Foundation model paradigm for vision. SAM demonstrated that the foundation model approach, which had been highly successful in NLP, could be applied to dense pixel-level vision tasks. This inspired subsequent foundation models for other visual tasks, including depth estimation, optical flow, and pose estimation.
Data engine methodology. The three-stage data engine used to create SA-1B showed how a model can be used in a feedback loop with human annotators to create training data at scale. This methodology has been adopted by other projects building large-scale vision datasets.
Scale of annotations. The SA-1B dataset, with its 1.1 billion masks, was an order of magnitude larger than any previous segmentation dataset. This scale enabled generalization properties that were not achievable with smaller datasets.
Open release. By releasing the model weights, code, dataset, and an interactive demo under permissive licenses, Meta enabled rapid adoption and follow-up research across the community. Within two years of its release, the original SAM paper had been cited thousands of times, and hundreds of derivative works had been published.
Video extension. SAM 2 extended the paradigm to video, establishing that a single model could handle both image and video segmentation tasks in a unified framework. The streaming memory architecture influenced subsequent work on temporal reasoning in vision models.