SAM 2
Last reviewed
May 20, 2026
SAM 2 (Segment Anything Model 2) is a promptable visual segmentation model for both images and video developed by Meta AI and released on 29 July 2024.[^1] It extends the original Segment Anything Model (SAM) by introducing a streaming memory module that propagates user prompts and predicted masks across video frames, allowing a single model to handle still-image segmentation and video object segmentation in a unified framework.[^2] The model was trained on a newly collected dataset, SA-V (Segment Anything Video), comprising roughly 51,000 videos and approximately 643,000 spatio-temporal mask annotations (masklets).[^3] Meta released the weights, code, and dataset under permissive licenses (Apache 2.0 for code and checkpoints, CC BY 4.0 for SA-V), and the system has since been adopted as a building block in research and industrial pipelines for video annotation, medical imaging, and other downstream tasks.[^4][^5]
| Field | Value |
|---|---|
| Developer | Meta AI (FAIR) |
| Initial release | 29 July 2024 (SAM 2)[^1] |
| Latest version | SAM 2.1 (30 September 2024)[^6] |
| Paper | Ravi et al., "SAM 2: Segment Anything in Images and Videos," arXiv:2408.00714[^7] |
| Code license | Apache 2.0[^5] |
| Dataset license | CC BY 4.0 (SA-V)[^8] |
| Model sizes | Tiny (38.9M), Small (46M), Base+ (80.8M), Large (224.4M) parameters[^9] |
| Image backbone | Hiera (MAE-pretrained hierarchical ViT)[^10] |
| Dataset | SA-V: ~51K videos, ~643K masklets[^3] |
| Tasks | Promptable image segmentation, semi-supervised and interactive video object segmentation |
The first Segment Anything Model was released by Meta AI in April 2023, framing segmentation as a promptable task in which a user supplies points, boxes, or coarse masks and the model returns one or more valid object masks.[^11] SAM was trained on SA-1B, a dataset of more than one billion masks collected from eleven million images. While SAM established strong zero-shot generalization across images, it operated on static images only. Researchers and practitioners who wanted to apply SAM to video typically combined it with separate trackers such as XMem or Cutie, an approach that often accumulated errors when an object became occluded or re-emerged later in a clip.[^7]
SAM 2 was framed by its authors as a generalization of SAM to the temporal domain. Instead of treating video segmentation as a downstream coupling of an image segmenter and a tracker, SAM 2 unifies the task by adding a streaming memory module so that prompts and masks given on any frame influence predictions on subsequent frames.[^7] The authors describe the resulting capability as "promptable visual segmentation" (PVS), where a user provides a small number of clicks, boxes, or mask prompts on any subset of frames in a video, and the model returns a complete spatio-temporal mask (a "masklet") covering all frames.[^7]
The model and the SA-V dataset were announced on 29 July 2024 in a Meta AI research blog post and a companion arXiv preprint.[^1][^7] The paper lists Nikhila Ravi as first author, with contributions from Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer.[^7] Several of these authors had previously co-authored the original SAM paper and broader computer vision work at Meta, including Hiera and Masked Autoencoders.
The initial release on the Meta AI blog included the paper, four pretrained checkpoints, a web demo, inference code on GitHub, and the SA-V dataset.[^1][^5] A revised release labeled SAM 2.1 followed on 30 September 2024 with updated checkpoints and the first public release of the training code.[^6] In February 2025, Meta and Amazon announced that SAM 2.1 was available as a managed model in Amazon SageMaker JumpStart.[^12]
SAM 2 is structured as five interacting components: an image encoder, a memory attention module, a prompt encoder, a mask decoder with an occlusion head, and a memory encoder/memory bank that retains information from previous frames.[^7] When SAM 2 is applied to a single image, the memory bank is empty and the model behaves like a faster, more accurate image segmenter than SAM. When applied to a video, the memory bank fills with features and predictions from earlier frames, allowing the network to propagate user prompts forward in time.[^2]
The image encoder is a hierarchical Vision Transformer called Hiera, pretrained as a masked autoencoder (MAE) on natural images.[^10] Hiera was introduced by Ryali et al. in 2023 as a stripped-down hierarchical Transformer that uses standard attention blocks across stages of different spatial resolution, instead of the bespoke shifted-window or convolutional attention patterns used by alternatives such as the Swin Transformer or MViT.[^10] SAM 2 evaluates four backbone scales corresponding to Hiera-T, Hiera-S, Hiera-B+, and Hiera-L; these yield models with 38.9, 46.0, 80.8, and 224.4 million parameters respectively.[^9]
A Feature Pyramid Network (FPN) on top of the Hiera backbone fuses stride-16 and stride-32 features from Stages 3 and 4 to produce per-frame image embeddings used by the rest of the network.[^7] Higher-resolution features from Stages 1 and 2 (stride 4 and 8) are kept aside and routed via skip connections directly into the mask decoder's upsampling layers, where they contribute fine spatial detail for the final mask output.[^7] The encoder is run exactly once per frame, regardless of how many prompts or how many objects the user later interacts with, which makes the dominant cost in video inference independent of the number of objects.[^7]
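To make the strides concrete, the sketch below computes the side length of each stage's feature map for a square input. The 1024-pixel input size and the helper name `feature_map_sizes` are assumptions made for illustration; the stage-to-stride mapping and the FPN/skip split follow the description above.

```python
# Illustrative only: spatial size of each Hiera stage output for a square input.
STAGE_STRIDES = {1: 4, 2: 8, 3: 16, 4: 32}  # stage -> stride, as described above
FPN_STAGES = (3, 4)    # fused by the FPN into the per-frame image embedding
SKIP_STAGES = (1, 2)   # routed by skip connections to the mask decoder upsampler


def feature_map_sizes(image_size: int) -> dict:
    """Return {stage: side length of the feature map} for a square image."""
    return {stage: image_size // stride for stage, stride in STAGE_STRIDES.items()}


sizes = feature_map_sizes(1024)  # assumed input resolution, not stated above
for stage, side in sizes.items():
    role = "FPN" if stage in FPN_STAGES else "skip connection"
    print(f"stage {stage}: {side}x{side} ({role})")
```

Under that assumed resolution, the stride-4 map has sixteen times as many positions per side-squared as the stride-16 map, which is why the high-resolution stages are reserved for the final upsampling step rather than for attention.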
The memory attention module is a stack of transformer blocks (the paper uses L=4) that conditions the current frame's tokens on features from previous frames and on the prompts provided so far.[^7] Each block performs self-attention over the current frame tokens, followed by cross-attention against the contents of the memory bank, and finally an MLP.[^7] The memory attention layers use 2-D spatial Rotary Position Embedding (RoPE) so that the geometry of past frames can be related to the spatial position of the current frame.[^7]
This module is what distinguishes SAM 2 from a stateless image segmenter. Without memory attention, each frame would be processed independently and a click placed on frame zero would not influence predictions on frame one. With memory attention, prompts and predicted masks effectively "flow" across time, giving the network the ability to track objects through occlusion and reappearance.[^2]
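The ordering inside one block (self-attention, then cross-attention to memory, then an MLP) can be sketched in plain Python. This is a toy single-head version under loud simplifications: there are no learned projections, no layer normalization, and no RoPE; skipping cross-attention when the memory is empty is a shortcut of this sketch; and all function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def add(xs, ys):
    """Residual connection: element-wise sum of two token lists."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def memory_attention_block(frame_tokens, memory_tokens, mlp):
    # 1) self-attention over the current frame's tokens
    x = add(frame_tokens, attend(frame_tokens, frame_tokens, frame_tokens))
    # 2) cross-attention against the memory bank (skipped here when it is empty)
    if memory_tokens:
        x = add(x, attend(x, memory_tokens, memory_tokens))
    # 3) per-token MLP
    return add(x, [mlp(t) for t in x])


def memory_attention(frame_tokens, memory_tokens, mlp, num_blocks=4):
    """Stack of blocks; num_blocks=4 mirrors the L=4 quoted above."""
    x = frame_tokens
    for _ in range(num_blocks):
        x = memory_attention_block(x, memory_tokens, mlp)
    return x
```

With an empty memory the block reduces to self-attention plus an MLP, which corresponds to the stateless per-frame case described above.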
The memory encoder takes the predicted mask for the current frame, fuses it with the Hiera image embedding through lightweight convolutional layers, and produces a compact spatial feature map that is appended to the memory bank.[^7] The memory bank is implemented as two FIFO (first-in, first-out) queues: one stores up to N=6 recent frame memories, and the other stores memories from frames where the user provided explicit prompts.[^7] Together with a set of low-dimensional "object pointer" vectors extracted from the mask decoder's output tokens, the memory bank supplies the cross-attention keys and values that condition future frames.[^7] Learned occlusion embeddings represent frames in which the target object is not visible, so that the model can reason about temporary disappearance without confusing it with a true mask change.[^7]
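The two-queue layout can be illustrated with bounded `collections.deque` objects. This is a minimal sketch, not the released implementation: the class name `MemoryBank`, the `max_prompted` parameter, and the string placeholders standing in for feature maps are all hypothetical, and object pointers are omitted.

```python
from collections import deque


class MemoryBank:
    """Toy memory bank: one FIFO for recent frames, one for prompted frames."""

    def __init__(self, max_recent=6, max_prompted=None):
        # maxlen makes each deque drop its oldest entry when full (FIFO).
        self.recent = deque(maxlen=max_recent)
        # Capacity of the prompted queue is an assumption; None means unbounded.
        self.prompted = deque(maxlen=max_prompted)

    def add(self, frame_idx, memory, prompted=False):
        queue = self.prompted if prompted else self.recent
        queue.append((frame_idx, memory))

    def entries(self):
        """Everything cross-attention would read: prompted first, then recent."""
        return list(self.prompted) + list(self.recent)


bank = MemoryBank()
bank.add(0, "mem0", prompted=True)   # the user clicked on frame 0
for t in range(1, 11):               # frames 1..10 are propagated without prompts
    bank.add(t, f"mem{t}")

recent_frames = [idx for idx, _ in bank.recent]
prompted_frames = [idx for idx, _ in bank.prompted]
print(recent_frames, prompted_frames)
```

Keeping prompted frames in their own queue means the user's original click survives however many unprompted frames are processed afterwards.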
The prompt encoder and mask decoder are close adaptations of the corresponding components in the original SAM.[^7] The prompt encoder converts positive/negative point clicks, bounding boxes, and coarse mask hints into token embeddings using positional encodings and learned vocabulary embeddings for sparse prompts, plus convolutional layers for dense mask prompts.[^7] The mask decoder takes the prompt tokens and the memory-attended image tokens and produces one or more candidate masks via a small Transformer that exchanges information between tokens and image features.[^7]
Two changes are notable relative to SAM. First, the SAM 2 mask decoder predicts multiple candidate masks for ambiguous prompts as in SAM, but selects among them using a confidence head trained jointly with the object pointer tokens, which improves consistency across frames.[^7] Second, SAM 2 adds an occlusion prediction head that outputs a per-frame binary signal indicating whether the target object is visible at all on that frame.[^7] The occlusion head is what allows the model to abstain from emitting a mask when the object is fully occluded, instead of hallucinating a plausible but incorrect region.
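The interaction between the two heads can be sketched as a small decision function. The function name, the `visibility_logit` sign convention (positive means visible), and the zero threshold are assumptions of this sketch rather than details given above.

```python
def select_mask(candidates, predicted_scores, visibility_logit, threshold=0.0):
    """Pick the highest-confidence candidate, or None if the object is judged absent.

    candidates        -- candidate masks produced for an ambiguous prompt
    predicted_scores  -- one confidence score per candidate
    visibility_logit  -- output of the occlusion head (assumed: > threshold means visible)
    """
    if visibility_logit <= threshold:
        return None  # abstain instead of emitting a plausible but wrong region
    best = max(range(len(candidates)), key=lambda i: predicted_scores[i])
    return candidates[best]


# Placeholder strings stand in for real mask arrays.
masks = ["whole object", "part", "subpart"]
scores = [0.62, 0.91, 0.40]

visible_choice = select_mask(masks, scores, visibility_logit=2.3)
occluded_choice = select_mask(masks, scores, visibility_logit=-1.0)
print(visible_choice, occluded_choice)
```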
For video, SAM 2 is designed to operate in a streaming fashion: frames are processed one at a time in temporal order, the memory encoder writes summaries into the memory bank, and the memory attention reads from the bank to condition the next frame.[^2] This streaming design has two consequences. First, latency per frame is roughly constant after warm-up, so the Tiny, Small, and Base+ variants can run in real time on an NVIDIA A100 GPU.[^9] Second, the bounded memory bank size keeps memory consumption flat as the video grows in length, which is important for long videos and for interactive use cases where the user may scrub back and forth.
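The control flow of streaming inference can be sketched as a generator that reads the memory, predicts, and then writes a new memory entry for each frame. The `segment` and `encode_memory` callables are stand-ins for the real network components, the names are hypothetical, and the separate queue for prompted frames is left out for brevity.

```python
from collections import deque


def propagate(frames, segment, encode_memory, max_recent=6):
    """Process frames in temporal order with a bounded memory of recent frames."""
    memory = deque(maxlen=max_recent)
    for idx, frame in enumerate(frames):
        mask = segment(frame, list(memory))         # read: condition on past frames
        memory.append(encode_memory(frame, mask))   # write: summarize this frame
        yield idx, mask, len(memory)


# Stub components so the loop runs end to end.
def fake_segment(frame, memory):
    return f"mask({frame})"


def fake_encode_memory(frame, mask):
    return (frame, mask)


results = list(propagate([f"frame{i}" for i in range(100)],
                         fake_segment, fake_encode_memory))
peak_memory_entries = max(size for _, _, size in results)
print(peak_memory_entries)
```

Because the deque is bounded, the number of stored entries stops growing after the first few frames regardless of video length, which is the flat memory profile described above.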
A central contribution of the SAM 2 release is the SA-V (Segment Anything Video) dataset, which the authors describe as the largest open video segmentation dataset at the time of release.[^3] The dataset is shared under the CC BY 4.0 license, with downloads hosted on the Meta AI datasets page.[^8]
SA-V contains approximately 50,900 source videos and roughly 643,000 masklets, where a masklet is a spatio-temporal mask covering a single object across multiple video frames.[^3] The data is divided into a manually annotated subset (SA-V Manual) of around 190,900 masklets and an automatically annotated subset (SA-V Auto) of around 451,700 masklets that were generated by SAM 2 and verified by human annotators.[^3] Videos in SA-V average approximately fourteen seconds in length at a typical resolution of 1401×1037, and they are class-agnostic: there are no semantic labels attached to the masks.[^8] The dataset was assembled from footage spanning 47 countries, with both indoor and outdoor scenes.[^4]
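The rounded figures above can be cross-checked with a few lines of arithmetic; the variable names are illustrative and the inputs are the approximate counts quoted in this section.

```python
videos = 50_900
manual_masklets = 190_900
auto_masklets = 451_700

total_masklets = manual_masklets + auto_masklets   # should be close to 643,000
auto_share = auto_masklets / total_masklets        # fraction produced automatically
masklets_per_video = total_masklets / videos       # average density per video

print(total_masklets)
print(f"{auto_share:.1%} automatic, {masklets_per_video:.1f} masklets per video")
```

The two subsets sum to 642,600, consistent with the "roughly 643,000" headline figure, and a little over two thirds of the masklets come from the automatic subset.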
For comparison, the authors note that SA-V is roughly 15 times larger than the largest prior open video segmentation datasets in terms of videos and more than 50 times larger in terms of masks, with BURST and UVO-dense cited as prior baselines.[^3][^7]
SA-V was collected through a model-in-the-loop data engine that proceeded in three phases, each tightening the cost of annotation as the model improved.[^7]
The trajectory illustrates a recurring pattern in foundation model data construction: early annotation is slow and unaided, intermediate models accelerate it, and the final model is good enough to do most of the work with humans acting as verifiers. The same pattern was used in the original SAM data engine for SA-1B and in similar form by other large vision datasets.[^11]
SAM 2 is trained on a mixture of image and video data. The published mixture is approximately 15.5% SA-1B images, 49.5% SA-V Manual and SA-V Auto masklets, 15.1% from an internal Meta dataset of 62,900 videos with 69,600 masklets, 9.4% MOSE, 9.2% YouTube-VOS, and 1.3% DAVIS.[^7] During training, the model sees short randomly sampled video clips together with a randomly sampled set of prompts; the loss combines per-frame mask losses, occlusion classification losses, and an IoU-style confidence prediction loss.[^7]
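One way to read the mixture is as sampling probabilities over data sources. The sketch below draws sources with `random.choices` under that reading; treating the percentages as per-sample probabilities is an assumption of this example, and the dictionary keys and function name are hypothetical.

```python
import random

# Percentages as quoted above; interpreting them as sampling weights is an assumption.
MIXTURE = {
    "SA-1B": 15.5,
    "SA-V": 49.5,
    "Internal": 15.1,
    "MOSE": 9.4,
    "YouTube-VOS": 9.2,
    "DAVIS": 1.3,
}


def sample_source(rng):
    """Draw one data source name in proportion to its mixture weight."""
    names = list(MIXTURE)
    weights = list(MIXTURE.values())
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
counts = {name: 0 for name in MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1

print(sum(MIXTURE.values()))  # the quoted shares should add up to about 100
print(counts)
```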
The Hiera image encoder is initialized from MAE pretraining and is updated jointly with the rest of the model during SAM 2 training.[^7] Hiera-L, the largest backbone, drives both peak accuracy and the bulk of computational cost; the Tiny, Small, and Base+ variants exist primarily to support real-time inference on lower-end hardware.[^9]
On the 23-dataset zero-shot benchmark used in the original SAM paper, SAM 2 (Hiera-B+) achieves higher one-click mean Intersection-over-Union (mIoU) than SAM (58.9 versus 58.1) while running about 6 times faster, with throughput of approximately 130 frames per second compared with about 22 for SAM.[^7][^13] The image-only result is consistent with the architectural fact that SAM 2 is, on a single image, essentially SAM with a Hiera backbone in place of the original ViT-H encoder and with the memory bank empty.[^2]
SAM 2 reports state-of-the-art results across several semi-supervised video object segmentation (VOS) benchmarks when given the ground-truth mask on the first frame.[^7] On DAVIS 2017 validation, the Hiera-B+ model achieves a J&F score of 90.2 compared with 88.1 for the strongest prior approach (Cutie-base+).[^7] On YouTube-VOS 2019 validation the score is 88.6 versus 87.5, on MOSE validation 76.6 versus 71.7, on LVOS validation 78.0 versus 66.0, and on the SA-V validation split 76.8 versus 61.3.[^7] The SA-V test split shows similar gains, with the Hiera-L checkpoint reaching roughly 79.5 J&F under SAM 2.1.[^9]
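The J&F score used in these comparisons is conventionally the mean of region similarity J (the Jaccard index, or IoU, between predicted and ground-truth masks) and contour accuracy F (an F-measure over boundary precision and recall). The sketch below computes J from small binary masks and takes boundary precision and recall as given, since boundary matching is beyond a short example; the function names and the empty-mask convention are assumptions.

```python
def jaccard(pred, truth):
    """Region similarity J: intersection over union of two binary masks."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def f_measure(precision, recall):
    """Contour accuracy F from boundary precision and recall (supplied by the caller)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def j_and_f(j, f):
    """Combined score: simple mean of J and F."""
    return (j + f) / 2


pred = [[1, 1], [0, 0]]
truth = [[1, 0], [0, 0]]

j = jaccard(pred, truth)                     # 1 shared pixel out of 2 in the union
f = f_measure(precision=1.0, recall=0.5)     # made-up boundary statistics
score = j_and_f(j, f)
print(round(j, 3), round(f, 3), round(score, 3))
```

Benchmark tables such as the ones quoted here report the score on a 0 to 100 scale, averaged over objects and frames.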
A second evaluation regime, interactive video segmentation, tests how quickly a user can produce a clean masklet by providing successive clicks. The authors report that SAM 2 reaches the accuracy of prior approaches such as SAM+XMem++ and SAM+Cutie with roughly 3× fewer user clicks on standard benchmarks.[^1][^7] On a held-out internal benchmark, end-to-end annotation through the Phase 3 SAM 2 interface was approximately 8.4 times faster than per-frame SAM annotation.[^7]
The four official checkpoints expose a tradeoff between accuracy and throughput. The table below summarizes the released numbers for SAM 2.1.[^9]
| Variant | Parameters | Speed (A100, FPS) | SA-V test J&F |
|---|---|---|---|
| Hiera-T (Tiny) | 38.9M | 91.2 | 76.5 |
| Hiera-S (Small) | 46.0M | 84.8 | 76.6 |
| Hiera-B+ (Base+) | 80.8M | 64.1 | 78.2 |
| Hiera-L (Large) | 224.4M | 39.5 | 79.5 |
Reported speeds are measured on a single A100 GPU under automatic mixed-precision in bfloat16, using PyTorch 2.3.1 and CUDA 12.1.[^14] Subsequent updates to the codebase added support for torch.compile, which the maintainers report as a substantial speedup for video inference.[^5]
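The tradeoff in the table can be turned into a simple selection rule: pick the most accurate variant that still meets a throughput target. The helper below is a hypothetical illustration using only the table's figures, which apply to the A100 setup described above and will differ on other hardware.

```python
# Figures copied from the table above (SAM 2.1, A100).
VARIANTS = {
    "Hiera-T": {"params_m": 38.9, "fps": 91.2, "sav_test_jf": 76.5},
    "Hiera-S": {"params_m": 46.0, "fps": 84.8, "sav_test_jf": 76.6},
    "Hiera-B+": {"params_m": 80.8, "fps": 64.1, "sav_test_jf": 78.2},
    "Hiera-L": {"params_m": 224.4, "fps": 39.5, "sav_test_jf": 79.5},
}


def most_accurate_at(min_fps):
    """Return the variant with the best SA-V test J&F among those meeting min_fps."""
    eligible = {name: v for name, v in VARIANTS.items() if v["fps"] >= min_fps}
    if not eligible:
        return None  # no released variant reaches the requested throughput
    return max(eligible, key=lambda name: eligible[name]["sav_test_jf"])


for target in (30, 60, 80, 100):
    print(target, most_accurate_at(target))
```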
On 30 September 2024, Meta released SAM 2.1, a refresh of all four checkpoints together with training code and updated demo assets.[^6] The principal changes were: additional data augmentation aimed at small and visually similar objects; training on longer frame sequences to improve handling of occlusion and reappearance; revised positional encoding for memory tokens and object pointers; and the first public release of training code, which had been withheld at the initial launch.[^6][^15] Reported benchmark numbers for SAM 2.1 are slightly higher than the original SAM 2 checkpoints across video benchmarks.[^9]
In February 2025, Meta and Amazon announced that all four SAM 2.1 variants were available as managed deployments through Amazon SageMaker JumpStart, with inference endpoints supporting both image and video segmentation.[^12] Several third-party platforms have also packaged SAM 2 for end users: Roboflow integrated SAM 2 as a label-assist tool and offers a hosted API,[^16] and Ultralytics ships SAM 2 weights as part of its computer-vision SDK with helpers for inference and visualization.[^14]
Because SAM 2 is permissively licensed, the model has spawned a substantial set of community derivatives. Examples include SAM2-UNet, which uses the SAM 2 image encoder as a backbone for a U-Net-style architecture targeting medical and natural images,[^17] and Efficient-SAM2, which proposes object-aware visual encoding and memory retrieval to accelerate inference.[^18] These derivatives appear primarily on arXiv preprints and on Hugging Face.
SAM 2's promptable, video-aware design lends itself to interactive annotation, downstream computer vision pipelines, and a variety of vertical applications.
The Meta launch materials also list disaster response, wildlife monitoring, and digital fashion as motivating examples.[^4]
The table below summarizes the principal differences between SAM and SAM 2.
| Property | SAM (2023) | SAM 2 (2024) |
|---|---|---|
| Input domain | Images | Images and video |
| Image encoder | Plain ViT (B/L/H) | Hiera (T/S/B+/L) |
| Memory module | None | Streaming memory bank + memory attention |
| Occlusion head | No | Yes |
| Dataset | SA-1B (≈1.1B masks, 11M images) | SA-V (≈643K masklets, 51K videos) + SA-1B |
| Speed (image, A100) | ≈22 FPS (ViT-H)[^7] | ≈130 FPS (Hiera-B+)[^7] |
| License | Apache 2.0 | Apache 2.0[^5] |
| Released | April 2023[^11] | 29 July 2024[^1] |
In short, SAM 2 keeps the prompt-driven philosophy of SAM and adds streaming memory plus a new image backbone, while expanding training data to include large-scale video.
Several research strands provide the technical context for SAM 2:
Despite strong benchmark results, several limitations are reported in the SAM 2 paper and in subsequent independent evaluations.
Initial reception of SAM 2 in technical press and developer communities was largely positive, with coverage emphasizing the 6× speedup and improved accuracy on image segmentation, the 3× reduction in interactions for video annotation, and the open release of weights, code, and dataset.[^13][^1] Several practitioners noted that SAM 2 effectively replaced the previously common pattern of "SAM plus tracker" for video pipelines.[^16]
Within the computer vision research community, SAM 2 has become a common baseline and feature extractor for video segmentation papers published since late 2024, and its checkpoints are frequently used as backbones for domain-adapted derivatives.[^17][^18] Industrial adoption has been led by tools focused on annotation and labeling, including Roboflow's integration as a labeling assistant and Ultralytics' inclusion of SAM 2 in its SDK.[^16][^14] Cloud availability through Amazon SageMaker JumpStart further reduced the operational cost of running SAM 2 at scale.[^12]
The SA-V dataset has been less widely adopted than SA-1B for pretraining, in part because of its size and in part because it is class-agnostic; however, it has become a standard evaluation suite for video segmentation systems, with the SA-V validation and test splits appearing alongside DAVIS, MOSE, and YouTube-VOS in subsequent papers.[^7][^3]