SAM 2

Updated Jul 16, 2026

SAM 2 (Segment Anything Model 2) is a promptable visual segmentation model for both images and video developed by Meta AI and released on 29 July 2024.^[1] It extends the original Segment Anything Model (SAM) by introducing a streaming memory module that propagates user prompts and predicted masks across video frames, allowing a single model to handle still-image segmentation and video object segmentation in a unified framework.^[2] The model was trained on a newly collected dataset, SA-V (Segment Anything Video), comprising roughly 51,000 videos and approximately 643,000 spatio-temporal mask annotations (masklets).^[3] Meta released the weights, training code, and dataset under permissive licenses (Apache 2.0 for code and checkpoints, CC BY 4.0 for SA-V), and the system has since been adopted as a building block in research and industrial pipelines for video annotation, medical imaging, and other downstream tasks.^[4]^[5]

Infobox

Field	Value
Developer	Meta AI (FAIR)
Initial release	29 July 2024 (SAM 2)^[1]
Latest version	SAM 2.1 (30 September 2024)^[6]
Paper	Ravi et al., "SAM 2: Segment Anything in Images and Videos," arXiv:2408.00714^[7]
Code license	Apache 2.0^[5]
Dataset license	CC BY 4.0 (SA-V)^[8]
Model sizes	Tiny (38.9M), Small (46M), Base+ (80.8M), Large (224.4M) parameters^[9]
Image backbone	Hiera (MAE-pretrained hierarchical ViT)^[10]
Dataset	SA-V: ~51K videos, ~643K masklets^[3]
Tasks	Promptable image segmentation, semi-supervised and interactive video object segmentation

Background

From SAM to SAM 2

The first Segment Anything Model was released by Meta AI in April 2023, framing segmentation as a promptable task in which a user supplies points, boxes, or coarse masks and the model returns one or more valid object masks.^[11] SAM was trained on SA-1B, a dataset of more than one billion masks collected from eleven million images. While SAM established strong zero-shot generalization across images, it operated on static images only. Researchers and practitioners who wanted to apply SAM to video typically combined it with separate trackers such as XMem or Cutie, an approach in which errors often accumulated when an object became occluded and later re-emerged in a clip.^[7]

SAM 2 was framed by its authors as a generalization of SAM to the temporal domain. Instead of treating video segmentation as a downstream coupling of an image segmenter and a tracker, SAM 2 unifies the task by adding a streaming memory module so that prompts and masks given on any frame influence predictions on subsequent frames.^[7] The authors describe the resulting capability as "promptable visual segmentation" (PVS), where a user provides a small number of clicks, boxes, or mask prompts on any subset of frames in a video, and the model returns a complete spatio-temporal mask (a "masklet") covering all frames.^[7]

Release and authorship

The model and the SA-V dataset were announced on 29 July 2024 in a Meta AI research blog post and a companion arXiv preprint.^[1]^[7] The paper lists Nikhila Ravi as first author, with contributions from Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer.^[7] Several of these authors had previously co-authored the original SAM paper and broader computer vision work at Meta, including Hiera and Masked Autoencoders.

The initial release on the Meta AI blog included the paper, four pretrained checkpoints, a web demo, inference code on GitHub, and the SA-V dataset.^[1]^[5] A revised release labeled SAM 2.1 followed on 30 September 2024 with updated checkpoints and the first public release of the training code.^[6] In February 2025, Meta and Amazon announced that SAM 2.1 was available as a managed model in Amazon SageMaker JumpStart.^[12]

Architecture

SAM 2 is structured as five interacting components: an image encoder, a memory attention module, a prompt encoder, a mask decoder with an occlusion head, and a memory encoder/memory bank that retains information from previous frames.^[7] When SAM 2 is applied to a single image, the memory bank is empty and the model behaves like a faster, more accurate image segmenter than SAM. When applied to a video, the memory bank fills with features and predictions from earlier frames, allowing the network to propagate user prompts forward in time.^[2]

Image encoder

The image encoder is a hierarchical Vision Transformer called Hiera, pretrained as a masked autoencoder (MAE) on natural images.^[10] Hiera was introduced by Ryali et al. in 2023 as a stripped-down hierarchical Transformer that uses standard attention blocks across stages of different spatial resolution, instead of the bespoke shifted-window or convolutional attention patterns used by alternatives such as the Swin Transformer or MViT.^[10] SAM 2 evaluates four backbone scales corresponding to Hiera-T, Hiera-S, Hiera-B+, and Hiera-L; these yield models with 38.9, 46.0, 80.8, and 224.4 million parameters respectively.^[9]

A Feature Pyramid Network (FPN) on top of the Hiera backbone fuses stride-16 and stride-32 features from Stages 3 and 4 to produce per-frame image embeddings used by the rest of the network.^[7] Higher-resolution features from Stages 1 and 2 (stride 4 and 8) are kept aside and routed via skip connections directly into the mask decoder's upsampling layers, where they contribute fine spatial detail for the final mask output.^[7] The encoder is run exactly once per frame, regardless of how many prompts or how many objects the user later interacts with, so the image-encoding cost of video inference is independent of the number of objects; only the lighter memory attention and mask decoding steps are repeated per object.^[7]
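The stride arithmetic above can be made concrete with a small sketch. The 1024-pixel square input and the helper names are assumptions chosen for illustration; the article does not state the input resolution.

```python
# Strides of the four Hiera stages as described in the text.
STAGE_STRIDES = {1: 4, 2: 8, 3: 16, 4: 32}


def feature_map_size(input_size: int, stride: int) -> int:
    """Side length of a square feature map for a given input size and stride."""
    if input_size % stride != 0:
        raise ValueError("input size must be divisible by the stride")
    return input_size // stride


def stage_sizes(input_size: int = 1024) -> dict:
    """Map each stage to its feature-map side length (1024 is an assumed input size)."""
    return {stage: feature_map_size(input_size, s) for stage, s in STAGE_STRIDES.items()}


if __name__ == "__main__":
    sizes = stage_sizes()
    # Stages 3 and 4 feed the FPN; stages 1 and 2 reach the decoder via skip connections.
    print("FPN inputs:", sizes[3], sizes[4])
    print("Skip-connection inputs:", sizes[1], sizes[2])
```

Under that assumed input size, the skip-connection features are 16 and 4 times larger in area than the stride-16 embedding, which is why they are reserved for recovering fine mask detail.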

Memory attention

The memory attention module is a stack of transformer blocks (the paper uses L=4) that conditions the current frame's tokens on features from previous frames and on the prompts provided so far.^[7] Each block performs self-attention over the current frame tokens, followed by cross-attention against the contents of the memory bank, and finally an MLP.^[7] The memory attention layers use 2-D spatial Rotary Position Embedding (RoPE) so that the geometry of past frames can be related to the spatial position of the current frame.^[7]
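The block ordering described above (self-attention, then cross-attention to memory, then an MLP, repeated L=4 times) can be sketched in plain Python. This is a toy illustration only: it uses a single head, identity projections, and a ReLU in place of the learned MLP, and it omits RoPE and normalization. All function names are hypothetical and none of this is the released implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention with identity projections (single head)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out


def add(xs, ys):
    """Element-wise residual addition of two token lists."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def mlp(tokens):
    """Stand-in for the learned MLP: an element-wise ReLU."""
    return [[max(0.0, t) for t in tok] for tok in tokens]


def memory_attention_block(frame_tokens, memory_tokens):
    """Self-attention, then cross-attention to memory (if any), then the MLP stand-in."""
    x = add(frame_tokens, attend(frame_tokens, frame_tokens, frame_tokens))
    if memory_tokens:  # an empty bank (single image) skips cross-attention
        x = add(x, attend(x, memory_tokens, memory_tokens))
    return add(x, mlp(x))


def memory_attention(frame_tokens, memory_tokens, num_layers=4):
    """Apply the block num_layers times (L=4 in the text)."""
    x = frame_tokens
    for _ in range(num_layers):
        x = memory_attention_block(x, memory_tokens)
    return x
```

Calling `memory_attention(tokens, [])` and `memory_attention(tokens, memory)` on the same tokens gives different outputs, which is the sense in which memory conditions the current frame.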

This module is what distinguishes SAM 2 from a stateless image segmenter. Without memory attention, each frame would be processed independently and a click placed on frame zero would not influence predictions on frame one. With memory attention, prompts and predicted masks effectively "flow" across time, giving the network the ability to track objects through occlusion and reappearance.^[2]

Memory encoder and memory bank

The memory encoder takes the predicted mask for the current frame, fuses it with the Hiera image embedding through lightweight convolutional layers, and produces a compact spatial feature map that is appended to the memory bank.^[7] The memory bank is implemented as two FIFO (first-in, first-out) queues: one stores up to N=6 recent frame memories, and the other stores memories from frames where the user provided explicit prompts.^[7] Together with a set of low-dimensional "object pointer" vectors extracted from the mask decoder's output tokens, the memory bank supplies the cross-attention keys and values that condition future frames.^[7] Learned occlusion embeddings represent frames in which the target object is not visible, so that the model can reason about temporary disappearance without confusing it with a true mask change.^[7]
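A minimal sketch of the two-queue memory bank follows, using bounded deques. The cap of 6 recent frames comes from the text; the cap on prompted frames, the class name, and the method names are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, List, Tuple


@dataclass
class MemoryBank:
    """Two bounded FIFO queues: recent frames and prompted frames."""
    max_recent: int = 6     # N=6, as stated in the text
    max_prompted: int = 4   # hypothetical cap; not given in the text
    recent: Deque[Tuple[int, Any]] = field(init=False)
    prompted: Deque[Tuple[int, Any]] = field(init=False)

    def __post_init__(self) -> None:
        # deque(maxlen=...) silently drops the oldest entry when full (FIFO eviction).
        self.recent = deque(maxlen=self.max_recent)
        self.prompted = deque(maxlen=self.max_prompted)

    def add(self, frame_idx: int, memory: Any, was_prompted: bool) -> None:
        """Store a frame memory in the queue that matches how the frame was produced."""
        entry = (frame_idx, memory)
        if was_prompted:
            self.prompted.append(entry)
        else:
            self.recent.append(entry)

    def entries(self) -> List[Tuple[int, Any]]:
        """Everything a later frame would cross-attend to, prompted frames first."""
        return list(self.prompted) + list(self.recent)
```

For example, after adding a prompted frame 0 followed by unprompted frames 1 to 9, `entries()` holds frame 0 plus frames 4 to 9: the prompt survives while older unprompted memories are evicted.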

Prompt encoder and mask decoder

The prompt encoder and mask decoder are close adaptations of the corresponding components in the original SAM.^[7] The prompt encoder converts positive/negative point clicks, bounding boxes, and coarse mask hints into token embeddings using positional encodings and learned vocabulary embeddings for sparse prompts, plus convolutional layers for dense mask prompts.^[7] The mask decoder takes the prompt tokens and the memory-attended image tokens and produces one or more candidate masks via a small Transformer that exchanges information between tokens and image features.^[7]

Two changes are notable relative to SAM. First, the SAM 2 mask decoder predicts multiple candidate masks for ambiguous prompts as in SAM, but selects among them using a confidence head trained jointly with the object pointer tokens, which improves consistency across frames.^[7] Second, SAM 2 adds an occlusion prediction head that outputs a per-frame binary signal indicating whether the target object is visible at all on that frame.^[7] The occlusion head is what allows the model to abstain from emitting a mask when the object is fully occluded, instead of hallucinating a plausible but incorrect region.
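The interaction between candidate selection and the occlusion head can be sketched as a small decision rule. The names, the use of a logit compared against a threshold of 0.0, and the sign convention (positive means visible) are all assumptions for illustration, not details confirmed by the text.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Candidate:
    mask_id: str
    confidence: float  # predicted quality score for this candidate mask


def select_mask(candidates: Sequence[Candidate],
                visibility_logit: float,
                threshold: float = 0.0) -> Optional[Candidate]:
    """Return the most confident candidate, or None when the object is judged occluded."""
    if visibility_logit <= threshold or not candidates:
        return None  # abstain instead of emitting a guessed mask
    return max(candidates, key=lambda c: c.confidence)


if __name__ == "__main__":
    options = [Candidate("whole", 0.71), Candidate("part", 0.88), Candidate("subpart", 0.40)]
    print(select_mask(options, visibility_logit=2.3))   # the "part" candidate
    print(select_mask(options, visibility_logit=-1.5))  # None: object treated as occluded
```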

Streaming inference

For video, SAM 2 is designed to operate in a streaming fashion: frames are processed one at a time in temporal order, the memory encoder writes summaries into the memory bank, and the memory attention reads from the bank to condition the next frame.^[2] This streaming design has two consequences. First, latency per frame is roughly constant after warm-up, so throughput does not degrade as a video gets longer; the released checkpoints are reported at roughly 40 to 91 frames per second on an NVIDIA A100 GPU, with the Tiny, Small, and Base+ variants the fastest.^[9] Second, the bounded memory bank size keeps memory consumption flat as the video grows in length, which is important for long videos and for interactive use cases where the user may scrub back and forth.
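The streaming loop can be sketched as a generator that takes the four stages as callables. The function and parameter names are hypothetical, and the stages are passed in as stubs; the point is only the order of operations and the bounded state.

```python
from collections import deque
from typing import Any, Callable, Iterable, Iterator, List, Tuple


def stream_segment(frames: Iterable[Any],
                   encode: Callable[[Any], Any],
                   attend: Callable[[Any, List[Any]], Any],
                   decode: Callable[[Any], Any],
                   summarize: Callable[[Any, Any], Any],
                   max_memories: int = 6) -> Iterator[Tuple[int, Any, int]]:
    """Process frames in order, yielding (frame index, mask, memory-bank size)."""
    bank: deque = deque(maxlen=max_memories)  # bounded, so state stays flat over time
    for idx, frame in enumerate(frames):
        features = encode(frame)                    # image encoder, once per frame
        conditioned = attend(features, list(bank))  # read from memory
        mask = decode(conditioned)                  # predict the mask
        bank.append(summarize(features, mask))      # write a summary back to memory
        yield idx, mask, len(bank)


if __name__ == "__main__":
    # Toy stubs: the "mask" records the frame id and how many memories it saw.
    results = stream_segment(
        range(20),
        encode=lambda f: f,
        attend=lambda feats, mem: (feats, len(mem)),
        decode=lambda c: c,
        summarize=lambda feats, mask: feats,
    )
    for idx, mask, size in results:
        print(idx, mask, size)
```

Because the bank is capped, the work done per frame and the memory held are the same at frame 10 as at frame 10,000 in this sketch.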

The SA-V dataset

A central contribution of the SAM 2 release is the SA-V (Segment Anything Video) dataset, which the authors describe as the largest open video segmentation dataset at the time of release.^[3] The dataset is shared under the CC BY 4.0 license, with downloads hosted on the Meta AI datasets page.^[8]

Composition

SA-V contains approximately 50,900 source videos and roughly 643,000 masklets, where a masklet is a spatio-temporal mask covering a single object across multiple video frames.^[3] The data is divided into a manually annotated subset (SA-V Manual) of around 190,900 masklets and an automatically annotated subset (SA-V Auto) of around 451,700 masklets that were generated by SAM 2 and verified by human annotators.^[3] Videos in SA-V average approximately fourteen seconds in length at a typical resolution of 1401×1037, and they are class-agnostic: there are no semantic labels attached to the masks.^[8] The dataset was assembled from footage spanning 47 countries, with both indoor and outdoor scenes.^[4]

For comparison, the authors report that SA-V contains about 53 times more masks than the largest prior open video segmentation dataset (about 15 times more when automatic annotations are excluded) and roughly 4.5 times more videos, with BURST and UVO-dense cited as prior baselines.^[3]^[7]

The data engine

SA-V was collected through a model-in-the-loop data engine that proceeded in three phases, each tightening the cost of annotation as the model improved.^[7]

Phase 1 (SAM per frame). Annotators segmented each frame of a clip independently using SAM and standard mask-editing tools, with no temporal propagation between frames. This phase produced approximately 16,000 masklets at an average rate of about 37.8 seconds per frame.^[7]
Phase 2 (SAM + early SAM 2). Annotators created an initial mask on the first frame with SAM, then a mask-input-only version of SAM 2 propagated that mask through subsequent frames. Annotators corrected mistakes by re-annotating individual frames from scratch when necessary. This phase produced roughly 63,500 masklets at about 7.4 seconds per frame, a 5.1× speedup over Phase 1.^[7]
Phase 3 (full SAM 2). A more capable SAM 2 accepted both click prompts and mask prompts on any frame, leveraging memory to propagate edits both forwards and backwards. Annotators only had to provide occasional refinement clicks on intermediate frames. This phase produced approximately 197,000 masklets at about 4.5 seconds per frame, an 8.4× speedup over Phase 1.^[7]
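The speedups quoted for the three phases follow directly from the per-frame annotation times. The sketch below reproduces that arithmetic; the 100-frame clip in the usage example is a hypothetical figure, not one from the paper.

```python
# Average annotation time per frame for each data-engine phase, from the text.
SECONDS_PER_FRAME = {"Phase 1": 37.8, "Phase 2": 7.4, "Phase 3": 4.5}


def speedup_over_phase1(phase: str) -> float:
    """How many times faster a phase is than Phase 1, per annotated frame."""
    return SECONDS_PER_FRAME["Phase 1"] / SECONDS_PER_FRAME[phase]


def clip_annotation_seconds(phase: str, num_frames: int) -> float:
    """Estimated annotation time for a clip of num_frames frames."""
    return SECONDS_PER_FRAME[phase] * num_frames


if __name__ == "__main__":
    for phase in SECONDS_PER_FRAME:
        print(phase,
              "speedup:", round(speedup_over_phase1(phase), 1),
              "100-frame clip (s):", round(clip_annotation_seconds(phase, 100), 1))
```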

The trajectory illustrates a recurring pattern in foundation model data construction: early annotation is slow and unaided, intermediate models accelerate it, and the final model is good enough to do most of the work with humans acting as verifiers. The same pattern was used in the original SAM data engine for SA-1B and in similar form by other large vision datasets.^[11]

Training

SAM 2 is trained on a mixture of image and video data. The published mixture is approximately 15.5% SA-1B images, 49.5% SA-V Manual and SA-V Auto masklets, 15.1% an internal Meta dataset of 62,900 videos with 69,600 masklets, 9.4% MOSE, 9.2% YouTube-VOS, and 1.3% DAVIS.^[7] During training, the model sees short randomly sampled video clips together with a randomly sampled set of prompts; the loss combines per-frame mask losses, occlusion classification losses, and an IoU-style confidence prediction loss.^[7]
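One way to read the published mixture is as sampling weights over data sources. The sketch below draws sources in those proportions; whether the real training loop samples per example, per batch, or by another schedule is not stated here, so this is an assumption, as are the function names.

```python
import random
from collections import Counter
from typing import Dict, List

# Percentages from the text; SA-V combines the Manual and Auto subsets.
MIXTURE: Dict[str, float] = {
    "SA-1B": 15.5,
    "SA-V": 49.5,
    "Internal": 15.1,
    "MOSE": 9.4,
    "YouTube-VOS": 9.2,
    "DAVIS": 1.3,
}


def sample_sources(n: int, seed: int = 0) -> List[str]:
    """Draw n data-source names with probability proportional to the mixture."""
    rng = random.Random(seed)  # seeded for reproducibility
    return rng.choices(list(MIXTURE), weights=list(MIXTURE.values()), k=n)


def empirical_shares(draws: List[str]) -> Dict[str, float]:
    """Percentage of draws that came from each source."""
    counts = Counter(draws)
    return {name: 100.0 * counts[name] / len(draws) for name in MIXTURE}


if __name__ == "__main__":
    print(empirical_shares(sample_sources(10_000)))
```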

The Hiera image encoder is initialized from MAE pretraining and is updated jointly with the rest of the model during SAM 2 training.^[7] Hiera-L, the largest backbone, drives both peak accuracy and the bulk of computational cost; the Tiny, Small, and Base+ variants exist primarily to support real-time inference on lower-end hardware.^[9]

Evaluation

Image segmentation

On the 23-dataset zero-shot benchmark used in the original SAM paper, SAM 2 (Hiera-B+) achieves higher one-click mean Intersection-over-Union (mIoU) than SAM (58.9 versus 58.1) while running about 6 times faster, with throughput of approximately 130 frames per second compared with about 22 for SAM.^[7]^[13] The image-only result is consistent with the architectural fact that SAM 2 is, on a single image, essentially SAM with a Hiera backbone in place of the original ViT-H encoder and with the memory bank empty.^[2]
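For readers unfamiliar with the metric, the sketch below computes Intersection-over-Union for binary masks and averages it over a set of examples. Masks are represented as nested lists of 0/1 values; scoring two empty masks as 1.0 is a convention chosen here, not something the text specifies.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-Union of two same-sized binary masks."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; treating that as 1.0 is an assumed convention.
    return intersection / union if union else 1.0


def mean_iou(pairs: Sequence[Tuple[Mask, Mask]]) -> float:
    """Average IoU over (prediction, ground truth) pairs."""
    return sum(iou(p, t) for p, t in pairs) / len(pairs)


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    target = [[1, 0], [0, 0]]
    print(iou(pred, target))  # one shared pixel out of two covered pixels
```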

Video object segmentation

SAM 2 reports state-of-the-art results across several semi-supervised video object segmentation (VOS) benchmarks when given the ground-truth mask on the first frame.^[7] On DAVIS 2017 validation, the Hiera-B+ model achieves a J&F score (the mean of region similarity J and contour accuracy F) of 90.2 compared with 88.1 for the strongest prior approach (Cutie-base+).^[7] On YouTube-VOS 2019 validation the score is 88.6 versus 87.5, on MOSE validation 76.6 versus 71.7, on LVOS validation 78.0 versus 66.0, and on the SA-V validation split 76.8 versus 61.3.^[7] The SA-V test split shows similar gains, with the Hiera-L checkpoint reaching roughly 79.5 J&F under SAM 2.1.^[9]

A second evaluation regime, interactive video segmentation, tests how quickly a user can produce a clean masklet by providing successive clicks. The authors report that SAM 2 reaches the accuracy of prior approaches such as SAM+XMem++ and SAM+Cutie with roughly 3× fewer user clicks on standard benchmarks.^[1]^[7] On a held-out internal benchmark, end-to-end annotation through the Phase 3 SAM 2 interface was approximately 8.4 times faster than per-frame SAM annotation.^[7]

Model size and speed

The four official checkpoints expose a tradeoff between accuracy and throughput. The table below summarizes the released numbers for SAM 2.1.^[9]

Variant	Parameters	Speed (A100, FPS)	SA-V test J&F
Hiera-T (Tiny)	38.9M	91.2	76.5
Hiera-S (Small)	46.0M	84.8	76.6
Hiera-B+ (Base+)	80.8M	64.1	78.2
Hiera-L (Large)	224.4M	39.5	79.5

Reported speeds are measured on a single A100 GPU under automatic mixed-precision in bfloat16, using PyTorch 2.3.1 and CUDA 12.1.^[14] Subsequent updates to the codebase added support for torch.compile, which the maintainers report as a substantial speedup for video inference.^[5]
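The tradeoff in the table can be expressed as a simple selection rule: pick the most accurate checkpoint that still meets a throughput budget. The numbers below are copied from the table; the helper name and the idea of a frame-rate budget are illustrative assumptions, and the speeds only apply to the A100 setup described above.

```python
from typing import List, NamedTuple, Optional


class Variant(NamedTuple):
    name: str
    params_m: float     # parameters, in millions
    fps_a100: float     # reported speed on a single A100
    sav_test_jf: float  # reported J&F on the SA-V test split


# SAM 2.1 figures from the table above.
VARIANTS: List[Variant] = [
    Variant("Hiera-T", 38.9, 91.2, 76.5),
    Variant("Hiera-S", 46.0, 84.8, 76.6),
    Variant("Hiera-B+", 80.8, 64.1, 78.2),
    Variant("Hiera-L", 224.4, 39.5, 79.5),
]


def best_variant(min_fps: float) -> Optional[Variant]:
    """Most accurate variant whose reported A100 speed meets min_fps, if any."""
    eligible = [v for v in VARIANTS if v.fps_a100 >= min_fps]
    return max(eligible, key=lambda v: v.sav_test_jf) if eligible else None


if __name__ == "__main__":
    for budget in (30, 60, 80, 100):
        choice = best_variant(budget)
        print(budget, choice.name if choice else "no variant meets this budget")
```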

Variants and updates

SAM 2.1

On 30 September 2024, Meta released SAM 2.1, a refresh of all four checkpoints together with training code and updated demo assets.^[6] The principal changes were: additional data augmentation aimed at small and visually similar objects; training on longer frame sequences to improve handling of occlusion and reappearance; revised positional encoding for memory tokens and object pointers; and the first public release of training code, which had been withheld at the initial launch.^[6]^[15] Reported benchmark numbers for SAM 2.1 are slightly higher than the original SAM 2 checkpoints across video benchmarks.^[9]

Managed deployments

In February 2025, Meta and Amazon announced that all four SAM 2.1 variants were available as managed deployments through Amazon SageMaker JumpStart, with inference endpoints supporting both image and video segmentation.^[12] Several third-party platforms have also packaged SAM 2 for end users: Roboflow integrated SAM 2 as a label-assist tool and offers a hosted API,^[16] and Ultralytics ships SAM 2 weights as part of its computer-vision SDK with helpers for inference and visualization.^[14]

Community forks and derivatives

Because SAM 2 is permissively licensed, the model has spawned a substantial set of community derivatives. Examples include SAM2-UNet, which uses the SAM 2 image encoder as a backbone for a U-Net-style architecture targeting medical and natural images,^[17] and Efficient-SAM2, which proposes object-aware visual encoding and memory retrieval to accelerate inference.^[18] These derivatives appear primarily as arXiv preprints and on Hugging Face.

Applications

SAM 2's promptable, video-aware design lends itself to interactive annotation, downstream computer vision pipelines, and a variety of vertical applications.

Video annotation. The most direct use case is rapid generation of pixel-accurate mask annotations for video, either to feed downstream training data pipelines or to support semi-automated editing. The data engine results suggest that human annotators paired with SAM 2 can be about 8.4 times faster than per-frame annotation with SAM.^[7]
Interactive video editing and effects. Meta highlighted use of SAM 2 outputs as inputs to modern video generation systems for precise editing, including local color edits, object removal, and substitutions; the Meta blog frames this as a complement to text-to-video models like Sora and similar systems.^[1]
Medical imaging. Several follow-up papers evaluate SAM 2 as a zero-shot or fine-tuned segmenter on 2-D and 3-D medical images, including computed tomography (CT) of abdominal organs and optical coherence tomography (OCT) of retinal biomarkers. These works generally find that SAM 2's temporal propagation can be adapted to volumetric data by treating slices as frames, with accuracy that approaches dedicated medical models but degrades on small structures or low-contrast regions.^[19]^[20]
Robotics and autonomous systems. Because SAM 2 can track an object through partial occlusion and provide mask outputs at real-time frame rates on Tiny and Small variants, it has been used as a perception block in robotics demonstrations and as an annotation tool for autonomous driving datasets.^[4]
Surgical and biomedical video analysis. A 2025 analysis of point-based tracking failure modes in surgical videos used SAM 2 as the underlying tracker, reporting strong performance on rigid instruments and weaker performance on deformable tissue where boundary cues are ambiguous.^[21]
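The "slices as frames" idea mentioned for medical volumes can be illustrated with a helper that orders slices for propagation from a single prompted slice. The two-pass scheme (forward to the last slice, then backward to the first) and the function name are assumptions for illustration; the cited papers may order slices differently.

```python
from typing import List, Tuple


def propagation_order(num_slices: int, prompt_idx: int) -> Tuple[List[int], List[int]]:
    """Slice indices for a forward pass and a backward pass from the prompted slice."""
    if not 0 <= prompt_idx < num_slices:
        raise ValueError("prompt_idx must index an existing slice")
    forward = list(range(prompt_idx, num_slices))  # prompted slice to last slice
    backward = list(range(prompt_idx, -1, -1))     # prompted slice to first slice
    return forward, backward


if __name__ == "__main__":
    # A hypothetical 5-slice volume prompted on its middle slice.
    print(propagation_order(5, 2))
```

Each pass is then treated like a short video in which the prompted slice plays the role of the first frame.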

The Meta launch materials also list disaster response, wildlife monitoring, and digital fashion as motivating examples.^[4]

Comparison with SAM 1

The table below summarizes the principal differences between SAM and SAM 2.

Property	SAM (2023)	SAM 2 (2024)
Input domain	Images	Images and video
Image encoder	Plain ViT (B/L/H)	Hiera (T/S/B+/L)
Memory module	None	Streaming memory bank + memory attention
Occlusion head	No	Yes
Dataset	SA-1B (≈1.1B masks, 11M images)	SA-V (≈643K masklets, 51K videos) + SA-1B
Speed (image, A100)	≈22 FPS (ViT-H)^[7]	≈130 FPS (Hiera-B+)^[7]
License	Apache 2.0	Apache 2.0^[5]
Released	April 2023^[11]	29 July 2024^[1]

In short, SAM 2 keeps the prompt-driven philosophy of SAM and adds streaming memory plus a new image backbone, while expanding training data to include large-scale video.

Related work

Several research strands provide the technical context for SAM 2:

Promptable image segmentation. The original SAM established the promptable-segmentation paradigm and supplied the prompt encoder and mask decoder that SAM 2 inherits.^[11]
Video object segmentation. Memory-augmented VOS trackers such as XMem and Cutie influenced the design of SAM 2's memory bank, although SAM 2 differs in that it trains a single unified image and video model end to end instead of bolting a tracker onto a separate segmenter.^[7]
Hierarchical vision transformers. Hiera builds on a line of hierarchical ViTs that includes the Swin Transformer and MViT, while removing bespoke attention patterns to recover speed and simplicity.^[10]
Masked autoencoding. Masked autoencoder (MAE) pretraining provides the initial weights for SAM 2's image encoder and is part of a broader literature on self-supervised representation learning for vision.^[10]
Detection and instance segmentation. Earlier detection-and-segmentation pipelines such as Mask R-CNN and DETR established the closed-vocabulary baseline that SAM and SAM 2 contrast with by being class-agnostic.^[7]
Generative video models. Meta positions SAM 2 as a complementary tool for video generation systems, including Sora-style models, where high-quality masks enable localized edits.^[1]

Limitations

Despite strong benchmark results, several limitations are reported in the SAM 2 paper and in subsequent independent evaluations.

Crowded scenes and similar-looking objects. The paper notes that SAM 2 can confuse similar objects in crowded scenes, particularly across long occlusions or after fast camera motion.^[7] The SAM 2.1 update added augmentations specifically targeting visually similar objects, which mitigated but did not eliminate this issue.^[6]
Long-horizon drift. Because the memory bank is FIFO with a small N, SAM 2 can drift on very long videos when the target object's appearance changes substantially over time. Independent work studying failure modes for point-based tracking in surgical videos reports both catastrophic collapse and over-propagation phenomena, suggesting that high initialization accuracy alone is not sufficient for long-horizon tracking.^[21]
Small structures and weak boundaries. Medical-imaging evaluations report that SAM 2 struggles with small, low-contrast structures, especially in brain MRI subcortical regions, even with extensive prompting. Domain-specific fine-tuning helps but does not match dedicated medical models on all metrics.^[19]
No semantics. SAM 2 produces class-agnostic masks. It does not output object categories and cannot, by itself, distinguish between, for example, a person and a mannequin.^[7] This is a deliberate design choice by the authors, but it limits drop-in use for tasks that require labels.
Single-object scope per memory state. Each target object is tracked with its own memory, and only the per-frame image embeddings are shared between objects; multi-object tracking therefore repeats memory attention and mask decoding once per object, and that part of the compute scales linearly with the number of objects.^[5]
Latency on the largest variant. The Hiera-L checkpoint, while most accurate, runs at around 40 FPS on a single A100 and is not real time on lower-end accelerators; deployment on commodity hardware typically uses the Tiny or Small variants and accepts a modest accuracy reduction.^[9]^[14]
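The linear scaling with object count noted in the list above can be written as a simple cost model: a fixed per-frame image-encoding cost plus a per-object cost. The unit costs in the usage example are hypothetical placeholders, not measurements.

```python
def per_frame_cost(num_objects: int, encoder_cost: float, per_object_cost: float) -> float:
    """Cost of one frame: a shared encoder pass plus per-object attention and decoding."""
    if num_objects < 0:
        raise ValueError("num_objects must be non-negative")
    return encoder_cost + num_objects * per_object_cost


if __name__ == "__main__":
    # Hypothetical unit costs, chosen only to show the shape of the curve.
    for n in (1, 5, 20):
        print(n, per_frame_cost(n, encoder_cost=10.0, per_object_cost=2.0))
```

In this model the shared encoder dominates for a handful of objects, while the per-object term takes over as the object count grows.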

Reception and impact

Initial reception of SAM 2 in technical press and developer communities was largely positive, with coverage emphasizing the 6× speedup and improved accuracy on image segmentation, the 3× reduction in interactions for video annotation, and the open release of weights, code, and dataset.^[13]^[1] Several practitioners noted that SAM 2 effectively replaced the previously common pattern of "SAM plus tracker" for video pipelines.^[16]

Within the computer vision research community, SAM 2 has become a common baseline and feature extractor for video segmentation papers published since late 2024, and its checkpoints are frequently used as backbones for domain-adapted derivatives.^[17]^[18] Industrial adoption has been led by tools focused on annotation and labeling, including Roboflow's integration as a labeling assistant and Ultralytics' inclusion of SAM 2 in its SDK.^[16]^[14] Cloud availability through Amazon SageMaker JumpStart further reduced the operational cost of running SAM 2 at scale.^[12]

The SA-V dataset has been less widely adopted than SA-1B for pretraining, in part because of its size and in part because it is class-agnostic; however, it has become a standard evaluation suite for video segmentation systems, with the SA-V validation and test splits appearing alongside DAVIS, MOSE, and YouTube-VOS in subsequent papers.^[7]^[3]

References

1. Meta AI, "Introducing Meta Segment Anything Model 2 (SAM 2)", Meta AI Research, 2024-07-29. https://ai.meta.com/research/sam2/. Accessed 2026-05-20.
2. Meta AI, "SAM 2: Segment Anything in Images and Videos (research page)", Meta AI Research, 2024-07-29. https://ai.meta.com/research/publications/sam-2-segment-anything-in-images-and-videos/. Accessed 2026-05-20.
3. Meta AI, "SA-V Dataset", Meta AI Datasets, 2024-07-29. https://ai.meta.com/datasets/segment-anything-video/. Accessed 2026-05-20.
4. Meta AI / AWS, "Update: Expanding access to Meta Segment Anything 2.1 on Amazon SageMaker JumpStart", Meta AI Blog, 2025-02-12. https://ai.meta.com/blog/segment-anything-2/. Accessed 2026-05-20.
5. facebookresearch, "sam2 GitHub repository (README)", GitHub, 2024-07-29. https://github.com/facebookresearch/sam2. Accessed 2026-05-20.
6. Encord, "Meta's SAM 2.1 Explained: Improved Performance & Usability", Encord Blog, 2024-10-04. https://encord.com/blog/sam-2.1-explained/. Accessed 2026-05-20.
7. Ravi, Nikhila et al., "SAM 2: Segment Anything in Images and Videos", arXiv:2408.00714, 2024-08-01. https://arxiv.org/abs/2408.00714. Accessed 2026-05-20.
8. Meta AI, "Welcome to the SA-V Dataset (downloads)", Meta AI Datasets, 2024-07-29. https://ai.meta.com/datasets/segment-anything-video-downloads/. Accessed 2026-05-20.
9. facebookresearch, "sam2 GitHub README: model checkpoints and benchmarks", GitHub, 2024-09-30. https://github.com/facebookresearch/sam2#model-description. Accessed 2026-05-20.
10. Ryali, Chaitanya et al., "Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles", arXiv:2306.00989, 2023-06-01. https://arxiv.org/abs/2306.00989. Accessed 2026-05-20.
11. Kirillov, Alexander et al., "Segment Anything", arXiv:2304.02643, 2023-04-05. https://arxiv.org/abs/2304.02643. Accessed 2026-05-20.
12. Amazon Web Services, "Meta SAM 2.1 is now available in Amazon SageMaker JumpStart", AWS Machine Learning Blog, 2025-02-12. https://aws.amazon.com/blogs/machine-learning/meta-sam-2-1-is-now-available-in-amazon-sagemaker-jumpstart/. Accessed 2026-05-20.
13. Willison, Simon, "SAM 2: The next generation of Meta Segment Anything Model for videos and images", Simon Willison's Weblog, 2024-07-29. https://simonwillison.net/2024/Jul/29/sam-2/. Accessed 2026-05-20.
14. Ultralytics, "SAM 2: Segment Anything Model 2", Ultralytics Docs, 2024-08-15. https://docs.ultralytics.com/models/sam-2. Accessed 2026-05-20.
15. Pius, Abish, "Segment Anything (SAM) Updates Once Again to v2.1", Medium, 2024-10-05. https://medium.com/chat-gpt-now-writes-all-my-articles/segment-anything-sam-updates-once-again-to-v2-1-913b4da45080. Accessed 2026-05-20.
16. Roboflow, "Launch: Use Segment Anything 2 with Roboflow", Roboflow Blog, 2024-07-29. https://blog.roboflow.com/sam-2-roboflow/. Accessed 2026-05-20.
17. Xiong, Xinyu et al., "SAM2-UNet: Segment Anything 2 Makes Strong Encoder for Natural and Medical Image Segmentation", arXiv:2408.08870, 2024-08-16. https://arxiv.org/abs/2408.08870. Accessed 2026-05-20.
18. Anonymous, "Efficient-SAM2: Accelerating SAM2 with Object-Aware Visual Encoding and Memory Retrieval", arXiv:2602.08224, 2026-02-12. https://arxiv.org/pdf/2602.08224. Accessed 2026-05-20.
19. Sengupta, S. et al., "Using Segment Anything Model 2 for Zero-Shot 3D Segmentation of Abdominal Organs in Computed Tomography Scans", PubMed Central, 2024-09-30. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12231515/. Accessed 2026-05-20.
20. Anonymous, "Segment Anything in Optical Coherence Tomography: SAM 2 for Volumetric Segmentation of Retinal Biomarkers", PubMed Central, 2024-09-15. https://pmc.ncbi.nlm.nih.gov/articles/PMC11428920/. Accessed 2026-05-20.
21. Anonymous, "When Tracking Fails: Analyzing Failure Modes of SAM2 for Point-Based Tracking in Surgical Videos", arXiv:2510.02100, 2025-10-02. https://arxiv.org/html/2510.02100. Accessed 2026-05-20.
