SigLIP

Updated Jul 13, 2026

SigLIP (Sigmoid Loss for Language-Image Pre-training) is a family of vision-language encoders developed by researchers at Google DeepMind that pre-trains image and text encoders by treating each image-text pair as an independent binary classification problem rather than as part of a batch-wide softmax distribution. Introduced by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer in the 2023 paper "Sigmoid Loss for Language Image Pre-Training", SigLIP replaces the softmax-based contrastive loss that underlies CLIP with a per-pair sigmoid loss that, in the authors' words, "operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization."^[1] The reformulation reduces the memory cost of large-batch contrastive pre-training, performs better at small batch sizes, and produces vision encoders that have become widely used as the visual backbone in downstream vision-language models. Trained with only four TPUv4 chips, the SigLiT variant reached 84.5% ImageNet zero-shot accuracy in two days, and the paper found that a batch size of 32,000 is sufficient, challenging the prevailing assumption that ever-larger batches always help.^[1] A successor family, SigLIP 2, released by a larger Google DeepMind team led by Michael Tschannen in February 2025, extends the recipe with captioning objectives, self-distillation losses, online data curation, multilingual training, and native aspect-ratio variants.^[2]


Original authors	Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer (Google DeepMind)^[1]
First release	arXiv preprint 2023-03-27; ICCV 2023^[1]
Latest version	SigLIP 2 (arXiv 2025-02-20)^[2]
Model sizes (SigLIP 2)	ViT-B (86M), ViT-L (303M), ViT-So400m (400M), ViT-g (1B)^[2]^[3]
Training data	WebLI (10B images, 12B alt-texts, 109 languages)^[4]^[5]
Primary loss	Pairwise sigmoid binary cross-entropy with learnable temperature and bias^[1]
Reference paper (SigLIP)	arXiv:2303.15343^[1]
Reference paper (SigLIP 2)	arXiv:2502.14786^[2]
License	Apache 2.0 (model checkpoints on Hugging Face)^[3]

What problem does SigLIP solve?

Contrastive language-image pre-training, popularized by OpenAI's CLIP in 2021, learns aligned image and text representations by training paired encoders so that the embedding of an image is close to the embedding of its caption and far from the embeddings of other captions in the same training batch.^[1]^[6] The standard CLIP objective is a symmetric softmax cross-entropy over the matrix of pairwise image-text similarities. For a batch of N image-text pairs, the softmax loss requires computing similarities between every image and every text in the batch, then normalizing each row (image-to-text) and each column (text-to-image) of the similarity matrix. Because both normalizations operate over the full batch, the loss is intrinsically global: doubling the batch size doubles the number of negatives that each positive example contrasts against.^[1]

This global-normalization property motivated CLIP-style models to train at very large batch sizes (32k or higher) and motivated systems like OpenAI's CLIP, LAION's OpenCLIP, and EVA-CLIP to invest heavily in distributed contrastive training infrastructure. Two practical issues arose. First, computing and storing the full N x N pairwise similarity matrix at very large N is memory-intensive and bandwidth-intensive in distributed settings, because each device must materialize partial rows and columns from every other device. Second, the optimum batch size for contrastive learning was empirically large but unclear, leaving open whether the benefits of bigger batches were intrinsic to contrastive learning or were artifacts of the softmax formulation.^[1]^[6]

The SigLIP paper, submitted to arXiv on 2023-03-27 and accepted as an oral presentation at the IEEE International Conference on Computer Vision (ICCV) 2023 in Paris, set out to answer those questions by replacing softmax cross-entropy with a per-pair sigmoid loss.^[1]^[7] The work was led by Xiaohua Zhai, then a senior staff research scientist at Google DeepMind, and Lucas Beyer, who together with Alexander Kolesnikov and Basil Mustafa had previously developed the Vision Transformer (ViT) scaling and Locked-image Tuning (LiT) lines of work.^[1]^[7]^[18] The same team subsequently published the SigLIP 2 paper in February 2025, broadening the recipe and the author list to include Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Ye Xia, Olivier Henaff, Jeremiah Harmsen, and Andreas Steiner.^[2]

Predecessor work at Google Brain and Google DeepMind

SigLIP builds on a sequence of vision-language papers from the same research lineage. The most direct predecessor is Locked-image Tuning (LiT), published at CVPR 2022 by Zhai, Xiao Wang, Mustafa, Andreas Steiner, Daniel Keysers, Kolesnikov, and Beyer.^[18] LiT keeps a strong pretrained vision tower frozen and trains only a text encoder against it with a contrastive softmax loss, achieving a high zero-shot ImageNet accuracy with comparatively modest text-side compute.^[18] The SigLIP paper's "SigLiT" variant reuses this idea but swaps the softmax for a sigmoid loss. A second relevant predecessor is PaLI (2022), introduced by Xi Chen, Xiao Wang, Soravit Changpinyo and colleagues, which jointly trained an image encoder and a text decoder on the WebLI dataset and demonstrated competitive performance across 100+ languages.^[4] WebLI, the proprietary dataset assembled for PaLI, became the standard pretraining corpus for both SigLIP and SigLIP 2.^[2]^[4]

A complementary thread of work investigated whether captioning losses alone could rival contrastive losses for vision representation learning. CapPa (Image Captioners Are Scalable Vision Learners Too, NeurIPS 2023) by Tschannen, Manoj Kumar, Steiner, Zhai, Neil Houlsby, and Beyer trained a vision encoder paired with a text decoder, alternating autoregressive captioning (25%) and parallel captioning (75%) with all input tokens masked.^[19] CapPa matched or surpassed CLIP on a variety of downstream tasks, suggesting that captioning is a strong auxiliary signal even when the goal is a vision-only representation.^[19] LocCa (Visual Pretraining with Location-aware Captioners, NeurIPS 2024) by Bo Wan, Tschannen, and collaborators added bounding-box and grounded-captioning losses to CapPa, lifting RefCOCO performance to 88.34% on val and giving the decoder explicit spatial supervision.^[20] CapPa and LocCa are explicit components of the SigLIP 2 recipe described below.^[2]

How does the sigmoid loss work?

The central technical idea of SigLIP is to treat each entry of the $N \times N$ image-text similarity matrix as a binary classification problem rather than as a row or column of a softmax distribution. For a batch of $N$ image-text pairs (with images $x_i$ and texts $y_j$), the model computes cosine similarities $s_{ij}$ between image embedding $f(x_i)$ and text embedding $g(y_j)$, then applies a sigmoid function $\sigma(u) = \frac{1}{1 + \exp(-u)}$ to a scaled and shifted version of those similarities. The label is +1 when $i = j$ (a matched pair) and -1 otherwise, and the loss is the negative log-likelihood of the sigmoid prediction, summed over every cell and divided by $N$:^[1]

$$L = -\frac{1}{N} \sum_{i,j} \log\left( \sigma\left( z_{ij} (t s_{ij} + b) \right) \right)$$

where $t$ is a learnable temperature, $b$ is a learnable bias, and $z_{ij} = +1$ for matched pairs and -1 for unmatched pairs.^[1] The temperature parameter behaves analogously to the temperature in CLIP, and the paper parameterizes it as $\exp(t')$ where $t'$ is a learnable scalar.^[1] The bias term is new relative to CLIP and is initialized to a large negative value (the authors use $b = -10$). The motivation is class imbalance: in a batch of size $N$ there are $N$ positive pairs but $N(N-1)$ negative pairs, so without the bias the initial loss and gradient signal would be swamped by negatives.^[1]^[21] The negative initialization shifts initial predictions toward 0 (no match) for every pair, so the negatives start out nearly correct and the model can accumulate evidence for the diagonal positives over training before being pushed to discriminate negatives.^[1]
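The loss above can be written out directly. The following is a minimal sketch in plain Python, not the authors' implementation: it takes a precomputed cosine-similarity matrix, uses $b = -10$ as stated above, and assumes an initial $t' = \log 10$ (so $t = 10$). The toy similarity values are hypothetical.

```python
import math

def log_sigmoid(u: float) -> float:
    """Numerically stable log(sigmoid(u))."""
    # log sigma(u) = -log(1 + exp(-u)); branch on the sign to avoid overflow.
    if u >= 0:
        return -math.log1p(math.exp(-u))
    return u - math.log1p(math.exp(u))

def sigmoid_loss(sim, t_prime=math.log(10.0), b=-10.0):
    """Pairwise sigmoid loss over an N x N cosine-similarity matrix `sim`."""
    n = len(sim)
    t = math.exp(t_prime)  # temperature is parameterized as exp(t')
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = 1.0 if i == j else -1.0  # +1 on the diagonal, -1 elsewhere
            total += log_sigmoid(z * (t * sim[i][j] + b))
    return -total / n  # summed over all N*N cells, divided by N

# Hypothetical toy similarities: matched pairs score high, mismatched pairs low.
sim = [[0.90, 0.10, 0.00],
       [0.20, 0.80, 0.10],
       [0.00, 0.10, 0.95]]
loss = sigmoid_loss(sim)
```

With these initial values every off-diagonal cell starts with a strongly negative logit, so the negatives contribute almost nothing to the loss and the diagonal terms dominate, which is the effect of the bias initialization described above.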

Mathematical contrast with the softmax contrastive loss

CLIP uses the symmetric InfoNCE loss applied to logits $L_{ij} = t s_{ij}$:

$$L_{\text{softmax}} = -\frac{1}{2N} \sum_i \left[ \log\left( \frac{\exp(L_{ii})}{\sum_j \exp(L_{ij})} \right) + \log\left( \frac{\exp(L_{ii})}{\sum_j \exp(L_{ji})} \right) \right]$$

Computing the denominator for any row of the matrix requires the similarities of $x_i$ to every text in the batch, and computing it for any column requires the similarities of $y_j$ to every image in the batch. In a distributed setting both rows and columns are split across devices, so every device must gather partial similarities from every other device. The cost of this all-gather and the memory needed to materialize the full $N \times N$ matrix grow as $O(N^2)$, so batch size is bounded by the per-device memory available for the similarity matrix as well as by the model parameters and activations.^[1]^[22]
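For comparison, the symmetric softmax loss can be sketched the same way. This is an illustrative plain-Python version operating on a precomputed similarity matrix; the fixed temperature `t=10.0` is a placeholder (in CLIP it is learned). Each term needs a full row or a full column of logits, which is where the global dependence comes from.

```python
import math

def log_softmax_at(logits, index):
    """log of softmax(logits)[index], using the log-sum-exp trick."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return logits[index] - lse

def softmax_contrastive_loss(sim, t=10.0):
    """Symmetric softmax (InfoNCE) loss over an N x N similarity matrix."""
    n = len(sim)
    logits = [[t * s for s in row] for row in sim]
    total = 0.0
    for i in range(n):
        row = logits[i]                          # image i against every text
        col = [logits[j][i] for j in range(n)]   # text i against every image
        total += log_softmax_at(row, i) + log_softmax_at(col, i)
    return -total / (2 * n)
```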

The sigmoid loss makes each cell $(i, j)$ contribute independently. The gradient of the binary cross-entropy with respect to its logit $u$ is $\sigma(u) - y$, where $y \in \{0, 1\}$ is the binary target (1 for a matched pair, 0 otherwise), so per-cell updates can be computed locally on whichever device holds $(x_i, y_j)$ without global communication.^[1]^[21] In the paper the authors show that this enables a "chunked" implementation in which sub-batches of text features are exchanged between devices in a ring, the sigmoid loss is accumulated piece by piece, and per-device memory remains bounded by the local chunk size $b^2$ rather than the global batch size $N^2$, where $b$ here denotes the per-device batch size rather than the bias term.^[1]^[22] On four TPUv4 chips, the chunked implementation fit a global batch of 4,096 image-text pairs for a SigLIP-Base model, whereas a comparable CLIP-Base model could only reach a global batch of 2,048 under identical hardware and architecture.^[22]
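The chunked accumulation can be illustrated with a single-process simulation. The sketch below is not the paper's distributed code: "devices" are just slices of a Python list, the ring exchange is modeled by an index rotation, and the function names and the values `t=10.0`, `bias=-10.0` are assumptions for illustration. It shows that summing per-block losses reproduces the full-batch loss while only one block of logits exists at a time.

```python
import math

def log_sigmoid(u):
    return -math.log1p(math.exp(-u)) if u >= 0 else u - math.log1p(math.exp(u))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def block_loss(imgs, txts, img_offset, txt_offset, t, bias):
    """Sum of -log sigmoid terms for one (image chunk, text chunk) block."""
    total = 0.0
    for i, x in enumerate(imgs):
        for j, y in enumerate(txts):
            # A pair is positive only if its global indices coincide.
            z = 1.0 if img_offset + i == txt_offset + j else -1.0
            total -= log_sigmoid(z * (t * dot(x, y) + bias))
    return total

def chunked_sigmoid_loss(img_emb, txt_emb, num_devices, t=10.0, bias=-10.0):
    """Simulated ring accumulation; assumes len(img_emb) % num_devices == 0."""
    n = len(img_emb)
    per_dev = n // num_devices  # local batch size on each simulated device
    total = 0.0
    for dev in range(num_devices):  # each device keeps its own image chunk
        imgs = img_emb[dev * per_dev:(dev + 1) * per_dev]
        for step in range(num_devices):  # text chunks arrive one hop at a time
            src = (dev + step) % num_devices
            txts = txt_emb[src * per_dev:(src + 1) * per_dev]
            # Only a per_dev x per_dev block of logits is materialized here.
            total += block_loss(imgs, txts, dev * per_dev, src * per_dev, t, bias)
    return total / n
```

Calling the function with `num_devices=1` gives the ordinary full-matrix loss, so comparing it against larger device counts is a quick consistency check.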

What batch size does SigLIP need?

A second consequence of the sigmoid formulation is that it is well-defined for any batch size, including batch size 1, whereas the softmax loss is degenerate at very small batch sizes (with a single pair, the positive contrasts only against itself and the loss is identically zero). The SigLIP authors exploited this property to perform a careful study of batch-size dependence, scanning from very small batches up to one million examples.^[1]^[21] Their empirical finding was that both losses improve as batch size grows but that gains saturate near 32k pairs, and that very large batches (beyond about 256k) can hurt rather than help.^[1]^[21] At small batch sizes (below 16k) the sigmoid loss outperforms softmax by a clear margin, which is the regime relevant for academic and small-industry training runs.^[1]^[8] Specifically, the paper reports that the sigmoid loss reaches peak ImageNet zero-shot accuracy at 32k batch size, while the softmax loss only catches up at a batch size of 98k and still trails the sigmoid result.^[1]^[22] The same 32k saturation point holds for multilingual training on over 100 languages.^[1]

Robustness, noise, and theoretical analysis

The sigmoid loss has also been studied for its robustness properties. Per-pair independence makes the loss less sensitive to mislabeled pairs in noisy web data: a single false positive contributes only to its own cell rather than affecting normalization in every other row and column.^[21] Subsequent theoretical work has examined the global minimizers of sigmoid contrastive loss and the geometric properties of the learned embedding space, formalizing the conditions under which sigmoid and softmax contrastive losses yield equivalent representations and the regimes in which they diverge.^[23]

SigLiT: Locked-image Tuning with sigmoid loss

SigLiT is the variant in which the image encoder is initialized from a pretrained ViT classifier (typically trained on JFT or ImageNet) and frozen, and only the text encoder is trained from scratch using the sigmoid loss.^[1]^[18] This builds directly on the earlier Locked-image Tuning (LiT) recipe by the same authors, in which the pretrained vision tower acts as a fixed "teacher" feature extractor and the text tower learns to align to its representation space.^[18] The SigLIP paper reports that a SigLiT model reusing a g/14 vision checkpoint, trained on four TPUv4 chips for two days, reaches 84.5% zero-shot top-1 accuracy on ImageNet, demonstrating that the sigmoid formulation enables strong contrastive models without large-scale compute.^[1]^[8] A smaller Base/8 + L/16 SigLiT configuration reaches 79.7% on the same setup in one day.^[1]

Full SigLIP training

In the full (non-locked) setting both encoders are trained jointly. SigLIP models are typically based on Vision Transformer backbones at Base, Large, and shape-optimized 400M ("So400m") scales. The shape-optimized 400M architecture comes from a separate Google paper, "Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design" (arXiv:2305.13035) by Alabdulmohsin, Zhai, Kolesnikov, and Beyer, which derived the SoViT-400M shape (width 1152, depth 27, MLP dimension 4304) by jointly tuning depth, width, and MLP dimension under compute-optimal scaling laws.^[9] SoViT-400M/14 was reported to reach 90.3% fine-tuning accuracy on ImageNet, surpassing the much larger ViT-g/14 and approaching ViT-G/14 at less than half the inference cost.^[9] The SigLIP-So400m-patch14-384 checkpoint, the most widely deployed variant of the original SigLIP, was trained on 16 TPUv4 chips for three days and processes 384x384 images as 729 patch tokens of size 14x14 with mean and standard deviation (0.5, 0.5, 0.5) for image normalization.^[3] Text inputs are tokenized to a maximum sequence length of 64 tokens.^[3]

What data is SigLIP trained on?

Both SigLIP and SigLIP 2 were trained on WebLI, a Google-internal web-scale image-text dataset introduced in the 2022 PaLI paper by Chen et al.^[4] WebLI contains approximately 10 billion images paired with 12 billion alt-texts spanning 109 languages, drawn from public web pages and filtered for safety, deduplication, and basic quality criteria.^[4]^[5] The original SigLIP was trained primarily on the English subset of WebLI, while SigLIP 2 used a more diverse 90% English plus 10% non-English mixture, which underpins its improved multilingual performance.^[2]^[5] WebLI's image-text pairs are filtered using prior image-text alignment models, and later iterations include both noisy alt-text captions and machine-generated captions.^[4]^[5] SigLIP 2 also adopts the multilingual Gemma tokenizer (vocabulary size 256k), which gives the text encoder full coverage of the language mixture in WebLI rather than the English-only tokenizer used by SigLIP 1.^[5]

The non-public nature of WebLI is one of the principal differences between SigLIP and open replications such as OpenCLIP, which train on the public LAION-2B and LAION-5B datasets. Researchers cannot exactly replicate Google's results without access to WebLI, although the SigLIP model checkpoints themselves are released under permissive licenses on Hugging Face.^[3]^[15]

What is SigLIP 2?

Released on 2025-02-20 as arXiv:2502.14786 and titled "SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features," SigLIP 2 is described by its authors as "a family of new multilingual vision-language encoders that build on the success of the original SigLIP."^[2] The new recipe combines the original sigmoid contrastive objective with several previously independent training signals: "captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation."^[2] The authors report that "SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities." The release also adds multilingual training data and two architectural variants (fixed resolution and NaFlex) for each model size.^[2]^[10]^[24]

Training recipe

SigLIP 2 trains the standard image and text encoders with the original sigmoid loss, augmented by three additional components:^[2]^[10]

Decoder objectives. Early in training, a lightweight transformer decoder is attached to the visual encoder and trained to predict (a) a holistic caption for the image, (b) bounding-box coordinates for image regions, and (c) region-specific captions conditioned on bounding boxes. These auxiliary tasks build directly on the CapPa and LocCa lines of work and inject location and language grounding into the visual representation.^[2]^[10]^[19]^[20]
Self-distillation losses. After 80% of training, two self-distillation signals are added: a global-local loss in which a student network must match a teacher (an exponential moving average of itself) on partial image views, and a masked-prediction loss in which 50% of image patches are masked and the student must reconstruct the teacher's features at the masked positions. These are derived from the DINO and DINOv2 line of self-supervised work and target dense, local features.^[2]^[10]^[25]
Online data curation and de-biasing. The training data mixture incorporates online filtering (continuously rescoring and reweighting training pairs based on current model predictions) and explicit de-biasing techniques applied to socially sensitive concepts. The authors report that this curation is especially valuable for smaller models, which benefit disproportionately from distilled high-quality training samples.^[2]^[24]
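Two mechanical pieces of the self-distillation stage described above, the exponential-moving-average teacher and the 50% patch mask, can be sketched in a few lines. This is an illustration only: parameters are flat lists of floats, the decay value `0.999` and the function names are hypothetical, and the actual losses computed on student and teacher features are omitted.

```python
import random

def ema_update(teacher, student, decay=0.999):
    """Move each teacher parameter a small step toward the student's value."""
    return [decay * tp + (1.0 - decay) * sp for tp, sp in zip(teacher, student)]

def sample_patch_mask(num_patches, mask_ratio=0.5, seed=0):
    """Return a list of booleans, True where a patch is masked for the student."""
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]

teacher = ema_update(teacher=[1.0, 2.0], student=[0.0, 4.0], decay=0.9)
mask = sample_patch_mask(num_patches=256)  # 128 of 256 patches are masked
```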

The combined recipe is run on up to 2,048 TPUv5e chips with a fully-sharded data-parallel strategy at a global batch size of 32,000 for a total of 40 billion examples seen.^[11] The optimizer is AdamW, the global learning rate uses the standard linear warmup followed by cosine decay, and the training pipeline is implemented in the open-source big_vision codebase that Google Research maintains in JAX.^[5]^[11]
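The schedule shape and the implied step count can be made concrete. In the sketch below only the batch size and the number of examples seen come from the text above; the warmup length and peak learning rate are hypothetical placeholders, and the decay is assumed to end at zero.

```python
import math

def learning_rate(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup from 0 to peak_lr, then cosine decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Step count implied by 40 billion examples at a global batch size of 32,000.
total_steps = 40_000_000_000 // 32_000  # 1,250,000 optimizer steps

# Hypothetical hyperparameters, chosen only to exercise the function.
warmup_steps = 20_000
peak_lr = 1e-3
```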

Model variants

SigLIP 2 releases checkpoints at four sizes (ViT-B with 86 million parameters, ViT-L with 303 million, ViT-So400m with 400 million, and ViT-g with 1 billion) and at multiple input resolutions (commonly 224, 256, 384, and 512 pixels).^[2]^[10] For each of Base, Large, and So400m, two architectural families are released:^[10]

Fixed-resolution variants resize input images to a fixed square resolution and use standard ViT positional embeddings, matching the conventions of SigLIP 1. Resolution-specific checkpoints are produced by resizing the positional and patch embeddings at the 95% training mark and continuing training for the remaining steps.^[10]
NaFlex variants support multiple predefined sequence lengths in a single model and preserve the native aspect ratio of the input image. NaFlex combines techniques from FlexiViT (Beyer et al., CVPR 2023), which allows a single ViT to operate at varying patch grid sizes, and NaViT (Dehghani et al., NeurIPS 2023), which packs images of different aspect ratios into a single sequence with masked attention and masked pooling.^[10]^[26]^[27] Given a target sequence length and a patch size, the NaFlex preprocessor resizes the image so that height and width are integer multiples of the patch size, the aspect ratio is preserved as closely as possible, and the resulting sequence length is at most the target. NaFlex variants are particularly suited to OCR and document understanding, where aspect-ratio distortion harms text legibility.^[10]^[24]
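The resizing constraint for NaFlex can be sketched as a small function. This is one simple way to satisfy the three stated conditions (dimensions divisible by the patch size, aspect ratio roughly preserved, token count within the target); it is not the SigLIP 2 preprocessor, and the function name and rounding strategy are assumptions.

```python
import math

def naflex_target_size(height, width, patch_size, max_tokens):
    """Pick a resized (height, width) for a NaFlex-style preprocessor (sketch)."""
    # Uniform scale at which the patch grid would hold exactly max_tokens patches.
    scale = math.sqrt(max_tokens * patch_size ** 2 / (height * width))
    # Rounding down keeps rows * cols at or below max_tokens.
    rows = max(1, math.floor(height * scale / patch_size))
    cols = max(1, math.floor(width * scale / patch_size))
    # Guard for extreme aspect ratios, where the max(1, ...) above could overshoot.
    rows = min(rows, max_tokens)
    cols = min(cols, max_tokens // rows)
    return rows * patch_size, cols * patch_size

# A 480x640 image with 16-pixel patches and a 256-token budget.
new_h, new_w = naflex_target_size(480, 640, patch_size=16, max_tokens=256)
```

Rounding both sides down is the simplest choice that never exceeds the budget; a production preprocessor would likely search nearby grid shapes to track the aspect ratio more tightly.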

The full lineup is therefore: Base/32 at 256, Base/16 at 224, 256, 384, and 512, Large/16 at 256, 384, and 512, SO400M/14 at 224 and 384, SO400M/16 at 256, 384, and 512, and giant/16 at 256 and 384, each in fixed and (where applicable) NaFlex form.^[10]

How does SigLIP 2 compare to SigLIP 1?

The SigLIP 2 paper reports improvements over the original SigLIP across nearly all evaluation categories. Selected ImageNet-1k zero-shot accuracy numbers from the paper include:^[11]

Setting	SigLIP	SigLIP 2
ImageNet-1k zero-shot, ViT-B/16 at 256	76.7%	79.1%
ImageNet-1k zero-shot, ViT-L/16 at 256	n/a	82.5%
ImageNet-1k zero-shot, ViT-So400m/14 at 384	n/a	84.1%
ImageNet-1k zero-shot, ViT-g/16 at 256	n/a	84.5%
ImageNet-v2 zero-shot, ViT-L/16 at 256	74.2%	76.8%
COCO text-to-image Recall@1, ViT-L/16 at 256	81.3%	84.1%
PASCAL semantic segmentation mIoU, ViT-So/14 at 384	73.8	78.1
NYUv2 depth estimation RMSE, ViT-So/14 (lower is better)	0.563	0.466
Representation bias (lower is better), ViT-L/16 at 256	35.5%	7.3%

On the Crossmodal-3600 (XM3600) multilingual retrieval benchmark, which covers 36 languages, SigLIP 2 ViT-B/16 at 256 pixels improves recall from 22.5% under SigLIP to 40.7%, approaching the performance of the multilingual mSigLIP variant despite an English-heavy 90/10 English/multilingual mixture.^[11]^[28] At ViT-L/16 and 256 pixels, the XM3600 recall@1 reaches 46.5%, and the gap to mSigLIP closes further at larger scales.^[11] The paper attributes the localization and dense-prediction improvements primarily to the decoder and self-distillation objectives, and the multilingual and fairness improvements primarily to the data mixture and de-biasing.^[2]^[11]

Active distillation and small-model performance

A second design choice that the paper emphasizes is active distillation through online data curation. During training, training pairs are scored by an EMA copy of the model and re-weighted so that smaller models effectively learn from a curriculum of progressively more informative pairs.^[2]^[24] The authors report that this benefits small variants disproportionately: the Base SigLIP 2 model closes a noticeable fraction of the gap to Large and So400m on ImageNet zero-shot and on retrieval benchmarks, which previous CLIP-style training had not achieved.^[2]^[24]

Backward compatibility and tokenization

SigLIP 2 uses the same dual-encoder architecture as SigLIP 1, the same patch sizes, and the same JAX/big_vision training stack, so downstream code that consumed SigLIP-So400m checkpoints can swap in a SigLIP 2-So400m checkpoint without re-engineering the pipeline.^[10]^[15] The main upgrade visible to downstream users is the multilingual Gemma tokenizer (256k vocabulary), which replaces the English-only SentencePiece tokenizer of SigLIP 1 and allows the text encoder to handle non-Latin scripts and multi-byte tokens without re-tuning.^[5]^[10]

What is SigLIP used for as a vision encoder?

SigLIP has been adopted as the default visual encoder in a growing list of open and proprietary vision-language models, often replacing OpenAI's CLIP image tower. The shape-optimized SigLIP-So400m checkpoint is particularly widely used because its 400 million parameters strike a favorable balance between representation quality and inference cost.^[3]^[15]

PaliGemma and PaliGemma 2

PaliGemma, released by Google on 2024-05-14 and described in arXiv:2407.07726 (July 2024), is a 3-billion-parameter vision-language model that pairs a SigLIP-So400m vision encoder with a Gemma 2B language model.^[12]^[29] Images are encoded into patch tokens by SigLIP at 224x224, 448x448, or 896x896, projected by a linear adapter from the SigLIP output dimension of 1152 to the Gemma embedding dimension of 2048, and concatenated with text tokens as input to the Gemma decoder.^[12]^[29] At 224 resolution the model produces 256 image tokens; at 448, 1,024 tokens; at 896, 4,096 tokens. PaliGemma was evaluated on roughly 40 transfer tasks, including standard image captioning and visual question answering benchmarks as well as more specialized domains like remote sensing and segmentation, and serves as a base model for transfer learning rather than as an end-user chatbot.^[12]^[29] PaliGemma 2, released later in 2024, upgrades the decoder to the Gemma 2 family (with 2B, 9B, and 27B variants) while continuing to use SigLIP-family vision encoders.^[13]^[30]
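The token counts quoted above follow from the patch grid. The short sketch below reproduces them under the assumption of a 14-pixel patch (as in the So400m/14 checkpoint); the helper names are hypothetical, and the adapter is counted as a weight matrix only, since the text does not say whether it has a bias.

```python
SIGLIP_DIM = 1152   # SigLIP output dimension
GEMMA_DIM = 2048    # Gemma embedding dimension

def image_tokens(resolution: int, patch_size: int = 14) -> int:
    """Number of patch tokens for a square input of the given resolution."""
    side = resolution // patch_size
    return side * side

def decoder_input_length(resolution: int, num_text_tokens: int) -> int:
    """Image tokens are concatenated with the text tokens before decoding."""
    return image_tokens(resolution) + num_text_tokens

tokens_by_resolution = {r: image_tokens(r) for r in (224, 448, 896)}
adapter_weights = SIGLIP_DIM * GEMMA_DIM  # entries in the linear projection matrix
```

Each doubling of resolution quadruples the number of image tokens the decoder must attend over, which is the main cost of the higher-resolution checkpoints.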

Gemma 3 and the Pan and Scan algorithm

The vision-capable Gemma 3 models (March 2025) use a 400M variant of the SigLIP encoder operating at a fixed resolution of 896x896 and producing a sequence of 256 dense "soft tokens" per image.^[31] To handle high-resolution and non-square images without degrading performance, Gemma 3 adds a Pan and Scan algorithm at inference time: the image is segmented into non-overlapping crops of equal size that together cover the image, and each crop is resized to 896x896 and passed through the encoder separately.^[31] The Pan and Scan algorithm specifically helps with tasks involving non-square aspect ratios, high-resolution photos, and embedded text reading, addressing a limitation of fixed-resolution SigLIP.^[31]
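The cropping step can be sketched as follows. This is a simplified illustration of splitting an image into non-overlapping crops along its longer side, not Gemma 3's actual Pan and Scan implementation, which may apply additional conditions not described here. The function name, the `max_crops` cap, and the choice of crop count from the rounded aspect ratio are all assumptions.

```python
def pan_and_scan_boxes(width, height, max_crops=4):
    """Return (left, top, right, bottom) boxes that tile the image without overlap."""
    long_side, short_side = max(width, height), min(width, height)
    # Assumed heuristic: one crop per unit of aspect ratio, capped at max_crops.
    n = min(max_crops, max(1, round(long_side / short_side)))
    # Cut points along the long side; crops are equal up to one pixel of rounding.
    edges = [round(k * long_side / n) for k in range(n + 1)]
    if width >= height:
        return [(edges[k], 0, edges[k + 1], height) for k in range(n)]
    return [(0, edges[k], width, edges[k + 1]) for k in range(n)]

# Each box would then be cropped and resized to 896x896 before encoding.
wide = pan_and_scan_boxes(1792, 896)
tall = pan_and_scan_boxes(896, 2688)
```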

LLaVA-OneVision and other LLaVA derivatives

In the open-source community, the LLaVA series of vision-language models migrated from OpenAI CLIP to SigLIP image encoders as their default. LLaVA-OneVision, an August 2024 release in the LLaVA family, uses a SigLIP-So400M encoder paired with a Qwen2 language backbone, and several research VLMs (LLaVA-MORE, LLaVA-NeXT variants, and others) similarly adopt SigLIP-So400m or SigLIP 2 checkpoints as their vision tower.^[14] The widespread switch is generally attributed to SigLIP's higher zero-shot quality and to the convenience of the 400M shape-optimized scale, which is small enough to be tractable inside a multimodal pipeline while strong enough to set the state of the art on transfer benchmarks.^[14]

Idefics2 and Idefics3

The Hugging Face team's Idefics2 (released 2024-05-03) and Idefics3 (released 2024-08-22) open vision-language models both use SigLIP-So400m-patch14-384 as the image tower.^[32]^[33] Idefics2 pairs the encoder with Mistral 7B and supports two training stages: a first stage at SigLIP's native 384x384 and a second stage at native aspect ratios up to 980 pixels on the long side.^[32] Idefics3 swaps the language backbone to Llama 3.1 Instruct and adds a pixel-shuffle strategy that reduces visual tokens to 169 while supporting input resolutions up to roughly 364x364 per tile.^[33] Both models report Apache 2.0 licenses and are widely used as research baselines for instruction-following vision-language models.^[32]^[33]

Qwen-VL and other Asian-origin models

The Qwen-VL family from Alibaba initially used CLIP-style encoders but later versions (Qwen2.5-VL and later) shifted toward dynamic-resolution Native-ViT designs influenced by NaViT and SigLIP's NaFlex.^[34] While Qwen2-VL itself uses a 675M Vision Transformer trained from scratch rather than a SigLIP checkpoint, the design space in which it operates was shaped by SigLIP's results on smaller batch sizes and the NaFlex/NaViT aspect-ratio handling.^[34]

SmolVLM

The SmolVLM family from Hugging Face explicitly targets edge devices and laptops, with 256M and 500M variants (released January 2025) that fit in under 1 GB of RAM.^[35] Both SmolVLM-256M and SmolVLM-500M use a reduced 93M-parameter SigLIP vision encoder paired with a SmolLM2 text decoder, whereas the larger original SmolVLM uses the full SigLIP-So400M vision tower.^[35] Training data is drawn from The Cauldron and Docmatix, weighted toward document understanding and image captioning, demonstrating that SigLIP-based VLMs can be compressed into small footprints suitable for on-device use.^[35]

MedSigLIP and medical applications

MedSigLIP (released 2025-07-09) is a domain-adapted SigLIP variant from Google Health AI Developer Foundations, consisting of a 400M-parameter vision encoder and 400M-parameter text encoder trained at 448x448 resolution.^[36] The training data combines de-identified medical images (chest X-rays, dermatology, ophthalmology, histopathology slides, CT and MRI slices) with natural images and paired text reports, so that the model retains general visual capabilities while gaining domain-specific knowledge.^[36] MedSigLIP is distributed through Google Cloud Model Garden and Hugging Face under research-friendly licenses and is recommended for data-efficient classification, zero-shot classification, and semantic image retrieval in medical workflows.^[36]

Vertex AI multimodal embeddings

Google's Vertex AI multimodal embeddings API exposes a managed multimodal embedding model (the public multimodalembedding@001 endpoint, producing 1408-dimension vectors) that supports image, video, and text inputs in a shared embedding space.^[37] SigLIP-family encoders are also distributed through Vertex AI Model Garden as part of the SigLIP 2, MedSigLIP, and PaliGemma releases, making them deployable as managed endpoints for enterprise customers in addition to the open-weight Hugging Face downloads.^[37]

Hugging Face Transformers, timm, and OpenCLIP

SigLIP and SigLIP 2 vision encoders are integrated into the Hugging Face Transformers library (under the SiglipModel and Siglip2Model classes) and into ecosystems such as timm and OpenCLIP weight conversions, making them straightforward to drop into new vision-language pipelines.^[10]^[15] OpenCLIP includes a SigLIPTask subclass for training SigLIP-style models, and timm hosts converted PyTorch checkpoints under timm/ViT-B-16-SigLIP, timm/ViT-L-16-SigLIP-384, and related identifiers.^[38] Image classification and zero-shot retrieval applications using SigLIP have been packaged in Hugging Face Transformers pipelines under the zero-shot-image-classification task. Diffusion-based image generation systems and image-conditioned editing pipelines have experimented with SigLIP image embeddings as an alternative to CLIP image embeddings, though Stable Diffusion 3 and most other major diffusion models continue to rely on CLIP or T5 text encoders rather than SigLIP for text conditioning.^[15]

How does SigLIP differ from CLIP?

The most direct comparison for SigLIP is OpenAI's CLIP, from which it differs principally in the choice of loss function. Other contrastive learning vision-language models in the same lineage include OpenCLIP (an open replication of CLIP trained on LAION), EVA-CLIP (a series of scaled CLIP models from BAAI), and DFN (Data Filtering Networks). The following table summarizes some of the distinguishing characteristics:

Aspect	CLIP^[16]	SigLIP^[1]	SigLIP 2^[2]
Year	2021	2023	2025
Authors	OpenAI	Google DeepMind	Google DeepMind
Contrastive loss	Symmetric softmax cross-entropy	Pairwise sigmoid with learnable bias	Sigmoid + decoder + self-distillation + curation
Loss requires global batch normalization	Yes	No	No
Reported optimal batch size	Large (32k+ in practice)	Saturates around 32k; competitive at 8k	32k (paper standard)
Native multilingual training	No (English)	Limited (English-dominant WebLI)	Yes (90/10 English/non-English)
Native aspect-ratio variants	No	No	Yes (NaFlex)
Public training data	Closed (~400M pairs)	Closed (WebLI)	Closed (WebLI)
Largest released vision tower	ViT-L/14 (~300M params)	ViT-So400m, ViT-L	ViT-g (1B)
Memory per device (batch N)	$O(N^2)$	$O(b^2)$ with chunking	$O(b^2)$ with chunking
Tokenizer	English-only (49k)	English-only (32k)	Multilingual Gemma (256k)
Typical default in 2025 VLMs	Legacy	Common (So400m)	Increasingly common

The relationship to OpenCLIP is similar to that with CLIP: OpenCLIP uses softmax contrastive loss and public data, while SigLIP uses sigmoid loss and proprietary data.^[16] EVA-CLIP scales CLIP-style training to billions of parameters with curated improvements (masked-image-modeling initialization, optimization tricks) but retains the softmax loss.^[17] DFN (Data Filtering Networks) is another contemporary contrastive vision-language line that uses CLIP-style loss but emphasizes filtered training data; some DFN checkpoints have been compared head-to-head against SigLIP in vision-language model benchmarks.^[14]

Comparison of loss formulations

The most concise way to see the difference between CLIP and SigLIP is to write both losses on the same batch of logits $L_{ij} = t s_{ij}$ (or $t s_{ij} + b$ for SigLIP):

CLIP softmax (image-to-text direction only, row $i$): $-\log\left(\frac{\exp(L_{ii})}{\sum_j \exp(L_{ij})}\right)$.

SigLIP sigmoid (cell $(i, j)$): $-\log \sigma(z_{ij} L_{ij})$, with $z_{ij} = +1$ when $i = j$ (matched pair) and $z_{ij} = -1$ when $i \ne j$ (non-matching pair).

In the softmax, the numerator depends only on the positive pair, but the denominator couples every logit in row $i$ through $\sum_j \exp(L_{ij})$. In SigLIP, the loss term for cell $(i, j)$ depends only on $L_{ij}$, so its gradient with respect to the logits involves no other cell.^[1]^[21] This independence is what makes the chunked all-gather implementation and the small-batch behavior possible.^[22]
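
The difference can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the 3x3 logit matrix is made up, the function names are hypothetical, and the serial block loop only stands in for per-device chunks (no real all-gather or device communication is modeled). It perturbs one off-diagonal logit and shows that the softmax row loss changes while the sigmoid term for the diagonal cell does not, and that summing the sigmoid loss block by block gives the same total.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def softmax_row_loss(logits, i):
    """CLIP image-to-text term for row i: -log softmax(L_i)[i]."""
    row = logits[i]
    m = max(row)
    log_denominator = m + math.log(sum(math.exp(v - m) for v in row))
    return log_denominator - row[i]


def sigmoid_cell_loss(logits, i, j):
    """SigLIP term for cell (i, j): -log sigmoid(z_ij * L_ij)."""
    z = 1.0 if i == j else -1.0
    return -log_sigmoid(z * logits[i][j])


def sigmoid_loss(logits):
    """Sum of all cell terms, divided by the batch size."""
    n = len(logits)
    return sum(sigmoid_cell_loss(logits, i, j)
               for i in range(n) for j in range(n)) / n


def chunked_sigmoid_loss(logits, chunk):
    """Same loss, accumulated one (chunk x chunk) block at a time."""
    n = len(logits)
    total = 0.0
    for r in range(0, n, chunk):
        for c in range(0, n, chunk):
            for i in range(r, min(r + chunk, n)):
                for j in range(c, min(c + chunk, n)):
                    total += sigmoid_cell_loss(logits, i, j)
    return total / n


# Made-up logits L_ij = t * s_ij + b for a batch of three pairs.
logits = [[2.0, -1.0, 0.5], [0.0, 3.0, -2.0], [-1.5, 0.2, 1.0]]
perturbed = [row[:] for row in logits]
perturbed[0][2] += 4.0  # change a single negative in row 0

softmax_changed = softmax_row_loss(perturbed, 0) != softmax_row_loss(logits, 0)
sigmoid_cell_unchanged = (sigmoid_cell_loss(perturbed, 0, 0)
                          == sigmoid_cell_loss(logits, 0, 0))
chunks_agree = math.isclose(chunked_sigmoid_loss(logits, 2), sigmoid_loss(logits))
print(softmax_changed, sigmoid_cell_unchanged, chunks_agree)
```

Because every cell contributes an independent term, the order and grouping of the blocks does not matter, which is the property a distributed implementation relies on.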

Why is SigLIP significant?

SigLIP's significance has both methodological and practical components.

Methodologically, it demonstrated that softmax normalization is not intrinsic to contrastive learning, and that a per-pair sigmoid loss can match or exceed softmax performance while removing the global-batch synchronization constraint. The result reshaped how researchers think about the relationship between batch size and contrastive learning: where the prevailing wisdom from CLIP was that more negatives are always better, SigLIP showed that this benefit saturates around 32k and can reverse at extreme batch sizes.^[1]^[8] The careful batch-size scan in the paper has become a frequently cited reference in subsequent contrastive learning work, and follow-up papers have analyzed the embedding geometry and theoretical properties of sigmoid contrastive losses.^[23]

Practically, SigLIP and SigLIP 2 produced vision encoders that, together with their permissive open-weight release on Hugging Face, became the default image tower for a large fraction of modern open vision-language models. PaliGemma, PaliGemma 2, Gemma 3 VL, LLaVA-OneVision, Idefics2, Idefics3, SmolVLM, and many other VLMs use SigLIP-family checkpoints, and SigLIP 2's NaFlex variants address a long-standing pain point in document and OCR pipelines by removing forced square cropping.^[10]^[12]^[14]^[31]^[32]^[33]^[35] The 400M shape-optimized "So400m" backbone in particular has become a kind of industry standard for the vision side of small and mid-scale VLMs, partly because of SigLIP's reported numbers and partly because Gemma and PaliGemma made the checkpoint extremely convenient to consume.^[12]^[29]

The SigLIP 2 paper also surfaced fairness and multilingual concerns that the original SigLIP did not address explicitly. The reported reduction of representation bias from 35.5% to 7.3% (on the L/16 model at 256 sequence length) and the improved performance on Crossmodal-3600 across 36 languages are presented as evidence that contrastive vision-language training can be improved on these axes by changes to data mixture and explicit de-biasing, without sacrificing benchmark accuracy.^[11]^[28]

Limitations and criticisms

Several limitations of SigLIP are commonly noted in the literature.

Closed training data. Both SigLIP and SigLIP 2 are trained on WebLI, a Google-internal dataset that has never been released publicly. This prevents external researchers from reproducing or auditing the training pipeline and means that SigLIP cannot be used as a baseline in studies that require fully open training data. Open replications using LAION or DataComp would need to be performed independently and cannot strictly reproduce the published numbers.^[4]^[5]

Limited multilingual coverage in original SigLIP. Although WebLI includes 109 languages, the original SigLIP paper focused on English data, and multilingual retrieval performance was significantly below what SigLIP 2 later achieved.^[2]^[11] Users who needed multilingual coverage before SigLIP 2 had to use the separate mSigLIP variant or other models.

Architectural conservatism. SigLIP retains the dual-encoder architecture from CLIP (a separate vision tower and a separate text tower with no cross-attention between them at training time). This is a deliberate choice in service of fast retrieval, but it caps the kind of fine-grained image-text interaction the model can learn, in contrast to CoCa, BLIP-2, and other models that use cross-attention or generative captioning objectives. SigLIP 2 partly addresses this by adding decoder-based captioning, but the headline checkpoints remain dual-encoder for inference.^[2]
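
The retrieval advantage of a dual encoder can be illustrated with a small sketch. The embeddings below are hand-written three-dimensional vectors standing in for vision-tower and text-tower outputs, and the file names and the `retrieve` helper are hypothetical. The point is only that image embeddings can be computed once and stored, so a query costs one text-encoder pass plus dot products, with no joint image-text forward pass.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Precomputed once, offline: hypothetical image embeddings.
image_index = {
    "cat.jpg": l2_normalize([0.9, 0.1, 0.0]),
    "dog.jpg": l2_normalize([0.1, 0.9, 0.1]),
    "car.jpg": l2_normalize([0.0, 0.1, 0.9]),
}


def retrieve(text_embedding, index, top_k=1):
    """Rank stored images by cosine similarity to a text embedding."""
    query = l2_normalize(text_embedding)
    ranked = sorted(index.items(), key=lambda item: dot(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]


# Hypothetical text embedding for a query such as "a photo of a dog".
best = retrieve([0.2, 0.8, 0.0], image_index)
print(best)
```

A cross-attention model would instead need to run the image and the query through the network together for every candidate, which is the trade-off the paragraph above describes.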

Dense and localization features in SigLIP 1. The original SigLIP was trained with a purely global objective (matched-pair classification) and was not optimized for dense per-patch features. The SigLIP 2 paper itself reports that adding decoder objectives and self-distillation losses substantially improves segmentation and depth-estimation transfer numbers (PASCAL mIoU rising from 73.8 to 78.1 on So/14 at 384), implicitly highlighting the dense-feature weakness of the original model.^[11]

Batch-size sensitivity. Although the sigmoid loss removes the strict need for very large batches, SigLIP and SigLIP 2 still report their best results with batch sizes in the tens of thousands and training runs that see tens of billions of examples. Reaching the published numbers therefore still requires substantial compute, even if it is less than CLIP-scale.^[1]^[11]

Bias and societal impact. While SigLIP 2 substantially reduces a specific measure of representation bias, the underlying training data is still web-crawled and inherits the biases of the open web. The SigLIP 2 paper acknowledges that explicit de-biasing focuses on a specific set of socially sensitive concepts and is not a general solution to fairness concerns in vision-language models.^[2]

Fixed-resolution legacy and document distortion. SigLIP 1 imposes a fixed square resolution per checkpoint (e.g., 384 for SigLIP-So400m-patch14-384), which forces aspect-ratio distortion on non-square images. This is detrimental for document images, screenshots, and OCR tasks. SigLIP 2's NaFlex variants and Gemma 3's Pan and Scan workaround were designed to address this limitation, which had also been the primary motivation for many downstream pipelines to add tiling or letter-boxing layers around SigLIP 1.^[10]^[31]
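
The geometry behind this limitation is simple to work through. The sketch below is a generic illustration, not SigLIP's actual preprocessing code: the function names are hypothetical and the 1700 x 2200 pixel page is a made-up portrait document scan. It compares resizing straight to a 384 x 384 square with letter-boxing, where the image is scaled to fit and the remainder is padded.

```python
def squash_scales(width, height, target):
    """Per-axis scale factors when resizing directly to target x target."""
    return target / width, target / height


def letterbox_geometry(width, height, target):
    """Scale the longer side to target, then pad the shorter side.

    Returns (new_width, new_height, pad_width, pad_height).
    """
    scale = target / max(width, height)
    new_width = round(width * scale)
    new_height = round(height * scale)
    return new_width, new_height, target - new_width, target - new_height


# Hypothetical portrait page fed to a 384 x 384 checkpoint.
scale_x, scale_y = squash_scales(1700, 2200, 384)
aspect_distortion = scale_x / scale_y  # > 1 means content is stretched horizontally

new_w, new_h, pad_w, pad_h = letterbox_geometry(1700, 2200, 384)
print(f"squash stretches width by {aspect_distortion:.2f}x relative to height")
print(f"letterbox keeps {new_w}x{new_h} of content and pads {pad_w} columns")
```

With these numbers, squashing stretches the page horizontally by about 1.29x, while letter-boxing preserves the shape but spends 87 of 384 columns on padding. Neither option is free, which is why native aspect-ratio handling is attractive for documents.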

Related work

SigLIP sits at the intersection of contrastive pre-training, the Vision Transformer line, and large multimodal models. Closely related precursors and contemporaries include:

CLIP (Contrastive Language-Image Pre-training), the original softmax-based contrastive image-text model from OpenAI, which SigLIP directly reformulates.^[16]
Contrastive learning more broadly, including SimCLR, MoCo, and BYOL on the vision side and approaches like ALIGN on the multimodal side.
The Vision Transformer (ViT) architecture, which provides the image encoder backbone for SigLIP, SigLIP 2, CLIP, and most modern visual foundation models.
Locked-image Tuning (LiT) from Zhai et al. 2022, the direct predecessor of SigLiT.^[18]
Knowledge distillation and self-distillation techniques (DINO, DINOv2), which SigLIP 2 incorporates via its global-local and masked-prediction losses.^[25]
Auxiliary captioning losses such as CapPa and LocCa, integrated into SigLIP 2 as decoder objectives.^[19]^[20]
The masked autoencoder (MAE) line of self-supervised pre-training, whose masking strategy is conceptually related to SigLIP 2's masked-prediction loss.
FlexiViT and NaViT, the patch-size and aspect-ratio innovations that together underpin SigLIP 2's NaFlex variants.^[26]^[27]

On the downstream side, SigLIP image encoders feed into LLaVA-family and Gemma/Gemma 2/Gemma 3-based multimodal systems such as PaliGemma, as well as a wide range of multimodal AI research models trained on top of frozen SigLIP visual features.^[12]^[13]^[14]^[31]

References

1. Zhai, Xiaohua; Mustafa, Basil; Kolesnikov, Alexander; Beyer, Lucas. "Sigmoid Loss for Language Image Pre-Training", arXiv, 2023-03-27. https://arxiv.org/abs/2303.15343. Accessed 2026-05-24.
2. Tschannen, Michael; Gritsenko, Alexey; Wang, Xiao; Naeem, Muhammad Ferjad; Alabdulmohsin, Ibrahim; Parthasarathy, Nikhil; Evans, Talfan; Beyer, Lucas; Xia, Ye; Mustafa, Basil; Henaff, Olivier; Harmsen, Jeremiah; Steiner, Andreas; Zhai, Xiaohua. "SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features", arXiv, 2025-02-20. https://arxiv.org/abs/2502.14786. Accessed 2026-05-24.
3. Google. "google/siglip-so400m-patch14-384 (model card)", Hugging Face, 2024. https://huggingface.co/google/siglip-so400m-patch14-384. Accessed 2026-05-24.
4. Chen, Xi; Wang, Xiao; Changpinyo, Soravit; et al. "PaLI: A Jointly-Scaled Multilingual Language-Image Model", arXiv, 2022-09-14. https://arxiv.org/abs/2209.06794. Accessed 2026-05-24.
5. Google Research. "Big Vision repository: SigLIP 2 configuration", GitHub, 2025. https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/README_siglip2.md. Accessed 2026-05-24.
6. Radford, Alec; et al. "Learning Transferable Visual Models From Natural Language Supervision", arXiv, 2021-02-26. https://arxiv.org/abs/2103.00020. Accessed 2026-05-24.
7. Zhai, Xiaohua; Mustafa, Basil; Kolesnikov, Alexander; Beyer, Lucas. "Sigmoid Loss for Language Image Pre-Training", Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) 2023. https://openaccess.thecvf.com/content/ICCV2023/html/Zhai_Sigmoid_Loss_for_Language_Image_Pre-Training_ICCV_2023_paper.html. Accessed 2026-05-24.
8. Hugging Face. "Paper page: Sigmoid Loss for Language Image Pre-Training", Hugging Face Papers, 2023. https://huggingface.co/papers/2303.15343. Accessed 2026-05-24.
9. Alabdulmohsin, Ibrahim; Zhai, Xiaohua; Kolesnikov, Alexander; Beyer, Lucas. "Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design", arXiv, 2023-05-22. https://arxiv.org/abs/2305.13035. Accessed 2026-05-24.
10. Hugging Face. "SigLIP 2: A better multilingual vision language encoder", Hugging Face Blog, 2025-02-21. https://huggingface.co/blog/siglip2. Accessed 2026-05-24.
11. Tschannen, Michael; et al. "SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features" (HTML version with tables), arXiv, 2025-02-20. https://arxiv.org/html/2502.14786v1. Accessed 2026-05-24.
12. Beyer, Lucas; Steiner, Andreas; Pinto, Andre Susano; Kolesnikov, Alexander; et al. "PaliGemma: A versatile 3B VLM for transfer", arXiv, 2024-07-10. https://arxiv.org/abs/2407.07726. Accessed 2026-05-24.
13. Hugging Face. "PaliGemma model documentation", Hugging Face Transformers docs, 2024. https://huggingface.co/docs/transformers/model_doc/paligemma. Accessed 2026-05-24.
14. Hugging Face. "LLaVA-OneVision model documentation", Hugging Face Transformers docs, 2024. https://huggingface.co/docs/transformers/model_doc/llava_onevision. Accessed 2026-05-24.
15. Hugging Face. "SigLIP2 model documentation", Hugging Face Transformers docs, 2025. https://huggingface.co/docs/transformers/model_doc/siglip2. Accessed 2026-05-24.
16. Radford, Alec; Kim, Jong Wook; Hallacy, Chris; et al. "Learning Transferable Visual Models From Natural Language Supervision (CLIP)", arXiv, 2021-02-26. https://arxiv.org/abs/2103.00020. Accessed 2026-05-24.
17. Sun, Quan; Fang, Yuxin; Wu, Ledell; Wang, Xinlong; Cao, Yue. "EVA-CLIP: Improved Training Techniques for CLIP at Scale", arXiv, 2023-03-27. https://arxiv.org/abs/2303.15389. Accessed 2026-05-24.
18. Zhai, Xiaohua; Wang, Xiao; Mustafa, Basil; Susano Pinto, Andre; Dosovitskiy, Alexey; Kolesnikov, Alexander; Steiner, Andreas; Beyer, Lucas. "LiT: Zero-Shot Transfer with Locked-image text Tuning", arXiv, 2021-11-15. https://arxiv.org/abs/2111.07991. Accessed 2026-05-24.
19. Tschannen, Michael; Kumar, Manoj; Steiner, Andreas; Zhai, Xiaohua; Houlsby, Neil; Beyer, Lucas. "Image Captioners Are Scalable Vision Learners Too (CapPa)", arXiv, 2023-06-13. https://arxiv.org/abs/2306.07915. Accessed 2026-05-24.
20. Wan, Bo; Tschannen, Michael; et al. "LocCa: Visual Pretraining with Location-aware Captioners", arXiv, 2024-03-28. https://arxiv.org/abs/2403.19596. Accessed 2026-05-24.
21. Taha, Ahmed. "Sigmoid Loss for Language Image Pre-Training (technical walkthrough)", Medium, 2023-04-04. https://ahmdtaha.medium.com/sigmoid-loss-for-language-image-pre-training-2dd5e7d1af84. Accessed 2026-05-24.
22. Kyouma45. "SigLIP vs. CLIP: Overcoming the Softmax Bottleneck in Vision-Language Models", Towards AI, 2024. https://pub.towardsai.net/siglip-vs-clip-overcoming-the-softmax-bottleneck-in-vision-language-models-95fae9db6570. Accessed 2026-05-24.
23. Wang, Xiao; et al. "Global Minimizers of Sigmoid Contrastive Loss", arXiv, 2025. https://arxiv.org/pdf/2509.18552. Accessed 2026-05-24.
24. MarkTechPost. "Google DeepMind Research Releases SigLIP2: A Family of New Multilingual Vision-Language Encoders", MarkTechPost, 2025-02-21. https://www.marktechpost.com/2025/02/21/google-deepmind-research-releases-siglip2-a-family-of-new-multilingual-vision-language-encoders-with-improved-semantic-understanding-localization-and-dense-features/. Accessed 2026-05-24.
25. Oquab, Maxime; Darcet, Timothee; et al. "DINOv2: Learning Robust Visual Features without Supervision", arXiv, 2023-04-14. https://arxiv.org/abs/2304.07193. Accessed 2026-05-24.
26. Beyer, Lucas; Izmailov, Pavel; Kolesnikov, Alexander; et al. "FlexiViT: One Model for All Patch Sizes", arXiv, 2022-12-15. https://arxiv.org/abs/2212.08013. Accessed 2026-05-24.
27. Dehghani, Mostafa; Mustafa, Basil; et al. "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution", arXiv, 2023-07-12. https://arxiv.org/abs/2307.06304. Accessed 2026-05-24.
28. Thapliyal, Ashish V.; Pont-Tuset, Jordi; Chen, Xi; Soricut, Radu. "Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset", arXiv, 2022-05-25. https://arxiv.org/abs/2205.12522. Accessed 2026-05-24.
29. Hugging Face. "PaliGemma - Google's Cutting-Edge Open Vision Language Model", Hugging Face Blog, 2024-05-14. https://huggingface.co/blog/paligemma. Accessed 2026-05-24.
30. Hugging Face. "Welcome PaliGemma 2 - New vision language models by Google", Hugging Face Blog, 2024-12-05. https://huggingface.co/blog/paligemma2. Accessed 2026-05-24.
31. Google. "Gemma 3 Technical Report", arXiv, 2025-03-25. https://arxiv.org/abs/2503.19786. Accessed 2026-05-24.
32. Laurencon, Hugo; Tronchon, Leo; Cord, Matthieu; Sanh, Victor. "Introducing Idefics2: A Powerful 8B Vision-Language Model for the community", Hugging Face Blog, 2024-04-15. https://huggingface.co/blog/idefics2. Accessed 2026-05-24.
33. Hugging Face. "Idefics3 model documentation", Hugging Face Transformers docs, 2024. https://huggingface.co/docs/transformers/model_doc/idefics3. Accessed 2026-05-24.
34. Wang, Peng; Bai, Shuai; et al. "Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution", arXiv, 2024-09-18. https://arxiv.org/abs/2409.12191. Accessed 2026-05-24.
35. Hugging Face. "SmolVLM-Instruct model card", Hugging Face Hub, 2025-01-23. https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct. Accessed 2026-05-24.
36. Google for Developers. "MedSigLIP model card", Google Health AI Developer Foundations, 2025-07-09. https://developers.google.com/health-ai-developer-foundations/medsiglip/model-card. Accessed 2026-05-24.
37. Google Cloud. "Multimodal embeddings API - Generative AI on Vertex AI", Google Cloud Documentation, 2025. https://docs.cloud.google.com/vertex-ai/generative-ai/docs/embeddings. Accessed 2026-05-24.
38. OpenCLIP. "open_clip: An open source implementation of CLIP, with SigLIP integration", GitHub, 2024. https://github.com/mlfoundations/open_clip. Accessed 2026-05-24.
