SigLIP
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 6,578 words
SigLIP (Sigmoid Loss for Language-Image Pre-training) is a family of vision-language encoders developed by researchers at Google DeepMind that pre-trains image and text encoders by treating each image-text pair as an independent binary classification problem rather than as part of a batch-wide softmax distribution. Introduced by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer in the 2023 paper "Sigmoid Loss for Language Image Pre-Training", SigLIP replaces the softmax-based contrastive loss that underlies CLIP with a per-pair sigmoid loss that does not require a global view of the similarity matrix.[1] The reformulation reduces the memory cost of large-batch contrastive pre-training, performs better at small batch sizes, and produces vision encoders that have become widely used as the visual backbone in downstream vision-language models. A successor family, SigLIP 2, released by a larger Google DeepMind team led by Michael Tschannen in February 2025, extends the recipe with captioning objectives, self-distillation losses, online data curation, multilingual training, and native aspect-ratio variants.[2]
| Field | Details |
|---|---|
| Original authors | Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer (Google DeepMind)[1] |
| First release | arXiv preprint 2023-03-27; ICCV 2023[1] |
| Latest version | SigLIP 2 (arXiv 2025-02-20)[2] |
| Model sizes (SigLIP 2) | ViT-B (86M), ViT-L (303M), ViT-So400m (400M), ViT-g (1B)[2][3] |
| Training data | WebLI (10B images, 12B alt-texts, 109 languages)[4][5] |
| Primary loss | Pairwise sigmoid binary cross-entropy with learnable temperature and bias[1] |
| Reference paper (SigLIP) | arXiv:2303.15343[1] |
| Reference paper (SigLIP 2) | arXiv:2502.14786[2] |
| License | Apache 2.0 (model checkpoints on Hugging Face)[3] |
Contrastive language-image pre-training, popularized by OpenAI's CLIP in 2021, learns aligned image and text representations by training paired encoders so that the embedding of an image is close to the embedding of its caption and far from the embeddings of other captions in the same training batch.[1][6] The standard CLIP objective is a symmetric softmax cross-entropy over the matrix of pairwise image-text similarities. For a batch of N image-text pairs, the softmax loss requires computing similarities between every image and every text in the batch, then normalizing each row (image-to-text) and each column (text-to-image) of the similarity matrix. Because both normalizations operate over the full batch, the loss is intrinsically global: doubling the batch size doubles the number of negatives that each positive example contrasts against.[1]
This global-normalization property pushed CLIP-style models toward very large batch sizes (32k or higher) and led projects such as OpenAI's CLIP, LAION's OpenCLIP, and EVA-CLIP to invest heavily in distributed contrastive training infrastructure. Two practical issues arose. First, computing and storing the full N x N pairwise similarity matrix at very large N is memory-intensive and bandwidth-intensive in distributed settings, because each device must gather embeddings from every other device to form its rows and columns of the matrix. Second, the optimal batch size for contrastive learning was empirically large but poorly characterized, leaving open whether the benefits of bigger batches were intrinsic to contrastive learning or artifacts of the softmax formulation.[1][6]
The SigLIP paper, submitted to arXiv on 2023-03-27 and accepted as an oral presentation at the IEEE International Conference on Computer Vision (ICCV) 2023 in Paris, set out to answer those questions by replacing softmax cross-entropy with a per-pair sigmoid loss.[1][7] The work was led by Xiaohua Zhai, then a senior staff research scientist at Google DeepMind, and Lucas Beyer, who together with Alexander Kolesnikov and Basil Mustafa had previously developed the Vision Transformer (ViT) scaling and Locked-image Tuning (LiT) lines of work.[1][7][18] The same team subsequently published the SigLIP 2 paper in February 2025, broadening the recipe and the author list to include Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Ye Xia, Olivier Henaff, Jeremiah Harmsen, and Andreas Steiner.[2]
SigLIP builds on a sequence of vision-language papers from the same research lineage. The most direct predecessor is Locked-image Tuning (LiT), published at CVPR 2022 by Zhai, Wang, Mustafa, Susano Pinto, Alexey Dosovitskiy, Kolesnikov, Andreas Steiner, and Beyer.[18] LiT keeps a strong pretrained vision tower frozen and trains only a text encoder against it with a contrastive softmax loss, achieving a high zero-shot ImageNet accuracy with comparatively modest text-side compute.[18] The SigLIP paper's "SigLiT" variant reuses this idea but swaps the softmax for a sigmoid loss. A second relevant predecessor is PaLI (2022), introduced by Xi Chen, Xiao Wang, Soravit Changpinyo and colleagues, which jointly trained an image encoder and a text decoder on the WebLI dataset and demonstrated competitive performance across 100+ languages.[4] WebLI, the proprietary dataset assembled for PaLI, became the standard pretraining corpus for both SigLIP and SigLIP 2.[2][4]
A complementary thread of work investigated whether captioning losses alone could rival contrastive losses for vision representation learning. CapPa (Image Captioners Are Scalable Vision Learners Too, NeurIPS 2023) by Tschannen, Manoj Kumar, Steiner, Zhai, Neil Houlsby, and Beyer trained a vision encoder paired with a text decoder, alternating autoregressive captioning (25%) and parallel captioning (75%) with all input tokens masked.[19] CapPa matched or surpassed CLIP on a variety of downstream tasks, suggesting that captioning is a strong auxiliary signal even when the goal is a vision-only representation.[19] LocCa (Visual Pretraining with Location-aware Captioners, NeurIPS 2024) by Bo Wan, Tschannen, and collaborators added bounding-box and grounded-captioning losses to CapPa, lifting RefCOCO performance to 88.34% on val and giving the decoder explicit spatial supervision.[20] CapPa and LocCa are explicit components of the SigLIP 2 recipe described below.[2]
The central technical idea of SigLIP is to treat each entry of the N x N image-text similarity matrix as a binary classification problem rather than as a row or column of a softmax distribution. For a batch of N image-text pairs (with images x_i and texts y_j), the model computes cosine similarities s_ij between image embedding f(x_i) and text embedding g(y_j), then applies a sigmoid function sigma(u) = 1 / (1 + exp(-u)) to a scaled and shifted version of those similarities. The label is +1 when i = j (a matched pair) and -1 otherwise, and the loss is the negative log-likelihood of the sigmoid prediction, summed over every cell and divided by the batch size N:[1]
L = -(1/N) sum_{i,j} log( sigma( z_ij * (t * s_ij + b) ) )
where t is a learnable temperature, b is a learnable bias, and z_ij = +1 for matched pairs and -1 for unmatched pairs.[1] The temperature behaves analogously to the temperature in CLIP, and the paper parameterizes it as exp(t_prime), where t_prime is a learnable scalar.[1] The bias term is novel and is initialized to a large negative value (the authors use b = -10) so that the sigmoid output for every pair is small early in training. The motivation is class imbalance: a batch of size N contains N positive pairs but N(N-1) negative pairs, so without the bias the negatives would dominate the loss and swamp the initial gradient signal.[1][21] Starting with predictions near 0 (no match) keeps the initial loss close to this prior, so the first optimization steps are not spent on a large correction for the negatives and the model can instead accumulate evidence for the diagonal positives.[1]
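The formula can be made concrete with a small pure-Python sketch. It is not the authors' implementation: the 3 x 3 similarity matrix is made up for illustration, and the starting temperature t = 10 is an assumption (the text above only states the bias initialization b = -10).

```python
import math

def log_sigmoid(u):
    """Numerically stable log(sigmoid(u))."""
    # Branch on the sign so that exp() never receives a large positive argument.
    if u >= 0:
        return -math.log1p(math.exp(-u))
    return u - math.log1p(math.exp(u))

def sigmoid_loss(sims, t, b):
    """Pairwise sigmoid loss over an N x N matrix of cosine similarities.

    sims[i][j] is s_ij (image i vs. text j); t is the temperature, b the bias.
    """
    n = len(sims)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = 1.0 if i == j else -1.0  # +1 on the diagonal, -1 elsewhere
            total += log_sigmoid(z * (t * sims[i][j] + b))
    return -total / n

# Toy similarities (illustrative values only, not from the paper).
sims = [
    [0.9, 0.1, -0.2],
    [0.0, 0.8, 0.3],
    [-0.1, 0.2, 0.7],
]

# b = -10 follows the text; t = 10 is an assumed starting temperature.
loss_at_init = sigmoid_loss(sims, t=10.0, b=-10.0)

# Cosine similarities lie in [-1, 1], so with t = 10 and b = -10 every logit
# t * s + b is at most 0 and every initial match probability is at most 0.5.
max_prob = max(
    1.0 / (1.0 + math.exp(-(10.0 * s - 10.0))) for row in sims for s in row
)
```

Under these assumed values the model starts out predicting "no match" for every cell, which is the behavior the negative bias initialization is meant to produce.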
CLIP uses the symmetric InfoNCE loss applied to logits L_ij = t * s_ij:
L_softmax = -(1/2N) sum_i [ log(exp(L_ii) / sum_j exp(L_ij)) + log(exp(L_ii) / sum_j exp(L_ji)) ]
Computing the denominator for any row of the matrix requires the similarities of x_i to every text in the batch, and computing it for any column requires the similarities of y_j to every image in the batch. In a distributed setting both rows and columns are split across devices, so every device must gather partial similarities from every other device. The cost of this all-gather and the memory needed to materialize the full N x N matrix grow as O(N^2), so batch size is bounded by the per-device memory available for the similarity matrix as well as by the model parameters and activations.[1][22]
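As a rough illustration of why the softmax loss is global, the following pure-Python sketch evaluates the symmetric loss above on a logit matrix. Each diagonal term needs a log-sum-exp over a full row and a full column. The helper names are hypothetical and the code favors readability over speed.

```python
import math

def logsumexp(values):
    """Stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def clip_softmax_loss(logits):
    """Symmetric softmax (InfoNCE) loss over an N x N logit matrix L_ij."""
    n = len(logits)
    total = 0.0
    for i in range(n):
        row = logits[i]                          # image i against every text
        col = [logits[j][i] for j in range(n)]   # text i against every image
        total += logits[i][i] - logsumexp(row)   # image-to-text term
        total += logits[i][i] - logsumexp(col)   # text-to-image term
    return -total / (2 * n)

# With uninformative (all-equal) logits the loss equals log(N).
uniform_loss = clip_softmax_loss([[0.0] * 4 for _ in range(4)])

# With a single pair there are no negatives, so the loss is zero
# regardless of the logit value.
single_pair_loss = clip_softmax_loss([[3.0]])
```

The `row` and `col` lists are exactly the quantities that must be assembled across devices in a distributed run.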
The sigmoid loss makes each cell (i, j) contribute independently. The gradient of the binary cross-entropy at logit u is sigma(u) - y, where y is the binary target, so per-cell updates can be computed locally on whichever device holds (x_i, y_j) without global communication.[1][21] In the paper the authors show that this enables a "chunked" implementation in which sub-batches of text features are exchanged between devices in a ring, the sigmoid loss is accumulated piece by piece, and per-device memory remains bounded by the square of the local chunk size (written b^2 here, where b denotes the per-device batch size rather than the bias term) instead of the square of the global batch size, N^2.[1][22] On four TPUv4 chips, the chunked implementation fit a global batch of 4,096 image-text pairs for a SigLIP-Base model, whereas a comparable CLIP-Base model could only reach a global batch of 2,048 under identical hardware and architecture.[22]
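The chunked accumulation can be simulated on a single machine. The sketch below is a toy model of the idea, not the paper's TPU implementation: "devices" are list indices, the ring exchange is an index shift, and the embeddings are made-up unit vectors. The bias is named `bias` to keep it distinct from the per-device chunk size.

```python
import math

def log_sigmoid(u):
    if u >= 0:
        return -math.log1p(math.exp(-u))
    return u - math.log1p(math.exp(u))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def block_loss(img_chunk, txt_chunk, img_ids, txt_ids, t, bias):
    """Summed sigmoid loss over one image-chunk x text-chunk block."""
    total = 0.0
    for img, i in zip(img_chunk, img_ids):
        for txt, j in zip(txt_chunk, txt_ids):
            z = 1.0 if i == j else -1.0  # matched only when global ids agree
            total -= log_sigmoid(z * (t * dot(img, txt) + bias))
    return total

def chunked_sigmoid_loss(img_chunks, txt_chunks, t, bias):
    """Accumulate the loss block by block, as if text chunks moved around a ring."""
    num_devices = len(img_chunks)
    chunk = len(img_chunks[0])
    n = num_devices * chunk
    # ids[d] lists the global pair indices owned by device d.
    ids = [list(range(d * chunk, (d + 1) * chunk)) for d in range(num_devices)]
    total = 0.0
    for step in range(num_devices):
        for d in range(num_devices):
            src = (d + step) % num_devices  # owner of the text chunk held by d now
            total += block_loss(img_chunks[d], txt_chunks[src],
                                ids[d], ids[src], t, bias)
    return total / n

# Two simulated devices with two pairs each (illustrative unit vectors).
img_chunks = [[[1.0, 0.0], [0.6, 0.8]], [[0.0, 1.0], [-0.8, 0.6]]]
txt_chunks = [[[0.8, 0.6], [0.6, 0.8]], [[0.0, 1.0], [-1.0, 0.0]]]

chunked = chunked_sigmoid_loss(img_chunks, txt_chunks, t=10.0, bias=-10.0)

# Reference: the same loss computed on the full 4 x 4 matrix at once.
all_imgs = [v for c in img_chunks for v in c]
all_txts = [v for c in txt_chunks for v in c]
n = len(all_imgs)
full = block_loss(all_imgs, all_txts, range(n), range(n), 10.0, -10.0) / n
```

At any moment each simulated device only touches a chunk-by-chunk block of similarities, yet the accumulated total matches the full-matrix loss up to floating-point rounding.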
A second consequence of the sigmoid formulation is that it is well-defined for any batch size, including batch size 1, whereas the softmax loss is degenerate at very small batch sizes (a single positive example would contrast against only itself). The SigLIP authors exploited this property to perform a careful study of batch-size dependence, scanning from very small batches up to one million examples.[1][21] Their empirical finding was that both losses improve as batch size grows, that gains saturate near 32k pairs, and that batches beyond about 256k can hurt rather than help.[1][21] At small batch sizes (below 16k) the sigmoid loss outperforms softmax by a clear margin, which is the regime relevant for academic and small-industry training runs.[1][8] Specifically, the paper reports that the sigmoid loss reaches peak ImageNet zero-shot accuracy at a batch size of 32k, while the softmax loss needs a batch size of 98k to reach its own peak and still trails the sigmoid result.[1][22] The same 32k saturation point holds for multilingual training on over 100 languages.[1]
The sigmoid loss has also been studied for its robustness properties. Per-pair independence makes the loss less sensitive to mislabeled pairs in noisy web data: a single false positive contributes only to its own cell rather than affecting normalization in every other row and column.[21] Subsequent theoretical work has examined the global minimizers of sigmoid contrastive loss and the geometric properties of the learned embedding space, formalizing the conditions under which sigmoid and softmax contrastive losses yield equivalent representations and the regimes in which they diverge.[23]
SigLiT is the variant in which the image encoder is initialized from a pretrained ViT classifier (typically trained on JFT or ImageNet) and frozen, and only the text encoder is trained from scratch using the sigmoid loss.[1][18] This builds directly on the earlier Locked-image Tuning (LiT) recipe by the same authors, in which the pretrained vision tower acts as a fixed "teacher" feature extractor and the text tower learns to align to its representation space.[18] The SigLIP paper reports that a SigLiT model that reuses a g/14 vision checkpoint and is trained on four TPUv4 chips for two days reaches 84.5% zero-shot top-1 accuracy on ImageNet, demonstrating that the sigmoid formulation enables strong contrastive models without large-scale compute.[1][8] A smaller Base/8 + L/16 SigLiT configuration reaches 79.7% on the same setup in one day.[1]
In the full (non-locked) setting both encoders are trained jointly. SigLIP models are typically based on Vision Transformer backbones at Base, Large, and shape-optimized 400M ("So400m") scales. The shape-optimized 400M architecture comes from a separate Google paper, "Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design" (arXiv:2305.13035) by Alabdulmohsin, Zhai, Kolesnikov, and Beyer, which derived the SoViT-400M shape (width 1152, depth 27, MLP dimension 4304) by jointly tuning depth, width, and MLP dimension under compute-optimal scaling laws.[9] SoViT-400M/14 was reported to reach 90.3% fine-tuning accuracy on ImageNet, surpassing the much larger ViT-g/14 and approaching ViT-G/14 at less than half the inference cost.[9] The SigLIP-So400m-patch14-384 checkpoint, the most widely deployed variant of the original SigLIP, was trained on 16 TPUv4 chips for three days and processes 384x384 images as 729 patch tokens of size 14x14 with mean and standard deviation (0.5, 0.5, 0.5) for image normalization.[3] Text inputs are tokenized to a maximum sequence length of 64 tokens.[3]
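The preprocessing numbers quoted for the So400m-patch14-384 checkpoint can be checked with a few lines of arithmetic. The sketch assumes that pixel values are rescaled to [0, 1] before normalization and that the patch embedding is non-overlapping and unpadded. Both are assumptions about the pipeline rather than statements from the text above.

```python
def normalize_pixel(value, mean=0.5, std=0.5):
    """Normalize one channel value, assumed to be already rescaled to [0, 1]."""
    return (value - mean) / std

def patches_per_side(image_size, patch_size):
    """Whole patches that fit along one axis without padding."""
    return image_size // patch_size

side = patches_per_side(384, 14)      # 27 patches along each axis
num_patch_tokens = side * side        # 729 patch tokens
leftover_pixels = 384 - side * 14     # 6 pixels per axis not covered by a whole patch

# Mean 0.5 and std 0.5 map the range [0, 1] onto [-1, 1].
darkest = normalize_pixel(0.0)
brightest = normalize_pixel(1.0)
```

Because 384 is not a multiple of 14, the 729-token figure corresponds to a 27 x 27 grid rather than an exact tiling of the image.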
Both SigLIP and SigLIP 2 were trained on WebLI, a Google-internal web-scale image-text dataset introduced in the 2022 PaLI paper by Chen et al.[4] WebLI contains approximately 10 billion images paired with 12 billion alt-texts spanning 109 languages, drawn from public web pages and filtered for safety, deduplication, and basic quality criteria.[4][5] The original SigLIP was trained primarily on the English subset of WebLI, while SigLIP 2 used a more diverse 90% English plus 10% non-English mixture, which underpins its improved multilingual performance.[2][5] WebLI's image-text pairs are filtered using prior image-text alignment models, and later iterations include both noisy alt-text captions and machine-generated captions.[4][5] SigLIP 2 also adopts the multilingual Gemma tokenizer (vocabulary size 256k), which gives the text encoder full coverage of the language mixture in WebLI rather than the English-only tokenizer used by SigLIP 1.[5]
The non-public nature of WebLI is one of the principal differences between SigLIP and open replications such as OpenCLIP, which train on the public LAION-2B and LAION-5B datasets. Researchers cannot exactly replicate Google's results without access to WebLI, although the SigLIP model checkpoints themselves are released under permissive licenses on Hugging Face.[3][15]
Released on 2025-02-20 as arXiv:2502.14786 and titled "SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features," SigLIP 2 is described by its authors as a "unified recipe" that combines the original sigmoid contrastive objective with several previously independent training signals.[2] The headline additions are captioning objectives, self-distillation losses, online data curation, and multilingual data, with two architectural variants (fixed resolution and NaFlex) for each model size.[2][10][24]
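SigLIP 2 trains the standard image and text encoders with the original sigmoid loss, augmented by additional training signals drawn from earlier work: decoder-based captioning objectives in the style of CapPa and LocCa, self-distillation losses, and online data curation.[2][10]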
The combined recipe is run on up to 2,048 TPUv5e chips with a fully-sharded data-parallel strategy at a global batch size of 32,000 for a total of 40 billion examples seen.[11] The optimizer is AdamW, the global learning rate uses the standard linear warmup followed by cosine decay, and the training pipeline is implemented in the open-source big_vision codebase that Google Research maintains in JAX.[5][11]
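A linear-warmup, cosine-decay schedule of the kind mentioned above can be sketched as follows. The step count follows from the figures in the text (40 billion examples at a batch size of 32,000), while the peak learning rate, the warmup length, and the decay to exactly zero are placeholder assumptions, not values taken from the text.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed end value)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# 40 billion examples seen at a global batch size of 32,000.
total_steps = 40_000_000_000 // 32_000   # 1,250,000 optimizer steps

# Hypothetical hyperparameters, for illustration only.
PEAK_LR = 1e-3
WARMUP_STEPS = 20_000

lr_start = lr_at_step(0, total_steps, WARMUP_STEPS, PEAK_LR)
lr_peak = lr_at_step(WARMUP_STEPS, total_steps, WARMUP_STEPS, PEAK_LR)
lr_mid = lr_at_step(total_steps // 2, total_steps, WARMUP_STEPS, PEAK_LR)
lr_end = lr_at_step(total_steps, total_steps, WARMUP_STEPS, PEAK_LR)
```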
SigLIP 2 releases checkpoints at four sizes (ViT-B with 86 million parameters, ViT-L with 303 million, ViT-So400m with 400 million, and ViT-g with 1 billion) and at multiple input resolutions (commonly 224, 256, 384, and 512 pixels).[2][10] For each of Base, Large, and So400m, two architectural families are released: a fixed-resolution variant and a NaFlex variant that supports native aspect ratios.[10]
The full lineup is therefore: Base/32 at 256, Base/16 at 224, 256, 384, and 512, Large/16 at 256, 384, and 512, SO400M/14 at 224 and 384, SO400M/16 at 256, 384, and 512, and giant/16 at 256 and 384, each in fixed and (where applicable) NaFlex form.[10]
The SigLIP 2 paper reports improvements over the original SigLIP across nearly all evaluation categories. Selected ImageNet-1k zero-shot accuracy numbers from the paper include:[11]
| Setting | SigLIP | SigLIP 2 |
|---|---|---|
| ImageNet-1k zero-shot, ViT-B/16 at 256 | 76.7% | 79.1% |
| ImageNet-1k zero-shot, ViT-L/16 at 256 | n/a | 82.5% |
| ImageNet-1k zero-shot, ViT-So400m/14 at 384 | n/a | 84.1% |
| ImageNet-1k zero-shot, ViT-g/16 at 256 | n/a | 84.5% |
| ImageNet-v2 zero-shot, ViT-L/16 at 256 | 74.2% | 76.8% |
| COCO text-to-image Recall@1, ViT-L/16 at 256 | 81.3% | 84.1% |
| PASCAL semantic segmentation mIoU, ViT-So/14 at 384 | 73.8 | 78.1 |
| NYUv2 depth estimation RMSE, ViT-So/14 (lower is better) | 0.563 | 0.466 |
| Representation bias (lower is better), ViT-L/16 at 256 | 35.5% | 7.3% |
On the Crossmodal-3600 (XM3600) multilingual retrieval benchmark, which covers 36 languages, SigLIP 2 ViT-B/16 at 256 pixels improves recall from 22.5% under SigLIP to 40.7%, approaching the performance of the multilingual mSigLIP variant despite its English-dominant 90/10 English/multilingual mixture.[11][28] At ViT-L/16 and 256 pixels, the XM3600 recall@1 reaches 46.5%, and the gap to mSigLIP closes further at larger scales.[11] The paper attributes the localization and dense-prediction improvements primarily to the decoder and self-distillation objectives, and the multilingual and fairness improvements primarily to the data mixture and de-biasing.[2][11]
A second design choice that the paper emphasizes is active distillation through online data curation. During training, image-text pairs are scored by an EMA copy of the model and re-weighted so that smaller models effectively learn from a curriculum of progressively more informative pairs.[2][24] The authors report that this benefits small variants disproportionately: the Base SigLIP 2 model closes a noticeable fraction of the gap to Large and So400m on ImageNet zero-shot and on retrieval benchmarks, which previous CLIP-style training had not achieved.[2][24]
SigLIP 2 uses the same dual-encoder architecture as SigLIP 1, the same patch sizes, and the same JAX/big_vision training stack, so downstream code that consumed SigLIP-So400m checkpoints can swap in a SigLIP 2-So400m checkpoint without re-engineering the pipeline.[10][15] The main upgrade visible to downstream users is the multilingual Gemma tokenizer (256k vocabulary), which replaces the English-only SentencePiece tokenizer of SigLIP 1 and allows the text encoder to handle non-Latin scripts and multi-byte tokens without re-tuning.[5][10]
SigLIP has been adopted as the default visual encoder in a growing list of open and proprietary vision-language models, often replacing OpenAI's CLIP image tower. The shape-optimized SigLIP-So400m checkpoint is particularly widely used because its 400 million parameters strike a favorable balance between representation quality and inference cost.[3][15]
PaliGemma, released by Google on 2024-05-14 and described in arXiv:2407.07726 (posted in July 2024), is a 3-billion-parameter vision-language model that pairs a SigLIP-So400m vision encoder with a Gemma 2B language model.[12][29] Images are encoded into patch tokens by SigLIP at 224x224, 448x448, or 896x896, projected by a linear adapter from the SigLIP output dimension of 1152 to the Gemma embedding dimension of 2048, and concatenated with text tokens as input to the Gemma decoder.[12][29] At 224 resolution the model produces 256 image tokens; at 448, 1,024 tokens; at 896, 4,096 tokens. PaliGemma was evaluated on roughly 40 transfer tasks, including standard image captioning and visual question answering benchmarks as well as more specialized domains like remote sensing and segmentation, and serves as a base model for transfer learning rather than as an end-user chatbot.[12][29] PaliGemma 2, released later in 2024, upgrades the decoder to the Gemma 2 family (with 2B, 9B, and 27B variants) while continuing to use SigLIP-family vision encoders.[13][30]
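The token counts quoted above follow directly from the patch grid. The short sketch below reproduces them, assuming a patch size of 14 (the So400m/14 encoder) and one image token per patch. The function names are illustrative and do not come from the PaliGemma codebase.

```python
SIGLIP_DIM = 1152   # SigLIP output dimension, from the text
GEMMA_DIM = 2048    # Gemma embedding dimension, from the text
PATCH_SIZE = 14     # assumed patch size of the So400m/14 encoder

def image_token_count(resolution, patch_size=PATCH_SIZE):
    """Number of patch tokens for a square input of the given resolution."""
    side = resolution // patch_size
    return side * side

def projected_shape(resolution):
    """Shape of the image-token block after the linear adapter."""
    return (image_token_count(resolution), GEMMA_DIM)

counts = {res: image_token_count(res) for res in (224, 448, 896)}
shape_at_448 = projected_shape(448)
```

Doubling the resolution quadruples the number of image tokens, which is why the 896-pixel variant is considerably more expensive for the decoder than the 224-pixel one.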
The vision-capable Gemma 3 models (March 2025) use a 400M variant of the SigLIP encoder operating at a fixed resolution of 896x896 and producing a sequence of 256 dense "soft tokens" per image.[31] To handle high-resolution and non-square images without degrading performance, Gemma 3 adds a Pan and Scan algorithm at inference time: the image is segmented into non-overlapping crops of equal size that together cover the image, and each crop is resized to 896x896 and passed through the encoder separately.[31] The Pan and Scan algorithm specifically helps with tasks involving non-square aspect ratios, high-resolution photos, and embedded text reading, addressing a limitation of fixed-resolution SigLIP.[31]
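The cropping step can be sketched as a simple grid split. This is only an illustration of "non-overlapping crops that together cover the image": the grid shape is passed in explicitly, because the rule Gemma 3 uses to choose the number of crops is not described here, and the function name is hypothetical.

```python
def crop_boxes(width, height, cols, rows):
    """Split an image into a cols x rows grid of non-overlapping crops.

    Returns (left, top, right, bottom) boxes that tile the image exactly.
    Boundaries are rounded to whole pixels, so crop sizes can differ by one
    pixel when the image size is not divisible by the grid.
    """
    xs = [round(k * width / cols) for k in range(cols + 1)]
    ys = [round(k * height / rows) for k in range(rows + 1)]
    return [
        (xs[c], ys[r], xs[c + 1], ys[r + 1])
        for r in range(rows)
        for c in range(cols)
    ]

# A 2:1 landscape image split into two square crops.
boxes = crop_boxes(1792, 896, cols=2, rows=1)

# Each crop would then be resized to 896 x 896 and encoded separately,
# yielding 256 soft tokens per crop.
crop_tokens = len(boxes) * 256
```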
In the open-source community, the LLaVA series of vision-language models migrated from OpenAI CLIP to SigLIP image encoders as their default. LLaVA-OneVision, an August 2024 release in the LLaVA family, uses a SigLIP-So400M encoder paired with a Qwen2 language backbone, and several research VLMs (LLaVA-MORE, LLaVA-NeXT variants, and others) similarly adopt SigLIP-So400m or SigLIP 2 checkpoints as their vision tower.[14] The widespread switch is generally attributed to SigLIP's higher zero-shot quality and to the convenience of the 400M shape-optimized scale, which is small enough to be tractable inside a multimodal pipeline while strong enough to set the state of the art on transfer benchmarks.[14]
The Hugging Face team's Idefics2 (released 2024-05-03) and Idefics3 (released 2024-08-22) open vision-language models both use SigLIP-So400m-patch14-384 as the image tower.[32][33] Idefics2 pairs the encoder with Mistral 7B and supports two training stages: a first stage at SigLIP's native 384x384 and a second stage at native aspect ratios up to 980 pixels on the long side.[32] Idefics3 swaps the language backbone to Llama 3.1 Instruct and adds a pixel-shuffle strategy that reduces visual tokens to 169 while supporting input resolutions up to roughly 364x364 per tile.[33] Both models report Apache 2.0 licenses and are widely used as research baselines for instruction-following vision-language models.[32][33]
The Qwen-VL family from Alibaba initially used CLIP-style encoders but later versions (Qwen2.5-VL and later) shifted toward dynamic-resolution Native-ViT designs influenced by NaViT and SigLIP's NaFlex.[34] While Qwen2-VL itself uses a 675M Vision Transformer trained from scratch rather than a SigLIP checkpoint, the design space in which it operates was shaped by SigLIP's results on smaller batch sizes and the NaFlex/NaViT aspect-ratio handling.[34]
The SmolVLM family from Hugging Face (released January 2025) explicitly targets edge devices and laptops, with 256M and 500M variants that fit in under 1 GB of RAM.[35] Both small variants use a reduced 93M-parameter SigLIP vision encoder paired with SmolLM2 text decoders; the full SigLIP-So400M vision tower is used by the larger, earlier SmolVLM model.[35] Training data is drawn from The Cauldron and Docmatix, weighted toward document understanding and image captioning, demonstrating that SigLIP-based VLMs can be compressed into small footprints suitable for on-device use.[35]
MedSigLIP (released 2025-07-09) is a domain-adapted SigLIP variant from Google Health AI Developer Foundations, consisting of a 400M-parameter vision encoder and 400M-parameter text encoder trained at 448x448 resolution.[36] The training data combines de-identified medical images (chest X-rays, dermatology, ophthalmology, histopathology slides, CT and MRI slices) with natural images and paired text reports, so that the model retains general visual capabilities while gaining domain-specific knowledge.[36] MedSigLIP is distributed through Google Cloud Model Garden and Hugging Face under research-friendly licenses and is recommended for data-efficient classification, zero-shot classification, and semantic image retrieval in medical workflows.[36]
Google's Vertex AI multimodal embeddings API exposes a managed multimodal embedding model (the public multimodalembedding@001 endpoint, producing 1408-dimension vectors) that supports image, video, and text inputs in a shared embedding space.[37] SigLIP-family encoders are also distributed through Vertex AI Model Garden as part of the SigLIP 2, MedSigLIP, and PaliGemma releases, making them deployable as managed endpoints for enterprise customers in addition to the open-weight Hugging Face downloads.[37]
SigLIP and SigLIP 2 vision encoders are integrated into the Hugging Face Transformers library (under the SiglipModel and Siglip2Model classes) and into ecosystems such as timm and OpenCLIP weight conversions, making them straightforward to drop into new vision-language pipelines.[10][15] OpenCLIP includes a SigLIPTask subclass for training SigLIP-style models, and timm hosts converted PyTorch checkpoints under timm/ViT-B-16-SigLIP, timm/ViT-L-16-SigLIP-384, and related identifiers.[38] Image classification and zero-shot retrieval applications using SigLIP have been packaged in Hugging Face Transformers pipelines under the zero-shot-image-classification task. Diffusion-based image generation systems and image-conditioned editing pipelines have experimented with SigLIP image embeddings as an alternative to CLIP image embeddings, though Stable Diffusion 3 and most other major diffusion models continue to rely on CLIP or T5 text encoders rather than SigLIP for text conditioning.[15]
The most direct comparison for SigLIP is OpenAI's CLIP, from which it differs principally in the choice of loss function. Other contrastive vision-language models in the same lineage include OpenCLIP (an open replication of CLIP trained on LAION), EVA-CLIP (a series of scaled CLIP models from BAAI), and DFN (Data Filtering Networks). The following table summarizes some of the distinguishing characteristics:
| Aspect | CLIP[16] | SigLIP[1] | SigLIP 2[2] |
|---|---|---|---|
| Year | 2021 | 2023 | 2025 |
| Authors | OpenAI | Google DeepMind | Google DeepMind |
| Contrastive loss | Symmetric softmax cross-entropy | Pairwise sigmoid with learnable bias | Sigmoid + decoder + self-distillation + curation |
| Loss requires global batch normalization | Yes | No | No |
| Reported optimal batch size | Large (32k+ in practice) | Saturates around 32k; competitive at 8k | 32k (paper standard) |
| Native multilingual training | No (English) | Limited (English-dominant WebLI) | Yes (90/10 English/non-English) |
| Native aspect-ratio variants | No | No | Yes (NaFlex) |
| Public training data | Closed (~400M pairs) | Closed (WebLI) | Closed (WebLI) |
| Largest released vision tower | ViT-L/14 (~300M params) | ViT-So400m, ViT-L | ViT-g (1B) |
| Similarity-matrix memory per device (global batch N, per-device batch b) | O(N^2) | O(b^2) with chunking | O(b^2) with chunking |
| Tokenizer | English-only (49k) | English-only (32k) | Multilingual Gemma (256k) |
| Typical default in 2025 VLMs | Legacy | Common (So400m) | Increasingly common |
The relationship to OpenCLIP is similar to that with CLIP: OpenCLIP uses softmax contrastive loss and public data, while SigLIP uses sigmoid loss and proprietary data.[16] EVA-CLIP scales CLIP-style training to billions of parameters with curated improvements (masked-image-modeling initialization, optimization tricks) but retains the softmax loss.[17] DFN (Data Filtering Networks) is another contemporary contrastive vision-language line that uses CLIP-style loss but emphasizes filtered training data; some DFN checkpoints have been compared head-to-head against SigLIP in vision-language model benchmarks.[14]
The most concise way to see the difference between CLIP and SigLIP is to write both losses on the same batch of logits L_ij = t * s_ij (or t * s_ij + b for SigLIP):
CLIP softmax (image-to-text direction only): -log(exp(L_ii) / sum_j exp(L_ij)).
SigLIP sigmoid: -log(sigma(z_ii * L_ii)) when i = j and -log(sigma(z_ij * L_ij)) when i != j, with z_ii = +1 and z_ij = -1.
The numerator in the softmax depends only on the positive pair, but the denominator depends on the entire row through sum_j exp(L_ij). In SigLIP the loss for cell (i, j) depends only on L_ij, so its gradient with respect to that logit does not depend on any other cell.[1][21] This is what makes the chunked implementation and the small-batch behavior possible.[22]
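The locality claim can be checked numerically on the gradients with respect to the logits. In the sketch below (toy logits, hypothetical helper names), changing one off-diagonal logit alters the softmax gradient of the positive entry in the same row but leaves the sigmoid gradient of that entry untouched.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def sigmoid_cell_grad(logit, is_match):
    """Derivative of -log(sigma(z * logit)) w.r.t. logit, i.e. sigma(logit) - y."""
    y = 1.0 if is_match else 0.0
    return sigmoid(logit) - y

def softmax_row_grad(row, positive_index):
    """Derivative of -log(exp(row[p]) / sum_k exp(row[k])) w.r.t. every row[j]."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    denom = sum(exps)
    return [
        e / denom - (1.0 if j == positive_index else 0.0)
        for j, e in enumerate(exps)
    ]

row = [2.0, 0.5, -1.0]        # logits of image 0 against texts 0, 1, 2
perturbed = [2.0, 0.5, 3.0]   # only the last (negative) logit changes

# Softmax: the gradient on the positive logit shifts because the
# denominator is shared across the whole row.
softmax_before = softmax_row_grad(row, 0)[0]
softmax_after = softmax_row_grad(perturbed, 0)[0]

# Sigmoid: the gradient on the positive logit depends on that logit alone.
sigmoid_before = sigmoid_cell_grad(row[0], True)
sigmoid_after = sigmoid_cell_grad(perturbed[0], True)
```

This covers only the gradient with respect to the logits. The encoder parameters are still shared across cells, so updates to the model itself remain coupled through the network.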
SigLIP's significance has both methodological and practical components.
Methodologically, it demonstrated that softmax normalization is not intrinsic to contrastive learning, and that a per-pair sigmoid loss can match or exceed softmax performance while removing the global-batch synchronization constraint. The result reshaped how researchers think about the relationship between batch size and contrastive learning: where the prevailing wisdom from CLIP was that more negatives are always better, SigLIP showed that this benefit saturates around 32k and can reverse at extreme batch sizes.[1][8] The careful batch-size scan in the paper has become a frequently cited reference in subsequent contrastive learning work, and follow-up papers have analyzed the embedding geometry and theoretical properties of sigmoid contrastive losses.[23]
Practically, SigLIP and SigLIP 2 produced vision encoders that, together with their permissive open-weight release on Hugging Face, became the default image tower for a large fraction of modern open vision-language models. PaliGemma, PaliGemma 2, Gemma 3, LLaVA-OneVision, Idefics2, Idefics3, SmolVLM, and many other VLMs use SigLIP-family checkpoints, and SigLIP 2's NaFlex variants address a long-standing pain point in document and OCR pipelines by removing forced square cropping.[10][12][14][31][32][33][35] The 400M shape-optimized "So400m" backbone in particular has become a de facto standard for the vision side of small and mid-scale VLMs, partly because of SigLIP's reported numbers and partly because Gemma and PaliGemma made the checkpoint extremely convenient to consume.[12][29]
The SigLIP 2 paper also surfaced fairness and multilingual concerns that the original SigLIP did not address explicitly. The reported reduction of representation bias from 35.5% to 7.3% (on the L/16 model at 256-pixel resolution) and the improved performance on Crossmodal-3600 across 36 languages are presented as evidence that contrastive vision-language training can be improved on these axes by changes to data mixture and explicit de-biasing, without sacrificing benchmark accuracy.[11][28]
Several limitations of SigLIP are commonly noted in the literature.
Closed training data. Both SigLIP and SigLIP 2 are trained on WebLI, a Google-internal dataset that has never been released publicly. This prevents external researchers from reproducing or auditing the training pipeline and means that SigLIP cannot be used as a baseline in studies that require fully open training data. Open replications using LAION or DataComp would need to be performed independently and cannot strictly reproduce the published numbers.[4][5]
Limited multilingual coverage in original SigLIP. Although WebLI includes 109 languages, the original SigLIP paper focused on English data, and multilingual retrieval performance was significantly below what SigLIP 2 later achieved.[2][11] Users who needed multilingual coverage before SigLIP 2 had to use the separate mSigLIP variant or other models.
Architectural conservatism. SigLIP retains the dual-encoder architecture from CLIP (a separate vision tower and a separate text tower with no cross-attention between them at training time). This is a deliberate choice in service of fast retrieval, but it caps the kind of fine-grained image-text interaction the model can learn, in contrast to CoCa, BLIP-2, and other models that use cross-attention or generative captioning objectives. SigLIP 2 partly addresses this by adding decoder-based captioning, but the headline checkpoints remain dual-encoder for inference.[2]
Dense and localization features in SigLIP 1. The original SigLIP was trained with a purely global objective (matched-pair classification) and was not optimized for dense per-patch features. The SigLIP 2 paper itself reports that adding decoder objectives and self-distillation losses substantially improves segmentation and depth-estimation transfer numbers (PASCAL mIoU rising from 73.8 to 78.1 on So/14 at 384), implicitly highlighting the dense-feature weakness of the original model.[11]
Batch-size sensitivity. Although the sigmoid loss removes the strict need for very large batches, SigLIP and SigLIP 2 still report their best results at batch sizes in the tens of thousands and with tens of billions of training examples seen. Reaching the published numbers therefore still requires substantial compute, even if it is less than CLIP-scale.[1][11]
Bias and societal impact. While SigLIP 2 substantially reduces a specific measure of representation bias, the underlying training data is still web-crawled and inherits the biases of the open web. The SigLIP 2 paper acknowledges that explicit de-biasing focuses on a specific set of socially sensitive concepts and is not a general solution to fairness concerns in vision-language models.[2]
Fixed-resolution legacy and document distortion. SigLIP 1 imposes a fixed square resolution per checkpoint (e.g., 384 for SigLIP-So400m-patch14-384), which forces aspect-ratio distortion on non-square images. This is detrimental for document images, screenshots, and OCR tasks. SigLIP 2 NaFlex variants and Gemma 3's Pan and Scan workaround were designed to address this limitation, which had previously led many downstream pipelines to add tiling or letter-boxing layers around SigLIP 1.[10][31]
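SigLIP sits at the intersection of contrastive pre-training, the Vision Transformer line, and large multimodal models. Closely related precursors and contemporaries include CLIP, OpenCLIP, EVA-CLIP, and DFN on the contrastive side, and LiT, PaLI, CapPa, and LocCa within the same Google research lineage.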
On the downstream side, SigLIP image encoders feed into LLaVA-family and Gemma/Gemma 2/Gemma 3-based multimodal systems such as PaliGemma, as well as a wide range of multimodal AI research models trained on top of frozen SigLIP visual features.[12][13][14][31]