Zero-Shot Image Classification Models

Updated Jul 16, 2026

Zero-shot image classification models are vision systems that assign images to categories the model has never encountered as labeled training examples. Instead of learning a fixed output head for a closed set of classes, these models compare visual embeddings to embeddings of the candidate label names expressed in natural language, then pick the label whose text embedding is closest in a shared representation space. The dominant family in 2025 builds on contrastive vision-language pretraining: CLIP (OpenAI, 2021), ALIGN (Google, 2021), OpenCLIP (LAION, 2022), SigLIP (Google, 2023), MetaCLIP (Meta, 2023), EVA-CLIP (BAAI, 2023), DFN (Apple, 2023), and SigLIP 2 (Google, 2025). The key insight is that web-scale image-text corpora teach a single image encoder and a single text encoder to project matched pairs near each other in a shared embedding space, so the image encoder can classify any image against any set of class names the user writes at inference time.

Zero-shot classification differs from few-shot, which uses a handful of labeled examples per class, and from supervised classification. A supervised ImageNet classifier is locked to its 1000 categories; switching label sets requires new labeled data and retraining. A CLIP model can be evaluated on any benchmark by writing class names as text prompts, with no parameter updates.

Historical context

The term "zero-shot learning" entered computer vision through attribute-based recognition. Lampert, Nickisch, and Harmeling introduced Direct Attribute Prediction (DAP) and Indirect Attribute Prediction (IAP) at CVPR 2009, training classifiers for human-defined visual attributes (such as "has stripes", "is furry") on seen animal classes and transferring those detectors to unseen classes with known attribute profiles.^[16]

Four years later, Frome et al. at Google published DeViSE at NeurIPS 2013, which replaced hand-crafted attributes with continuous label embeddings produced by a Word2Vec model. DeViSE projected image features into the same space as the class-name word vectors, so unseen classes could be recognized by nearest-neighbor lookup over the word embeddings of their names.^[17] Norouzi et al. extended this with ConSE in 2013, which represents an image as a convex combination of seen-class label embeddings weighted by a softmax classifier's probabilities and then looks up the nearest unseen-class embedding. Together with GloVe embeddings (Pennington et al., 2014), these methods set the template that contrastive language-image models would later scale up.

The critical shift came with the availability of web-scale image-text pairs. Prior attribute-based methods required expensive human annotation of class attributes. Moving to natural language supervision meant any internet caption could become a training signal, enabling dataset sizes orders of magnitude larger.

Contrastive language-image pretraining

The modern era begins with two papers in early 2021. ALIGN (Jia et al. at Google, arXiv 2102.05918) trained a dual-encoder model on 1.8 billion noisy image-alt-text pairs scraped from the web with minimal filtering, showing that scale could compensate for noise.^[2] OpenAI's CLIP (Radford et al., arXiv 2103.00020), announced in January 2021 with its paper posted to arXiv in late February, was trained on 400 million curated image-text pairs called WIT. CLIP used a Vision Transformer or ResNet image encoder paired with a Transformer text encoder, optimizing an InfoNCE contrastive objective that pulled matched pairs together and pushed mismatched pairs apart in a shared embedding space scored by cosine similarity.^[1]

CLIP produced the first widely available zero-shot classifier that matched ResNet-50 ImageNet accuracy without any ImageNet training data, and generalized across more than 30 benchmarks including OCR, geolocation, and fine-grained categories.^[1] Follow-up work refined the recipe: LiT (Zhai et al., arXiv 2111.07991) froze a strong pretrained image encoder and contrastively trained only the text tower, reaching 85.2% zero-shot ImageNet.^[3] Florence at Microsoft (arXiv 2111.11432) used a CoSwin backbone on FLD-900M.^[4] FILIP (arXiv 2111.07783) modified the loss for token-level similarities, and DeCLIP added self-supervised signals to improve data efficiency.

The InfoNCE objective

The core training loss computes a matrix of cosine similarities between all image-text pairs in a batch. For a batch of $N$ pairs, the loss pushes the $N$ matched pairs toward high similarity and the $N(N-1)$ mismatched pairs toward low similarity. It is a cross-entropy loss applied symmetrically: once treating each image as a query over the batch's texts, once treating each text as a query over the batch's images. A temperature parameter, learned during training in CLIP, scales the similarities before the softmax and so controls how sharply peaked the resulting distribution is. At batch size 32,768 (as used in CLIP), each update compares one image against 32,767 negative text candidates and vice versa, so the negatives come from within the batch rather than from a maintained memory bank.^[1]
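The symmetric loss described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not CLIP's training code: the function names, the toy random embeddings, and the fixed temperature of 0.07 are assumptions made for the example, and a real implementation would compute gradients and shard the batch across devices.

```python
import numpy as np

def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(image_emb: np.ndarray, text_emb: np.ndarray,
                       temperature: float = 0.07) -> float:
    """Symmetric cross-entropy over an N x N cosine-similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a matched pair.
    """
    # L2-normalize so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # shape (N, N)
    diag = np.arange(logits.shape[0])
    # Images as queries over texts: softmax across each row.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Texts as queries over images: softmax across each column.
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return float((loss_i2t + loss_t2i) / 2)

# Toy data (hypothetical): 8 pairs of 16-dimensional embeddings.
rng = np.random.default_rng(0)
texts = rng.normal(size=(8, 16))
aligned_images = texts + 0.01 * rng.normal(size=(8, 16))  # nearly matched pairs
shuffled_images = aligned_images[::-1]                    # pairs deliberately mismatched

loss_aligned = symmetric_info_nce(aligned_images, texts)
loss_shuffled = symmetric_info_nce(shuffled_images, texts)
print(loss_aligned, loss_shuffled)
```

With the toy data, the loss for the aligned batch should come out far lower than for the shuffled one, which is the behavior the objective rewards.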

Open-source CLIP variants

While CLIP's weights were released, OpenAI's training data was not. The open-source community responded with OpenCLIP, an open implementation by Ilharco, Wortsman, and colleagues that reproduced and scaled CLIP using the openly released LAION datasets. The LAION-2B-en subset of LAION-5B became the standard pretraining pool. OpenCLIP released ViT-B, ViT-L, ViT-H/14, ViT-g/14, and ViT-bigG/14 checkpoints; the bigG variant reached 80.1% zero-shot ImageNet top-1, surpassing the original CLIP ViT-L/14.^[19] Cherti et al. studied reproducible scaling laws for these models in arXiv 2212.07143.^[12]

In 2023, Meta released MetaCLIP (Xu et al., arXiv 2309.16671), which reverse-engineered CLIP's data curation procedure: balancing a raw CommonCrawl pool over a metadata distribution derived from CLIP's published vocabulary of substring queries. A MetaCLIP ViT-B/16 on 400M curated pairs reached 70.8% zero-shot ImageNet versus CLIP's 68.3%; scaling to 1B pairs reached 72.4%.^[10]

DataComp (Gadre et al., arXiv 2304.14108, NeurIPS 2023) treated data curation itself as the research variable, fixing the training procedure and benchmarking filtering strategies on 38 downstream zero-shot tasks. The best DataComp-1B baseline trained a ViT-L/14 to 79.2% zero-shot ImageNet, outperforming the original CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute.^[21] DataComp showed that smaller, more stringently filtered datasets can produce models that generalize better than larger noisy pools.

Improved contrastive objectives

Several papers from 2022 onward refined the contrastive objective itself. CoCa (Yu et al., arXiv 2205.01917) combined a contrastive loss with an autoregressive captioning loss on a multimodal decoder; the largest CoCa model (2.1B parameters) reached 86.3% zero-shot ImageNet.^[5] BLIP (Li et al. at Salesforce, arXiv 2201.12086) introduced caption bootstrapping with a synthetic captioner and a noise filter^[6]; BLIP-2 (arXiv 2301.12597) added a Q-Former that bridged a frozen image encoder to a frozen large language model, outperforming Flamingo on zero-shot VQAv2 with 54 times fewer trainable parameters.^[7]

SigLIP (Zhai et al. at Google, arXiv 2303.15343) replaced the softmax-normalized contrastive loss with a pairwise sigmoid loss applied independently to each image-text pair. Because each pair is scored as its own binary classification, the sigmoid formulation needs no global normalization over the batch similarity matrix, which lowered memory cost, performed better at smaller batch sizes, and matched or exceeded softmax CLIP at scale.^[8] The SigLIP-SO400m variant (400M parameters) became a default vision encoder for many open vision-language models. SigLIP 2 (Tschannen et al., arXiv 2502.14786, February 2025) extended the recipe with captioning losses, self-distillation, masked prediction, and online data curation. SigLIP 2 ships at four sizes (B/86M, L/303M, So400m/400M, g/1B), supports 109 languages, and outperforms the original SigLIP at every scale; the B/16 variant reached 79.1% zero-shot ImageNet at 256px, up from SigLIP's 76.7%.^[14] For multilingual retrieval on XM3600, SigLIP 2 improved performance from 22.5% to 40.7%.^[14]
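To make the contrast with the softmax loss concrete, the following NumPy sketch scores every image-text pair in a batch as an independent binary decision, with label +1 on the diagonal (matched) and -1 elsewhere. It is a simplified illustration rather than the SigLIP implementation: the function name and toy data are hypothetical, and the scale of 10 and bias of -10 are fixed here for simplicity, whereas in training both would be learnable parameters.

```python
import numpy as np

def pairwise_sigmoid_loss(image_emb: np.ndarray, text_emb: np.ndarray,
                          scale: float = 10.0, bias: float = -10.0) -> float:
    """Sum of independent binary log-losses over all N x N pairs, divided by N."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = scale * (image_emb @ text_emb.T) + bias   # shape (N, N)
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - 1.0                     # +1 matched, -1 mismatched
    # log(sigmoid(z)) = -log(1 + exp(-z)), computed stably with logaddexp.
    log_likelihood = -np.logaddexp(0.0, -labels * logits)
    # No softmax: each entry contributes on its own, with no batch-wide normalizer.
    return float(-log_likelihood.sum() / n)

# Toy data (hypothetical): 8 pairs of 16-dimensional embeddings.
rng = np.random.default_rng(0)
texts = rng.normal(size=(8, 16))
aligned_images = texts + 0.01 * rng.normal(size=(8, 16))
shuffled_images = aligned_images[::-1]

sig_aligned = pairwise_sigmoid_loss(aligned_images, texts)
sig_shuffled = pairwise_sigmoid_loss(shuffled_images, texts)
print(sig_aligned, sig_shuffled)
```

Since no term depends on the rest of the batch, the loss can be accumulated chunk by chunk across devices, which is one way to read the memory savings mentioned above.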

EVA-CLIP (Sun et al. at BAAI, arXiv 2303.15389) brought masked-image-modeling pretraining and LAMB optimization to CLIP; the 5B-parameter EVA-02-CLIP-E/14+ reached 82.0% zero-shot ImageNet with 9B seen samples^[9], and follow-up work scaled to 18B parameters (arXiv 2402.04252).^[18]

Larger and curated-data models

Apple's Data Filtering Networks paper (Fang et al., arXiv 2309.17425) showed that a small dedicated network trained to score image-text pair quality could produce a 5B-image curated pool from 43B uncurated pairs. A ViT-H trained on DFN-5B reached 84.4% zero-shot ImageNet, beating LAION-2B, DataComp-1B, and OpenAI's WIT.^[11] Apple released DFN-2B and DFN-5B CLIP variants.^[20] AIMv2 (Fini et al., arXiv 2411.14402, November 2024) departed from contrastive training entirely, pairing a vision encoder with a multimodal decoder that autoregressively predicted image patches and text tokens; AIMv2-3B reached 89.5% ImageNet linear probe accuracy.^[15] DINOv2 (Oquab et al., arXiv 2304.07193) trained on 142M curated images without text supervision but produced features that, paired with a text head, support strong open-vocabulary classification.^[13]

How CLIP-style zero-shot classification works

Given a dual-encoder model, classifying an image $x$ against $K$ candidate labels works as follows. The image encoder $f_v$ maps $x$ to a unit-normalized embedding $v$. For each class name $c_k$, the text encoder maps a prompt template such as "a photo of a $c_k$" to a unit-normalized embedding $t_k$. The predicted class is $\hat{y} = \arg\max_k \langle v, t_k \rangle$, the cosine similarity argmax. No gradient updates or labeled images for candidate classes are required.^[1]

This mechanism converts the text encoder into a classifier-weight generator. Rather than having a fixed matrix of learned weight vectors, the model generates classifier weights on the fly from natural language. The same image encoder can therefore be used for ImageNet-1k (1000 classes), a custom product taxonomy (500 classes), or an unusual domain-specific ontology, all without any retraining.
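The procedure can be sketched as follows. The example uses small hand-written vectors as stand-ins for real encoder outputs, so the embeddings, the class names, and the function names are all hypothetical; only the normalize-then-argmax logic reflects the mechanism described above.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(image_emb: np.ndarray, text_embs: np.ndarray, class_names: list):
    """Return the class whose text embedding has the highest cosine similarity."""
    v = l2_normalize(image_emb)        # image embedding, shape (D,)
    t = l2_normalize(text_embs)        # one row per class prompt, shape (K, D)
    sims = t @ v                       # cosine similarities, shape (K,)
    return class_names[int(np.argmax(sims))], sims

# Stand-ins for text-encoder outputs of "a photo of a {class}" (made-up values).
class_names = ["cat", "dog", "car"]
text_embs = np.array([
    [1.0, 0.1, 0.0],
    [0.1, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
# Stand-in for the image-encoder output of one image (made-up values).
image_emb = np.array([0.2, 0.9, 0.1])

label, sims = zero_shot_classify(image_emb, text_embs, class_names)
print(label, sims)
```

Changing the label set only means supplying a different `class_names` list and the matching rows of `text_embs`; the image embedding and the function stay the same.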

Prompt engineering and ensembling

Prompt wording matters substantially. The CLIP paper showed that "a photo of a {class}" lifted ImageNet zero-shot accuracy by about 1.3 percentage points over the bare class name, because raw class names rarely appear alone in web captions.^[1] Prompt ensembling, where multiple templates per class are encoded and their text embeddings averaged before the argmax, gave a further 3.5-point gain with 80 templates.^[1] Class-name disambiguation matters too: "crane" can denote either the bird or the construction machine, so prompts often add a hint ("a photo of a crane, a type of bird").
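Prompt ensembling amounts to building one classifier weight per class from several prompts. The sketch below assumes a hypothetical `toy_text_encoder` that stands in for a real text tower (it simply sums a fixed pseudo-random vector per word), and it uses three made-up templates instead of the 80 used in the CLIP paper. The averaging and re-normalization steps are the part that carries over.

```python
import zlib
import numpy as np

DIM = 32  # embedding width of the toy encoder (arbitrary)

def toy_text_encoder(prompt: str) -> np.ndarray:
    """Hypothetical stand-in for a text tower: deterministic bag-of-words embedding."""
    vec = np.zeros(DIM)
    for word in prompt.lower().split():
        seed = zlib.crc32(word.encode("utf-8"))      # stable across runs
        vec += np.random.default_rng(seed).normal(size=DIM)
    return vec / np.linalg.norm(vec)

def ensemble_class_embeddings(class_names, templates, encode=toy_text_encoder) -> np.ndarray:
    """Average the unit-norm prompt embeddings per class, then re-normalize."""
    weights = []
    for name in class_names:
        embs = np.stack([encode(t.format(name)) for t in templates])
        mean = embs.mean(axis=0)
        weights.append(mean / np.linalg.norm(mean))
    return np.stack(weights)                         # shape (num_classes, DIM)

# Illustrative templates; "{}" marks where the class name is inserted.
templates = ["a photo of a {}", "a blurry photo of a {}", "a drawing of a {}"]
# Disambiguated class names, following the "crane" example above.
class_names = ["crane, a type of bird", "crane, a construction machine"]

classifier = ensemble_class_embeddings(class_names, templates)
print(classifier.shape)
```

The resulting matrix is used exactly like single-prompt text embeddings: each image embedding is compared against its rows by cosine similarity, so ensembling adds no cost per image at inference time once the matrix is cached.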

Several follow-up methods improve beyond hand-crafted templates. CuPL (Pratt et al., 2023) used GPT-3 to generate natural language descriptions of each class, then averaged their text embeddings.^[25] DCLIP (Menon and Vondrick, 2022) prompted a language model for visual discriminative features per class and aggregated them. WaffleCLIP (Roth et al., ICCV 2023, arXiv 2306.07282) showed, perhaps counterintuitively, that appending random words and broad concepts alongside a class name achieves similar gains to LLM-generated descriptions; this suggests that extra tokens aid calibration and context more than semantic content per se.^[26]

Learned prompt methods go a step further. CoOp (Zhou et al., arXiv 2109.01134) replaced the hand-crafted prefix with a set of learnable continuous vectors optimized on a small labeled set of the target domain classes.^[23] CoCoOp (Zhou et al., CVPR 2022, arXiv 2203.05557) extended this with an instance-conditioned token generated by a lightweight meta-network, improving generalization from base classes to unseen classes by over 4 percentage points on average.^[24] These methods blur the line between zero-shot and few-shot, since they require a small labeled set to tune the prompt, but they illustrate how the text-encoder-as-classifier design supports flexible adaptation.

Notable models

Model	Release	Organization	Parameters	Training data	Zero-shot IN-1k
CLIP ViT-L/14^[1]	Jan 2021	OpenAI	428M	WIT 400M	75.5%
ALIGN EfficientNet-L2^[2]	Feb 2021	Google	820M	1.8B noisy pairs	76.4%
LiT ViT-g/14^[3]	Nov 2021	Google	1.0B+	4B pairs	85.2%
Florence^[4]	Nov 2021	Microsoft	893M	FLD-900M	83.7%
OpenCLIP ViT-H/14^[19]	Sep 2022	LAION	986M	LAION-2B-en	78.0%
OpenCLIP ViT-bigG/14^[19]	Mar 2023	LAION	2.5B	LAION-2B-en	80.1%
CoCa^[5]	May 2022	Google	2.1B	JFT-3B + ALIGN	86.3%
BLIP-2 ViT-g^[7]	Jan 2023	Salesforce	1.2B+Q	129M	LLM coupled
DataComp-1B ViT-L/14^[21]	Apr 2023	Multi-institution	428M	DataComp-1B	79.2%
EVA-02-CLIP-E/14+^[9]	Mar 2023	BAAI	5.0B	LAION-2B+COYO	82.0%
SigLIP SO400m^[8]	Mar 2023	Google	400M	WebLI	83.2%
MetaCLIP ViT-H/14^[10]	Sep 2023	Meta	986M	CC 2.5B	80.5%
DFN-5B ViT-H/14^[11]	Nov 2023	Apple	986M	DFN-5B	84.4%
EVA-CLIP-18B^[18]	Feb 2024	BAAI	18.1B	Merged-2B+	83.0%
AIMv2-3B^[15]	Nov 2024	Apple	3.0B	DFN-2B	89.5% (linear)
SigLIP 2 B/16^[14]	Feb 2025	Google	86M	WebLI ml	79.1%
SigLIP 2 g^[14]	Feb 2025	Google	1.0B	WebLI ml	85.0%+

Numbers reflect headline zero-shot ImageNet-1k top-1 accuracy reported by authors; ensembling, resolutions, and prompt sets vary. AIMv2 reports linear probe accuracy rather than zero-shot.

Benchmarks

A zero-shot classifier's value depends on how it generalizes across class taxonomies and distribution shifts. The standard suite used by CLIP, OpenCLIP, SigLIP, and SigLIP 2 covers natural object recognition, fine-grained categorization, satellite imagery, and natural adversarial examples.

Benchmark	Year	Classes	Focus
ImageNet-1k	2015	1000	General object recognition
ImageNet-V2	2019	1000	Distribution shift (resampled test set)
ImageNet-A	2019	200	Naturally adversarial images
ImageNet-R	2020	200	Artistic renditions, sketches, sculptures
ImageNet-Sketch	2019	1000	Black-and-white sketch domain
ObjectNet	2019	313	Crowdsourced novel viewpoints
CIFAR-100	2009	100	Low-resolution objects
Oxford Flowers 102	2008	102	Fine-grained flowers
Food-101	2014	101	Dish recognition
Stanford Cars	2013	196	Vehicle make and model
Country211	2021	211	Geolocation by country
EuroSAT	2019	10	Satellite land cover
RESISC45	2017	45	Aerial remote sensing
ELEVATER	2022	20 tasks	Classification suite with external knowledge

Distribution-shift gaps across IN-V2, IN-A, IN-R, IN-Sketch, and ObjectNet are central to claims about CLIP-family generalization: supervised ImageNet classifiers can lose 40 points moving from IN-1k to these shifted sets, while CLIP-style models typically lose under 10 points. The original CLIP paper showed that all evaluated CLIP models improved "effective robustness" substantially, reducing the gap between in-distribution and out-of-distribution accuracy by up to 75% compared to supervised ResNet baselines with equivalent ImageNet performance.^[1]

ELEVATER

The ELEVATER benchmark (Li et al., NeurIPS 2022, arXiv 2204.08790) provides a structured testbed specifically for language-augmented visual models. It covers 20 image classification datasets spanning diverse domains including natural objects, textures, satellite imagery, and specialized scientific categories, alongside 35 object detection datasets.^[22] Each dataset is augmented with external knowledge from thesauri, dictionaries, and GPT-3-generated descriptions. ELEVATER measures sample efficiency (zero-shot, few-shot, and full-shot) as well as parameter efficiency (linear probing versus full fine-tuning), making it one of the broader standardized evaluations for zero-shot image classifiers.

Extending zero-shot classification to detection and segmentation

Zero-shot classification is the foundational capability, but the same vision-language alignment transfers to detection and segmentation tasks. OWL-ViT (Minderer et al., ECCV 2022) removed the final token pooling from a CLIP vision encoder and attached lightweight classification and box regression heads to each patch token, enabling open-vocabulary object detection with natural language queries.^[27] OWLv2 (NeurIPS 2023) scaled this to self-training on web images, reaching strong performance on LVIS and COCO with arbitrary class names.^[28] GLIP (Li et al., 2022) framed detection as phrase grounding and pretrained on combined detection, grounding, and captioning data. Grounded-SAM and subsequent work combine text-prompted open-vocabulary detectors such as Grounding DINO with the Segment Anything Model to produce open-vocabulary segmentation. These extensions are covered in dedicated articles; see the See also section.

Modern landscape (2024-2025)

Three trends shape the 2024-2025 wave of zero-shot vision. First, SigLIP 2 (February 2025) outperforms the original SigLIP at every size class while improving multilingual coverage across 109 languages, and SigLIP-family encoders have become a common vision backbone in open-weight vision-language models such as LLaVA-OneVision and PaliGemma.^[14] Second, AIMv2 (Apple, late 2024) shows that multimodal autoregressive pretraining can match or beat contrastive training.^[15] Third, data curation has become the dominant lever: MetaCLIP, DFN, and SigLIP 2's online curation all show that quality and balance outweigh sheer scale beyond a point.

Many large vision-language models pair a pretrained vision tower, often a SigLIP-style contrastive encoder, with a language model that ingests visual tokens. Synthetic captions and bootstrapped data are now standard: the BLIP captioner-and-filter cycle, the DFN learned filter, and SigLIP 2's online curation all filter or re-caption a noisy web pool before or during the main training run. For more on how zero-shot classification models function as components inside generative vision-language systems, see the dedicated image-to-text models article.

Applications

Zero-shot classifiers are deployed wherever the label set is open or changes frequently. Retail systems index catalog images against text descriptions written by merchandisers without commissioning new labeled training data per season or product line. Content moderation pipelines score user-uploaded images against custom policy categories ("violent imagery", "hate symbol") that operators update in plain text, so policy changes do not require model retraining. Medical triage prototypes flag images with rare findings whose labeled training data is scarce, though clinical deployment requires careful validation. Work such as MedCLIP and BiomedCLIP adapts contrastive pretraining to medical image-text data, including radiology reports and biomedical figure captions. Research on fairness-aware CLIP variants includes AdFair-CLIP, which targets equitable chest X-ray diagnosis, and FairerCLIP, which debiases CLIP's zero-shot predictions.^[29]

Image search engines combine a CLIP text tower for query encoding with cosine similarity retrieval over precomputed image embeddings, supporting natural-language search such as "a black labrador sleeping on a red couch". Dataset curation tools such as those used for LAION-5B and the DFN pipelines use CLIP scores to filter and deduplicate web pairs. Automated tagging pipelines that previously required a custom CNN per taxonomy now use a single CLIP or SigLIP backbone and rotate the prompt set.
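The retrieval pattern described above reduces to a matrix-vector product over a precomputed index followed by a top-k selection. In this sketch the index is filled with random unit vectors and the query is a perturbed copy of one indexed vector, standing in for a text-tower output that matches that image. All names and values are illustrative assumptions, and a production system would usually swap the exhaustive scan for an approximate nearest-neighbor index.

```python
import numpy as np

def top_k_images(query_emb: np.ndarray, image_index: np.ndarray, k: int = 3):
    """Rank precomputed unit-norm image embeddings by cosine similarity to a query."""
    q = query_emb / np.linalg.norm(query_emb)
    scores = image_index @ q                 # one cosine similarity per indexed image
    order = np.argsort(-scores)[:k]          # indices of the k highest scores
    return order, scores[order]

# Hypothetical index: 1000 images, 64-dimensional embeddings, normalized once offline.
rng = np.random.default_rng(0)
image_index = rng.normal(size=(1000, 64))
image_index /= np.linalg.norm(image_index, axis=1, keepdims=True)

# Stand-in for the text embedding of a query that describes image 42.
query_emb = image_index[42] + 0.05 * rng.normal(size=64)

ids, scores = top_k_images(query_emb, image_index, k=3)
print(ids, scores)
```

Because the image embeddings are computed once and stored, only the query text passes through an encoder at search time.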

Accessibility tools benefit because a single zero-shot model can generate category labels across any domain without domain-specific retraining. Security and surveillance applications employ CLIP-style classifiers for scene understanding with flexible, operator-defined event categories. Geospatial analysis uses EuroSAT and RESISC45 as zero-shot benchmarks, and practitioners have deployed CLIP for satellite image tagging tasks that would previously have required specialized annotated datasets.

Limitations

Despite their flexibility, zero-shot classifiers have persistent weaknesses. They are sensitive to prompt wording: switching from "a photo of a {class}" to "a {class}" can move accuracy by several points, and class-name ambiguity often requires hand-crafted context. Probability calibration is poor, complicating threshold-based filtering. Fine-grained categories such as bird species, plant cultivars, and specific car trims remain difficult because web captions rarely use exact scientific or model names. Counting and spatial relations are weak across the CLIP family, since the contrastive objective rewards holistic image-text matching rather than compositional structure.

Web-scale training data inherits its biases. CLIP audits have documented disparate misclassification rates by gender and race, including higher misclassification of photographs of Black individuals into non-human categories and gendered associations with occupations and crime. LAION-400M-based models have been shown to disproportionately associate Muslim, Black, and immigrant identities with toxic prompts. Research on multilingual CLIP checkpoints has further revealed that language resource level, grammatical gender in source languages, and architectural choices jointly shape bias patterns across ten typologically diverse languages.^[30] Scaling alone does not guarantee fairness; larger models can amplify these biases when the underlying data distribution is imbalanced. SigLIP 2 explicitly added de-biasing steps to its data mix.^[14]

Evaluation reliability is itself an open problem: small changes in class taxonomy, prompt template, or preprocessing can shift reported numbers by several points. Strongly supervised baselines still win on narrow tasks with clean labels such as medical imaging panels and industrial defect detection; zero-shot classifiers excel at breadth rather than as drop-in replacements in high-stakes settings. The requirement for a text encoder at inference time adds memory and latency compared to a single-model classifier, though this cost is often amortized by precomputing text embeddings for a fixed label set.

References

1. Radford, A., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision." arXiv:2103.00020. https://arxiv.org/abs/2103.00020
2. Jia, C., et al. (2021). "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision." arXiv:2102.05918. https://arxiv.org/abs/2102.05918
3. Zhai, X., et al. (2021). "LiT: Zero-Shot Transfer with Locked-image text Tuning." arXiv:2111.07991. https://arxiv.org/abs/2111.07991
4. Yuan, L., et al. (2021). "Florence: A New Foundation Model for Computer Vision." arXiv:2111.11432. https://arxiv.org/abs/2111.11432
5. Yu, J., et al. (2022). "CoCa: Contrastive Captioners are Image-Text Foundation Models." arXiv:2205.01917. https://arxiv.org/abs/2205.01917
6. Li, J., et al. (2022). "BLIP: Bootstrapping Language-Image Pre-training." arXiv:2201.12086. https://arxiv.org/abs/2201.12086
7. Li, J., et al. (2023). "BLIP-2." arXiv:2301.12597. https://arxiv.org/abs/2301.12597
8. Zhai, X., et al. (2023). "Sigmoid Loss for Language Image Pre-Training." arXiv:2303.15343. https://arxiv.org/abs/2303.15343
9. Sun, Q., et al. (2023). "EVA-CLIP: Improved Training Techniques for CLIP at Scale." arXiv:2303.15389. https://arxiv.org/abs/2303.15389
10. Xu, H., et al. (2023). "Demystifying CLIP Data" (MetaCLIP). arXiv:2309.16671. https://arxiv.org/abs/2309.16671
11. Fang, A., et al. (2023). "Data Filtering Networks." arXiv:2309.17425. https://arxiv.org/abs/2309.17425
12. Cherti, M., et al. (2023). "Reproducible scaling laws for contrastive language-image learning." arXiv:2212.07143. https://arxiv.org/abs/2212.07143
13. Oquab, M., et al. (2023). "DINOv2." arXiv:2304.07193. https://arxiv.org/abs/2304.07193
14. Tschannen, M., et al. (2025). "SigLIP 2." arXiv:2502.14786. https://arxiv.org/abs/2502.14786
15. Fini, E., et al. (2024). "Multimodal Autoregressive Pre-training of Large Vision Encoders" (AIMv2). arXiv:2411.14402. https://arxiv.org/abs/2411.14402
16. Lampert, C. H., Nickisch, H., Harmeling, S. (2009). "Learning to detect unseen object classes by between-class attribute transfer." CVPR 2009. https://ieeexplore.ieee.org/document/5206594
17. Frome, A., et al. (2013). "DeViSE: A Deep Visual-Semantic Embedding Model." NeurIPS 2013. https://papers.nips.cc/paper/2013/hash/7cce53cf90577442771720a370c3c723-Abstract.html
18. Sun, Q., et al. (2024). "EVA-CLIP-18B." arXiv:2402.04252. https://arxiv.org/abs/2402.04252
19. OpenCLIP. https://github.com/mlfoundations/open_clip
20. Apple ML Research. "Data Filtering Networks." https://machinelearning.apple.com/research/data-filtering-networks
21. Gadre, S. Y., et al. (2023). "DataComp: In search of the next generation of multimodal datasets." arXiv:2304.14108. https://arxiv.org/abs/2304.14108
22. Li, C., et al. (2022). "ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models." NeurIPS 2022. arXiv:2204.08790. https://arxiv.org/abs/2204.08790
23. Zhou, K., Yang, J., Loy, C. C., Liu, Z. (2022). "Learning to Prompt for Vision-Language Models" (CoOp). arXiv:2109.01134. https://arxiv.org/abs/2109.01134
24. Zhou, K., et al. (2022). "Conditional Prompt Learning for Vision-Language Models" (CoCoOp). CVPR 2022. arXiv:2203.05557. https://arxiv.org/abs/2203.05557
25. Pratt, S., et al. (2023). "What Does a Platypus Look Like? Generating Customized Prompts for Zero-Shot Image Classification" (CuPL). ICCV 2023. https://arxiv.org/abs/2209.03320
26. Roth, K., et al. (2023). "Waffling around for Performance: Visual Classification with Random Words and Broad Concepts" (WaffleCLIP). ICCV 2023. arXiv:2306.07282. https://arxiv.org/abs/2306.07282
27. Minderer, M., et al. (2022). "Simple Open-Vocabulary Object Detection with Vision Transformers" (OWL-ViT). ECCV 2022. https://arxiv.org/abs/2205.06230
28. Minderer, M., et al. (2023). "Scaling Open-Vocabulary Object Detection" (OWLv2). NeurIPS 2023. https://arxiv.org/abs/2306.09683
29. Dehdashtian, S., et al. (2024). "FairerCLIP: Debiasing CLIP's Zero-Shot Predictions in RKHS." ICLR 2024. https://arxiv.org/abs/2305.13673
30. Navigli, R., et al. (2025). "Breaking Language Barriers or Reinforcing Bias? Gender and Racial Disparities in Multilingual Contrastive Vision Language Models." arXiv:2505.14160. https://arxiv.org/abs/2505.14160

Related Articles

Image-to-Image Models

Image Classification Models

Segment Anything Model and Dataset (SAM and SA-1B)

Unconditional Image Generation Models

Video Classification Models

Visual Question Answering Models
