Zero-Shot Image Classification Models
Last reviewed
May 31, 2026
Sources
30 citations
Review status
Source-backed
Revision
v3 · 3,724 words
Zero-shot image classification models are vision systems that assign images to categories the model has never encountered as labeled training examples. Instead of learning a fixed output head for a closed set of classes, these models compare visual embeddings to embeddings of the candidate label names expressed in natural language, then pick the label whose text embedding is closest in a shared representation space. The dominant family in 2025 builds on contrastive vision-language pretraining: CLIP (OpenAI, 2021), ALIGN (Google, 2021), OpenCLIP (LAION, 2022), SigLIP (Google, 2023), MetaCLIP (Meta, 2023), EVA-CLIP (BAAI, 2023), DFN (Apple, 2023), and SigLIP 2 (Google, 2025). The key insight is that web-scale image-text corpora teach a single image encoder and a single text encoder to project matched pairs near each other in a shared embedding space, so the image encoder can classify any image against any set of class names the user writes at inference time.
Zero-shot classification differs from few-shot, which uses a handful of labeled examples per class, and from supervised classification. A supervised ImageNet classifier is locked to its 1000 categories; switching label sets requires new labeled data and retraining. A CLIP model can be evaluated on any benchmark by writing class names as text prompts, with no parameter updates.
The term "zero-shot learning" entered computer vision through attribute-based recognition. Lampert, Nickisch, and Harmeling introduced Direct Attribute Prediction (DAP) and Indirect Attribute Prediction (IAP) at CVPR 2009, training classifiers for human-defined visual attributes (such as "has stripes", "is furry") on seen animal classes and transferring those detectors to unseen classes with known attribute profiles.
Four years later, Frome et al. at Google published DeViSE at NeurIPS 2013, which replaced hand-crafted attributes with continuous label embeddings produced by a Word2Vec model. DeViSE projected image features into the same space as the class-name word vectors, so unseen classes could be recognized by nearest-neighbor lookup over the word embeddings of their names. Norouzi et al. extended this with ConSE in 2013, combining a softmax classifier with semantic embeddings to interpolate label vectors for unseen classes. Together with GloVe embeddings (Pennington et al., 2014), these methods set the template that contrastive language-image models would later scale up.
The critical shift came with the availability of web-scale image-text pairs. Prior attribute-based methods required expensive human annotation of class attributes. Moving to natural language supervision meant any internet caption could become a training signal, enabling dataset sizes orders of magnitude larger.
The modern era begins with two papers in early 2021. ALIGN (Jia et al. at Google, arXiv 2102.05918) trained a dual-encoder model on 1.8 billion noisy image-alt-text pairs scraped from the web with minimal filtering, showing that scale could compensate for noise. OpenAI's CLIP (Radford et al., arXiv 2103.00020), announced in January 2021 and posted to arXiv in late February, was trained on 400 million curated image-text pairs called WIT. CLIP used a Vision Transformer or ResNet image encoder paired with a Transformer text encoder, optimizing an InfoNCE contrastive learning objective that pulled matched pairs together and pushed mismatched pairs apart in a shared cosine space.
CLIP produced the first widely available zero-shot classifier that matched ResNet-50 ImageNet accuracy without any ImageNet training data, and generalized across more than 30 benchmarks including OCR, geolocation, and fine-grained categories. Follow-up work refined the recipe: LiT (Zhai et al., arXiv 2111.07991) froze a strong pretrained image encoder and contrastively trained only the text tower, reaching 85.2% zero-shot ImageNet. Florence at Microsoft (arXiv 2111.11432) used a CoSwin backbone on FLD-900M. FILIP (arXiv 2111.07783) modified the loss for token-level similarities, and DeCLIP added self-supervised signals to improve data efficiency.
The core training loss computes a matrix of cosine similarities between all image-text pairs in a batch. For a batch of N pairs, the loss pushes the N matched pairs to high similarity while pushing the N(N-1) mismatched pairs to low similarity. This is a cross-entropy loss applied symmetrically: once treating images as queries over text, once treating text as queries over images. The temperature parameter controls how sharply peaked the similarity distribution must be. At batch size 32,768 (as used in CLIP), each update compares every image against 32,767 negative captions and every caption against 32,767 negative images, so the negatives come from within the batch rather than from a separately maintained memory bank.
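The symmetric loss described above can be sketched in a few lines. This is a dependency-free illustration that uses plain Python lists instead of a tensor library, and it is not CLIP's training code. The function names are hypothetical, and the temperature is fixed at 0.07 as an assumption, whereas CLIP learns it during training.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an N x N cosine-similarity matrix.

    Row i of image_embs and row i of text_embs are assumed to be a matched pair.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(t) for t in text_embs]
    n = len(images)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text: each image should rank its own caption first.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: each caption should rank its own image first.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch of two pairs: perfectly aligned versus deliberately mismatched.
toy = [[1.0, 0.0], [0.0, 1.0]]
matched_loss = clip_contrastive_loss(toy, toy)
shuffled_loss = clip_contrastive_loss(toy, toy[::-1])
```

On this toy batch the aligned pairing gives a loss near zero, while swapping the captions gives a large loss, which is the behavior the objective is designed to produce.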
While CLIP's weights were released, OpenAI's training data was not. The open-source community responded with OpenCLIP, an open implementation by Ilharco, Wortsman, and colleagues that reproduced and scaled CLIP using the openly released LAION datasets. The LAION-2B-en subset of LAION-5B became the standard pretraining pool. OpenCLIP released ViT-B, ViT-L, ViT-H/14, ViT-g/14, and ViT-bigG/14 checkpoints; the bigG variant reached 80.1% zero-shot ImageNet top-1, surpassing the original CLIP ViT-L/14. Cherti et al. studied reproducible scaling laws for these models in arXiv 2212.07143.
In 2023, Meta released MetaCLIP (Xu et al., arXiv 2309.16671), which reverse-engineered CLIP's data curation procedure: balancing a raw CommonCrawl pool over a metadata distribution derived from CLIP's published vocabulary of substring queries. A MetaCLIP ViT-B/16 on 400M curated pairs reached 70.8% zero-shot ImageNet versus CLIP's 68.3%; scaling to 1B pairs reached 72.4%.
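The balancing idea can be illustrated with a small sketch, loosely modeled on the procedure described in the MetaCLIP paper rather than taken from the authors' code. The function name `balance_pool`, the `per_entry_cap` parameter, and the toy captions are all hypothetical. Each caption is substring-matched against the metadata entries; captions matching rare entries are always kept, and captions matching over-represented entries are kept with probability equal to the cap divided by that entry's match count.

```python
import random
from collections import Counter

def balance_pool(captions, metadata, per_entry_cap, seed=0):
    """Subsample captions so that no metadata entry dominates the pool."""
    rng = random.Random(seed)
    # Substring-match every caption against every metadata entry.
    matches = [
        [entry for entry in metadata if entry in text.lower()]
        for text in captions
    ]
    counts = Counter(entry for matched in matches for entry in matched)
    kept = []
    for text, matched in zip(captions, matches):
        # Tail entries (count <= cap) keep all captions; head entries keep
        # each caption with probability cap / count. Captions that match
        # no entry at all are dropped.
        if any(
            rng.random() < per_entry_cap / max(counts[entry], per_entry_cap)
            for entry in matched
        ):
            kept.append(text)
    return kept

# Toy pool: one over-represented concept, one rare concept, one unmatched caption.
captions = ["a photo of a dog"] * 50 + ["an axolotl in a tank", "stock image 123"]
metadata = ["dog", "axolotl"]
balanced = balance_pool(captions, metadata, per_entry_cap=5)
```

In this toy pool the rare "axolotl" caption survives, the unmatched caption is dropped, and the 50 "dog" captions are thinned toward the cap.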
DataComp (Gadre et al., arXiv 2304.14108, NeurIPS 2023) treated data curation itself as the research variable, fixing the training procedure and benchmarking filtering strategies on 38 downstream zero-shot tasks. The best DataComp-1B baseline trained a ViT-L/14 to 79.2% zero-shot ImageNet, outperforming the original CLIP ViT-L/14 by 3.7 percentage points using identical compute. DataComp showed that smaller, more stringently filtered datasets can produce models that generalize better than larger noisy pools.
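One simple filtering strategy of the kind such benchmarks compare is to score each image-text pair with an existing model and keep only the best-scoring fraction. The sketch below assumes the similarity scores have already been computed. The function name, the `keep_fraction` value, and the toy data are illustrative assumptions and do not reproduce any specific DataComp baseline.

```python
def filter_by_score(pairs, scores, keep_fraction=0.3):
    """Keep the highest-scoring fraction of (image, caption) pairs.

    `scores` holds one precomputed image-text similarity per pair.
    The original ordering of the surviving pairs is preserved.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, round(len(pairs) * keep_fraction))
    ranked = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_keep])
    return [pairs[i] for i in keep]

# Toy pool with made-up similarity scores.
pairs = [
    ("img0.jpg", "a dog"),
    ("img1.jpg", "click here"),
    ("img2.jpg", "red car"),
    ("img3.jpg", "IMG_0042"),
    ("img4.jpg", "a cat on a sofa"),
]
scores = [0.31, 0.08, 0.27, 0.05, 0.33]
kept_pairs = filter_by_score(pairs, scores, keep_fraction=0.4)
```

With these made-up scores, the two pairs with descriptive captions are kept and the pairs with uninformative alt text are discarded.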
Several 2022-2025 papers refined the contrastive objective itself. CoCa (Yu et al., arXiv 2205.01917) combined a contrastive loss with an autoregressive captioning loss on a multimodal decoder; its largest model (2.1B parameters) reached 86.3% zero-shot ImageNet. BLIP (Li et al. at Salesforce, arXiv 2201.12086) introduced caption bootstrapping with a synthetic captioner and a noise filter; BLIP-2 (arXiv 2301.12597) added a Q-Former that bridged a frozen image encoder to a frozen large language model, outperforming Flamingo on zero-shot VQAv2 with 54 times fewer trainable parameters.
SigLIP (Zhai et al. at Google, arXiv 2303.15343) replaced the softmax-normalized contrastive loss with a pairwise sigmoid loss applied independently to each pair. The sigmoid formulation removed the need for a global softmax normalization over the batch similarity matrix, lowered memory cost, performed better at smaller batch sizes, and matched or exceeded softmax CLIP at scale. The SigLIP-So400m variant (400M parameters) became a default vision encoder for many open vision-language models. SigLIP 2 (Tschannen et al., arXiv 2502.14786, February 2025) extended the recipe with captioning losses, self-distillation, masked prediction, and online data curation. SigLIP 2 ships at four sizes (B/86M, L/303M, So400m/400M, g/1B), supports 109 languages, and outperforms the original SigLIP at every scale; the B/16 variant reached 79.1% zero-shot ImageNet at 256px, up from SigLIP's 76.7%. For multilingual retrieval on XM3600, SigLIP 2 improved performance from 22.5% to 40.7%.
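To make the contrast with the softmax loss concrete, here is a minimal sketch of a pairwise sigmoid loss in plain Python. It treats every image-text combination in the batch as an independent binary decision (matched or not), so no row or column normalization is needed. The function names are hypothetical, inputs are assumed to be unit-normalized, and the temperature and bias are fixed illustrative constants, whereas in the paper they are learnable parameters.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_pair_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Sum of independent binary losses over all N x N pairs, divided by N.

    Pair (i, j) is labeled +1 when i == j (matched) and -1 otherwise.
    """
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            label = 1.0 if i == j else -1.0
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            total += log_sigmoid(label * logit)
    return -total / n

# Toy batch of two unit-norm pairs: aligned versus deliberately mismatched.
toy = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = sigmoid_pair_loss(toy, toy)
swapped_loss = sigmoid_pair_loss(toy, toy[::-1])
```

Because each term depends only on one pair, the double loop can be split into chunks and computed across devices without gathering the full similarity matrix, which is the memory advantage the paragraph above refers to.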
EVA-CLIP (Sun et al. at BAAI, arXiv 2303.15389) brought masked-image-modeling pretraining and LAMB optimization to CLIP; the 5B-parameter EVA-02-CLIP-E/14+ reached 82.0% zero-shot ImageNet with 9B seen samples, and follow-up work scaled to 18B parameters (arXiv 2402.04252).
Apple's Data Filtering Networks paper (Fang et al., arXiv 2309.17425) showed that a small dedicated network trained to score image-text pair quality could produce a 5B-image curated pool from 43B uncurated pairs. A ViT-H trained on DFN-5B reached 84.4% zero-shot ImageNet, beating models trained on LAION-2B, DataComp-1B, and OpenAI's WIT. Apple released DFN-2B and DFN-5B CLIP variants. AIMv2 (Fini et al., arXiv 2411.14402, November 2024) departed from contrastive training entirely, pairing a vision encoder with a multimodal decoder that autoregressively predicted image patches and text tokens; AIMv2-3B reached 89.5% ImageNet linear probe accuracy. DINOv2 (Oquab et al., arXiv 2304.07193) trained on 142M curated images without text supervision but produced features that, paired with a text head, support strong open-vocabulary classification.
Given a dual-encoder model, classifying an image $x$ against $K$ candidate labels works as follows. The image encoder $f_v$ maps $x$ to a unit-normalized embedding $v$. For each class name $c_k$, the text encoder maps a prompt template such as "a photo of a $c_k$" to a unit-normalized embedding $t_k$. The predicted class is $\hat{y} = \arg\max_k \langle v, t_k \rangle$, the cosine similarity argmax. No gradient updates or labeled images for candidate classes are required.
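The inference rule above reduces to a normalized dot product followed by an argmax. The sketch below uses small hand-written vectors as stand-ins for real encoder outputs; the function names and the toy embeddings are assumptions for illustration, not the output of any actual model.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_classify(image_emb, text_embs, labels):
    """Return the label whose text embedding has the highest cosine similarity."""
    v = l2_normalize(image_emb)
    scores = [
        sum(a * b for a, b in zip(v, l2_normalize(t)))
        for t in text_embs
    ]
    best = max(range(len(labels)), key=lambda k: scores[k])
    return labels[best], scores

# Toy stand-ins: in practice text_embs come from encoding prompts such as
# "a photo of a cat" and image_emb from encoding the image.
labels = ["cat", "dog", "car"]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.1, 0.9]]
image_emb = [0.2, 0.8, 0.1]
predicted, scores = zero_shot_classify(image_emb, text_embs, labels)
```

Since the text embeddings depend only on the label set, they can be computed once and reused for every image classified against the same labels.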
This mechanism converts the text encoder into a classifier-weight generator. Rather than having a fixed matrix of learned weight vectors, the model generates classifier weights on the fly from natural language. The same image encoder can therefore be used for ImageNet-1k (1000 classes), a custom product taxonomy (500 classes), or an unusual domain-specific ontology, all without any retraining.
Prompt wording matters substantially. The CLIP paper showed that "a photo of a {class}" lifted ImageNet zero-shot accuracy by about 1.3 percentage points over the bare class name, because raw class names rarely appear alone in web captions. Prompt ensembling, where multiple templates per class are encoded and their text embeddings averaged before the argmax, gave a further 3.5-point gain with 80 templates. Class-name disambiguation matters too: "crane" returns the bird in some prompts and the construction machine in others, so prompts often add a hint ("a photo of a crane, a type of bird").
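Prompt ensembling can be sketched as follows: encode every template for a class, normalize each embedding, average them, and renormalize the mean so it can be used like any other class embedding. The `toy_encode_text` function is a hypothetical stand-in (a letter-frequency vector) so the example runs without a model; a real pipeline would call the text encoder instead, and the three templates are illustrative rather than the 80 used in the CLIP paper.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def toy_encode_text(text):
    """Hypothetical stand-in for a text encoder: 26-dim letter counts."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts

def ensemble_class_embedding(class_name, templates, encode_text):
    """Average the normalized embeddings of all prompts, then renormalize."""
    embs = [l2_normalize(encode_text(t.format(class_name))) for t in templates]
    mean = [sum(dim) / len(embs) for dim in zip(*embs)]
    return l2_normalize(mean)

templates = ["a photo of a {}.", "a blurry photo of a {}.", "a sketch of a {}."]
crane = ensemble_class_embedding("crane", templates, toy_encode_text)
```

The resulting vector replaces the single-prompt embedding for that class, so inference cost per image is unchanged regardless of how many templates are used.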
Several follow-up methods improve beyond hand-crafted templates. CuPL (Pratt et al., 2023) used GPT-3 to generate natural language descriptions of each class, then averaged their text embeddings. DCLIP (Menon and Vondrick, 2022) prompted a language model for visual discriminative features per class and aggregated them. WaffleCLIP (Roth et al., ICCV 2023, arXiv 2306.07282) showed, perhaps counterintuitively, that appending random words and broad concepts alongside a class name achieves similar gains to LLM-generated descriptions; this suggests that extra tokens aid calibration and context more than semantic content per se.
Learned prompt methods go a step further. CoOp (Zhou et al., arXiv 2109.01134) replaced the hand-crafted prefix with a set of learnable continuous vectors optimized on a small labeled set of the target domain classes. CoCoOp (Zhou et al., CVPR 2022, arXiv 2203.05557) extended this with an instance-conditioned token generated by a lightweight meta-network, improving generalization from base classes to unseen classes by over 4 percentage points on average. These methods blur the line between zero-shot and few-shot, since they require a small labeled set to tune the prompt, but they illustrate how the text-encoder-as-classifier design supports flexible adaptation.
| Model | Release | Organization | Parameters | Training data | Zero-shot IN-1k |
|---|---|---|---|---|---|
| CLIP ViT-L/14 | Jan 2021 | OpenAI | 428M | WIT 400M | 75.5% |
| ALIGN EfficientNet-L2 | Feb 2021 | Google | 820M | 1.8B noisy pairs | 76.4% |
| LiT ViT-g/14 | Nov 2021 | Google | 1.0B+ | 4B pairs | 85.2% |
| Florence | Nov 2021 | Microsoft | 893M | FLD-900M | 83.7% |
| OpenCLIP ViT-H/14 | Sep 2022 | LAION | 986M | LAION-2B-en | 78.0% |
| OpenCLIP ViT-bigG/14 | Mar 2023 | LAION | 2.5B | LAION-2B-en | 80.1% |
| CoCa | May 2022 | Google | 2.1B | JFT-3B | 86.3% |
| BLIP-2 ViT-g | Jan 2023 | Salesforce | 1.2B+Q | 129M | LLM coupled |
| DataComp-1B ViT-L/14 | Apr 2023 | Multi-institution | 428M | DataComp-1B | 79.2% |
| EVA-02-CLIP-E/14+ | Mar 2023 | BAAI | 5.0B | LAION-2B+COYO | 82.0% |
| SigLIP SO400m | Mar 2023 | Google | 400M | WebLI | 83.2% |
| MetaCLIP ViT-H/14 | Sep 2023 | Meta | 986M | CC 2.5B | 80.5% |
| DFN-5B ViT-H/14 | Nov 2023 | Apple | 986M | DFN-5B | 84.4% |
| EVA-CLIP-18B | Feb 2024 | BAAI | 18.1B | Merged-2B+ | 83.0% |
| AIMv2-3B | Nov 2024 | Apple | 3.0B | DFN-2B | 89.5% (linear) |
| SigLIP 2 B/16 | Feb 2025 | Google | 86M | WebLI ml | 79.1% |
| SigLIP 2 g | Feb 2025 | Google | 1.0B | WebLI ml | 85.0%+ |
Numbers reflect headline zero-shot ImageNet-1k top-1 accuracy reported by authors; ensembling, resolutions, and prompt sets vary. AIMv2 reports linear probe accuracy rather than zero-shot.
A zero-shot classifier's value depends on how it generalizes across class taxonomies and distribution shifts. The standard suite used by CLIP, OpenCLIP, SigLIP, and SigLIP 2 covers natural object recognition, fine-grained categorization, satellite imagery, and natural adversarial examples.
| Benchmark | Year | Classes | Focus |
|---|---|---|---|
| ImageNet-1k | 2015 | 1000 | General object recognition |
| ImageNet-V2 | 2019 | 1000 | Distribution shift (resampled test set) |
| ImageNet-A | 2019 | 200 | Naturally adversarial images |
| ImageNet-R | 2020 | 200 | Artistic renditions, sketches, sculptures |
| ImageNet-Sketch | 2019 | 1000 | Black-and-white sketch domain |
| ObjectNet | 2019 | 313 | Crowdsourced novel viewpoints |
| CIFAR-100 | 2009 | 100 | Low-resolution objects |
| Oxford Flowers 102 | 2008 | 102 | Fine-grained flowers |
| Food-101 | 2014 | 101 | Dish recognition |
| Stanford Cars | 2013 | 196 | Vehicle make and model |
| Country211 | 2021 | 211 | Geolocation by country |
| EuroSAT | 2019 | 10 | Satellite land cover |
| RESISC45 | 2017 | 45 | Aerial remote sensing |
| ELEVATER | 2022 | 20 tasks | Classification suite with external knowledge |
Distribution-shift gaps across IN-V2, IN-A, IN-R, IN-Sketch, and ObjectNet are central to claims about CLIP-family generalization: supervised ImageNet classifiers can lose 40 points moving from IN-1k to these shifted sets, while CLIP-style models typically lose under 10 points. The original CLIP paper showed that all evaluated CLIP models improved "effective robustness" substantially, reducing the gap between in-distribution and out-of-distribution accuracy by up to 75% compared to supervised ResNet baselines with equivalent ImageNet performance.
The ELEVATER benchmark (Li et al., NeurIPS 2022, arXiv 2204.08790) provides a structured testbed specifically for language-augmented visual models. It covers 20 image classification datasets spanning diverse domains including natural objects, textures, satellite imagery, and specialized scientific categories, alongside 35 object detection datasets. Each dataset is augmented with external knowledge from thesauri, dictionaries, and GPT-3-generated descriptions. ELEVATER measures sample efficiency (zero-shot, few-shot, and full-shot) as well as parameter efficiency (linear probing versus full fine-tuning), making it one of the broadest standardized evaluations for zero-shot image classifiers.
Zero-shot classification is the foundational capability, but the same vision-language alignment transfers to detection and segmentation tasks. OWL-ViT (Minderer et al., ECCV 2022) removed the final token pooling from a CLIP vision encoder and attached lightweight classification and box regression heads to each patch token, enabling open-vocabulary object detection with natural language queries. OWLv2 (NeurIPS 2023) scaled this to self-training on web images, reaching strong performance on LVIS and COCO with arbitrary class names. GLIP (Li et al., 2022) framed detection as phrase grounding and pretrained on combined detection, grounding, and captioning data. Grounded-SAM and subsequent work combine CLIP-family text encoders with the Segment Anything Model to produce open-vocabulary segmentation. These extensions are covered in dedicated articles; see the See also section.
Three trends shape the 2024-2025 wave of zero-shot vision. First, SigLIP 2 (February 2025) outperforms the original SigLIP at every size class while extending multilingual coverage to 109 languages; SigLIP-family encoders were already a common default backbone in open-weight vision-language models such as LLaVA-OneVision. Second, AIMv2 (Apple, late 2024) shows that multimodal autoregressive pretraining can match or beat contrastive training. Third, data curation has become the dominant lever: MetaCLIP, DFN, and SigLIP 2's online curation all show that quality and balance outweigh sheer scale beyond a point.
Large vision-language models commonly pair a pretrained vision tower such as SigLIP or AIMv2 with a language model that ingests visual tokens, although some, such as Qwen2-VL and InternVL2, train their own vision encoders. Synthetic captions and bootstrapped data are now standard: the BLIP captioner-and-filter cycle, the DFN learned filter, and SigLIP 2's online curation all clean or re-caption a noisy web pool before the main training run consumes it. For more on how zero-shot classification models function as components inside generative vision-language systems, see the dedicated image-to-text models article.
Zero-shot classifiers are deployed wherever the label set is open or changes frequently. Retail systems index catalog images against text descriptions written by merchandisers without commissioning new labeled training data per season or product line. Content moderation pipelines score user-uploaded images against custom policy categories ("violent imagery", "hate symbol") that operators update in plain text, so policy changes do not require model retraining. Medical triage prototypes flag images with rare findings whose labeled training data is scarce, though clinical deployment requires careful validation. Work such as MedCLIP and BioMedCLIP adapts contrastive pretraining to medical image-text pairs such as radiology reports and biomedical figure captions, and research on fairness-aware CLIP variants (AdFair-CLIP, FairerCLIP) targets more equitable predictions, including in chest X-ray diagnosis.
Image search engines combine a CLIP text tower for query encoding with cosine similarity retrieval over precomputed image embeddings, supporting natural-language search such as "a black labrador sleeping on a red couch". Dataset curation tools such as those used for LAION-5B and the DFN pipelines use CLIP scores to filter and deduplicate web pairs. Automated tagging pipelines that previously required a custom CNN per taxonomy now use a single CLIP or SigLIP backbone and rotate the prompt set.
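The retrieval pattern described above amounts to a top-k search by dot product over a precomputed index. The sketch below assumes every embedding is already unit-normalized, so the dot product equals cosine similarity; the function name, file names, and vectors are hypothetical, and a production system would typically use an approximate nearest-neighbor index rather than a linear scan.

```python
import heapq

def search(query_emb, image_index, k=3):
    """Return the k (score, image_id) pairs with the highest dot product.

    `image_index` maps image ids to precomputed, unit-normalized embeddings.
    """
    scored = (
        (sum(a * b for a, b in zip(query_emb, emb)), image_id)
        for image_id, emb in image_index.items()
    )
    return heapq.nlargest(k, scored)

# Toy index: in practice the vectors come from the image encoder, and the
# query vector from encoding text such as "a black labrador on a red couch".
image_index = {
    "labrador_couch.jpg": [0.8, 0.6, 0.0],
    "city_skyline.jpg": [0.0, 0.6, 0.8],
    "cat_sofa.jpg": [0.6, 0.8, 0.0],
}
query_emb = [1.0, 0.0, 0.0]
results = search(query_emb, image_index, k=2)
top_ids = [image_id for _, image_id in results]
```

Only the query text is encoded at search time; the image embeddings are computed once when the collection is indexed.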
Accessibility tools benefit because a single zero-shot model can generate category labels across any domain without domain-specific retraining. Security and surveillance applications employ CLIP-style classifiers for scene understanding with flexible, operator-defined event categories. Geospatial analysis uses EuroSAT and RESISC45 as zero-shot benchmarks, and practitioners have deployed CLIP for satellite image tagging tasks that would previously have required specialized annotated datasets.
Despite their flexibility, zero-shot classifiers have persistent weaknesses. They are sensitive to prompt wording: switching from "a photo of a {class}" to "a {class}" can move accuracy by several points, and class-name ambiguity often requires hand-crafted context. Probability calibration is poor, complicating threshold-based filtering. Fine-grained categories such as bird species, plant cultivars, and specific car trims remain difficult because web captions rarely use exact scientific or model names. Counting and spatial relations are weak across the CLIP family, since the contrastive objective rewards holistic image-text matching rather than compositional structure.
Web-scale training data inherits its biases. CLIP audits have documented disparate misclassification rates by gender and race, including higher misclassification of photographs of Black individuals into non-human categories and gendered associations with occupations and crime. LAION-400M-based models have been shown to disproportionately associate Muslim, Black, and immigrant identities with toxic prompts. Research on multilingual CLIP checkpoints has further revealed that language resource level, grammatical gender in source languages, and architectural choices jointly shape bias patterns across ten typologically diverse languages. Scaling alone does not guarantee fairness; larger models can amplify these biases when the underlying data distribution is imbalanced. SigLIP 2 explicitly added de-biasing steps to its data mix.
Evaluation reliability is itself an open problem: small changes in class taxonomy, prompt template, or preprocessing can shift reported numbers by several points. Strongly supervised baselines still win on narrow tasks with clean labels such as medical imaging panels and industrial defect detection; zero-shot classifiers excel at breadth rather than as drop-in replacements in high-stakes settings. The requirement for a text encoder at inference time adds memory and latency compared to a single-model classifier, though this cost is often amortized by precomputing text embeddings for a fixed label set.