Zero-Shot Image Classification Models
Last reviewed
May 11, 2026
Sources
20 citations
Review status
Source-backed
Revision
v2 · 2,490 words
Zero-shot image classification models are vision systems that assign images to categories the model has never encountered as labeled training examples. Instead of learning a fixed output head for a closed set of classes, these models compare visual embeddings to embeddings of the candidate label names expressed in natural language, then pick the label whose text embedding is closest in a shared representation space. The dominant family in 2025 builds on contrastive vision-language pretraining: CLIP (OpenAI, 2021), ALIGN (Google, 2021), OpenCLIP (LAION, 2022), SigLIP (Google, 2023), MetaCLIP (Meta, 2023), EVA-CLIP (BAAI, 2023), DFN (Apple, 2023), and SigLIP 2 (Google, 2025). The key insight is that web-scale image-text corpora teach a single image encoder and a single text encoder to project matched pairs near each other in a shared embedding space, so the image encoder can classify any image against any set of class names the user writes at inference time.
Zero-shot classification differs from few-shot, which uses a handful of labeled examples per class, and from supervised classification. A supervised ImageNet classifier is locked to its 1000 categories; switching label sets requires new labeled data and retraining. A CLIP model can be evaluated on any benchmark by writing class names as text prompts, with no parameter updates.
The term "zero-shot learning" entered computer vision through attribute-based recognition. Lampert, Nickisch, and Harmeling introduced Direct Attribute Prediction (DAP) and Indirect Attribute Prediction (IAP) at CVPR 2009, training classifiers for human-defined visual attributes (such as "has stripes", "is furry") on seen animal classes and transferring those detectors to unseen classes with known attribute profiles.
Four years later, Frome et al. at Google published DeViSE at NeurIPS 2013, which replaced hand-crafted attributes with continuous label embeddings produced by a Word2Vec model. DeViSE projected image features into the same space as the class-name word vectors, so unseen classes could be recognized by nearest-neighbor lookup over the word embeddings of their names. Norouzi et al. extended this with ConSE in 2013, which took the class probabilities of a softmax classifier over seen classes and used them as weights for a convex combination of the seen classes' word vectors, placing an image at a point in the semantic space where unseen class names could be matched. Together with GloVe embeddings (Pennington et al., 2014), these methods set the template that contrastive language-image models would later scale up.
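The ConSE idea can be illustrated with a small sketch. The three-dimensional word vectors, the class names, and the helper names `conse_embed` and `nearest_label` are hypothetical stand-ins chosen for illustration, not values or code from the original paper; a real system would use Word2Vec or GloVe vectors and a trained softmax classifier.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def conse_embed(seen_probs, seen_word_vecs, top_t=2):
    """Convex combination of the word vectors of the top-T seen classes,
    weighted by the classifier's probabilities for those classes."""
    top = sorted(seen_probs, key=seen_probs.get, reverse=True)[:top_t]
    z = sum(seen_probs[c] for c in top)  # renormalize over the kept classes
    dim = len(seen_word_vecs[top[0]])
    return [sum(seen_probs[c] * seen_word_vecs[c][d] for c in top) / z for d in range(dim)]

def nearest_label(embedding, candidate_vecs):
    """Pick the candidate class whose word vector is closest in cosine similarity."""
    return max(candidate_vecs, key=lambda c: cosine(embedding, candidate_vecs[c]))

# Toy (hypothetical) word vectors for seen and unseen classes.
seen_word_vecs = {"horse": [1.0, 0.0, 0.0], "tiger": [0.0, 1.0, 0.0], "car": [0.0, 0.0, 1.0]}
unseen_word_vecs = {"zebra": [0.7, 0.7, 0.0], "truck": [0.0, 0.1, 1.0]}

# Softmax output of a classifier trained only on the seen classes (made up here).
seen_probs = {"horse": 0.6, "tiger": 0.3, "car": 0.1}

image_point = conse_embed(seen_probs, seen_word_vecs, top_t=2)
predicted_unseen = nearest_label(image_point, unseen_word_vecs)
```

In this toy setup an image that looks partly like a horse and partly like a tiger lands near "zebra", a class the classifier never saw.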
The modern era begins with two papers in early 2021. ALIGN (Jia et al. at Google, arXiv 2102.05918) trained a dual-encoder model on 1.8 billion noisy image-alt-text pairs scraped from the web with minimal filtering, showing that scale could compensate for noise. OpenAI had announced CLIP in January 2021, and its paper (Radford et al., arXiv 2103.00020) reached arXiv about two weeks after ALIGN; CLIP was trained on 400 million curated image-text pairs called WIT (WebImageText). CLIP used a Vision Transformer or ResNet image encoder paired with a Transformer text encoder, optimizing a symmetric InfoNCE contrastive objective that raised the cosine similarity of matched pairs and lowered it for mismatched pairs within each batch.
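A minimal sketch of a symmetric InfoNCE loss of this kind follows. It uses plain Python lists instead of a tensor library, and it fixes the temperature at 0.07 as an assumption for simplicity (CLIP learns this value during training). The function name `clip_contrastive_loss` is illustrative, not an API of any library.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch in which image i is paired with text i."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(t) for t in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    # Image-to-text direction: each row should assign its mass to the diagonal entry.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: the same cross-entropy computed over columns.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy batch: perfectly aligned pairs versus the same embeddings with texts swapped.
aligned_loss = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The aligned batch gives a loss near zero, while the swapped batch is penalized heavily, which is the behavior the objective is meant to enforce.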
CLIP produced the first widely available zero-shot classifier that matched ResNet-50 ImageNet accuracy without any ImageNet training data, and generalized across more than 30 benchmarks including OCR, geolocation, and fine-grained categories. Follow-up work refined the recipe: LiT (Zhai et al., arXiv 2111.07991) froze a strong pretrained image encoder and contrastively trained only the text tower, reaching 85.2% zero-shot ImageNet. Florence at Microsoft (arXiv 2111.11432) used a CoSwin backbone on FLD-900M. FILIP (arXiv 2111.07783) modified the loss for token-level similarities, and DeCLIP added self-supervised signals to improve data efficiency.
While CLIP's weights were released, OpenAI's training data was not. The open-source community responded with OpenCLIP, an open implementation by Ilharco, Wortsman, and colleagues that reproduced and scaled CLIP using the open datasets curated by LAION. The LAION-2B-en subset of LAION-5B became the standard pretraining pool. OpenCLIP released ViT-B, ViT-L, ViT-H/14, ViT-g/14, and ViT-bigG/14 checkpoints; the bigG variant reached 80.1% zero-shot ImageNet top-1, surpassing the original CLIP ViT-L/14. Cherti et al. studied reproducible scaling laws for these models in arXiv 2212.07143.
In 2023, Meta released MetaCLIP (Xu et al., arXiv 2309.16671), which reverse-engineered CLIP's data curation procedure: balancing a raw CommonCrawl pool over a metadata distribution derived from CLIP's published vocabulary of substring queries. A MetaCLIP ViT-B/16 on 400M curated pairs reached 70.8% zero-shot ImageNet versus CLIP's 68.3%; scaling to 1B pairs reached 72.4%.
Several 2022-2025 papers refined the contrastive objective itself. CoCa (Yu et al., arXiv 2205.01917) combined a contrastive loss with an autoregressive captioning loss on a multimodal decoder; the largest CoCa model (2.1B parameters) reached 86.3% zero-shot ImageNet. BLIP (Li et al. at Salesforce, arXiv 2201.12086) introduced caption bootstrapping with a synthetic captioner and a noise filter; BLIP-2 (arXiv 2301.12597) added a Q-Former that bridged a frozen image encoder to a frozen large language model, outperforming Flamingo on zero-shot VQAv2 with 54 times fewer trainable parameters.
SigLIP (Zhai et al. at Google, arXiv 2303.15343) replaced the softmax-normalized contrastive loss with a pairwise sigmoid loss applied independently to each pair. The sigmoid formulation treats every image-text pair in the batch as its own binary classification, so it needs no batch-wide softmax normalization; this lowered memory cost, performed better at smaller batch sizes, and matched or exceeded softmax CLIP at scale. The SigLIP-SO400m variant (400M parameters) became a default vision encoder for many open vision-language models. SigLIP 2 (Tschannen et al., arXiv 2502.14786, February 2025) extended the recipe with captioning losses, self-distillation, masked prediction, and online data curation. SigLIP 2 ships at four sizes (B/86M, L/303M, So400m/400M, g/1B), is multilingual, and outperforms the original SigLIP at every scale.
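The pairwise sigmoid loss can be sketched as below. This is an illustrative pure-Python version, not the authors' implementation: the scale `t` and bias `b` are learnable in SigLIP, and the fixed values 10 and -10 used here are assumed starting points for the example only.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_pairwise_loss(image_embs, text_embs, t=10.0, b=-10.0):
    """Every (image i, text j) pair is an independent binary problem:
    label +1 on the diagonal (matched), -1 everywhere else."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        for j, txt in enumerate(txts):
            logit = t * sum(a * c for a, c in zip(img, txt)) + b
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * logit)
    # Sum over all pairs, averaged over the batch size.
    return -total / len(imgs)

aligned_loss = sigmoid_pairwise_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = sigmoid_pairwise_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Because no term depends on a normalizer computed across the whole batch, each pair's contribution can be computed in isolation, which is what makes the loss easier to shard across devices.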
EVA-CLIP (Sun et al. at BAAI, arXiv 2303.15389) brought masked-image-modeling pretraining and LAMB optimization to CLIP; the 5B-parameter EVA-02-CLIP-E/14+ reached 82.0% zero-shot ImageNet with 9B seen samples, and follow-up work scaled to 18B parameters (arXiv 2402.04252).
Apple's Data Filtering Networks paper (Fang et al., arXiv 2309.17425) showed that a small dedicated network trained to score image-text pair quality could produce a 5B-image curated pool from 43B uncurated pairs. A ViT-H trained on DFN-5B reached 84.4% zero-shot ImageNet, beating models trained on LAION-2B, DataComp-1B, and OpenAI's WIT. Apple released DFN-2B and DFN-5B CLIP variants. AIMv2 (Fini et al., arXiv 2411.14402, November 2024) departed from contrastive training entirely, pairing a vision encoder with a multimodal decoder that autoregressively predicted image patches and text tokens; AIMv2-3B reached 89.5% ImageNet linear probe accuracy. DINOv2 (Oquab et al., arXiv 2304.07193) trained on 142M curated images without text supervision but produced features that, paired with a text head, support strong open-vocabulary classification.
Given a dual-encoder model, classifying an image $x$ against $K$ candidate labels works as follows. The image encoder $f_v$ maps $x$ to a unit-normalized embedding $v$. For each class name $c_k$, the text encoder maps a prompt template such as "a photo of a $c_k$" to a unit-normalized embedding $t_k$. The predicted class is $\hat{y} = \arg\max_k \langle v, t_k \rangle$, the cosine similarity argmax. No gradient updates or labeled images for candidate classes are required.
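The inference rule above can be sketched in a few lines. The embeddings here are made-up three-dimensional vectors standing in for the outputs of $f_v$ and the text encoder; `zero_shot_classify` is an illustrative helper name, not a library function.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_classify(image_embedding, label_embeddings):
    """Return the label whose text embedding has the highest cosine
    similarity with the image embedding, plus all scores."""
    v = normalize(image_embedding)
    scores = {
        label: sum(a * b for a, b in zip(v, normalize(t)))
        for label, t in label_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical text embeddings for prompts such as "a photo of a cat".
label_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}
# Hypothetical image embedding produced by the image encoder.
image_embedding = [0.2, 0.8, 0.1]

prediction, scores = zero_shot_classify(image_embedding, label_embeddings)
```

Changing the label set only means changing the keys and text embeddings in `label_embeddings`; the image embedding and the model weights stay the same.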
Prompt wording matters. The CLIP paper showed that "a photo of a {class}" lifted ImageNet zero-shot accuracy by about 1.3 percentage points over the bare class name, because raw class names rarely appear alone in web captions. Prompt ensembling, where multiple templates per class are encoded and their text embeddings averaged before the argmax, gave a further gain of about 3.5 points with 80 templates. Class-name disambiguation matters too: "crane" returns the bird in some prompts and the construction machine in others, so prompts often add a hint ("a photo of a crane, a type of bird").
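Prompt ensembling can be sketched as follows, assuming the per-template text embeddings have already been computed by some text encoder. The three templates and the two-dimensional embeddings are invented for illustration; the normalize, average, then re-normalize order is one common way to do it and is an assumption of this sketch.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def ensemble_class_embedding(template_embeddings):
    """Average the unit-normalized embeddings of several prompts for one
    class, then re-normalize so the result can be used with cosine similarity."""
    unit = [normalize(e) for e in template_embeddings]
    dim = len(unit[0])
    mean = [sum(e[d] for e in unit) / len(unit) for d in range(dim)]
    return normalize(mean)

# Illustrative templates; the class name carries a disambiguating hint.
templates = ["a photo of a {}.", "a blurry photo of a {}.", "a sketch of a {}."]
prompts = [t.format("crane, a type of bird") for t in templates]

# Hypothetical text-encoder outputs, one per prompt above.
prompt_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

class_embedding = ensemble_class_embedding(prompt_embeddings)
```

The resulting `class_embedding` replaces the single-prompt embedding $t_k$ in the argmax, so the cost of ensembling is paid once per class rather than once per image.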
| Model | Release | Organization | Parameters | Training data | Zero-shot IN-1k |
|---|---|---|---|---|---|
| CLIP ViT-L/14 | Jan 2021 | OpenAI | 428M | WIT 400M | 75.5% |
| ALIGN EfficientNet-L2 | Feb 2021 | Google | 820M | 1.8B noisy pairs | 76.4% |
| LiT ViT-g/14 | Nov 2021 | Google | 1.0B+ | 4B pairs | 85.2% |
| Florence | Nov 2021 | Microsoft | 893M | FLD-900M | 83.7% |
| OpenCLIP ViT-H/14 | Sep 2022 | LAION | 986M | LAION-2B-en | 78.0% |
| OpenCLIP ViT-bigG/14 | Mar 2023 | LAION | 2.5B | LAION-2B-en | 80.1% |
| CoCa | May 2022 | Google | 2.1B | JFT-3B | 86.3% |
| BLIP-2 ViT-g | Jan 2023 | Salesforce | 1.2B+Q | 129M | LLM coupled |
| EVA-02-CLIP-E/14+ | Mar 2023 | BAAI | 5.0B | LAION-2B+COYO | 82.0% |
| SigLIP SO400m | Mar 2023 | Google | 400M | WebLI | 83.2% |
| MetaCLIP ViT-H/14 | Sep 2023 | Meta | 986M | CC 2.5B | 80.5% |
| DFN-5B ViT-H/14 | Nov 2023 | Apple | 986M | DFN-5B | 84.4% |
| EVA-CLIP-18B | Feb 2024 | BAAI | 18.1B | Merged-2B+ | 83.0% |
| AIMv2-3B | Nov 2024 | Apple | 3.0B | DFN-2B | 89.5% (linear) |
| SigLIP 2 g | Feb 2025 | Google | 1.0B | WebLI ml | 85.0%+ |
Numbers reflect headline zero-shot ImageNet-1k top-1 accuracy reported by authors; ensembling, resolutions, and prompt sets vary.
A zero-shot classifier's value depends on how it generalizes across class taxonomies and distribution shifts. The standard suite used by CLIP, OpenCLIP, SigLIP, and SigLIP 2 covers natural object recognition, fine-grained categorization, satellite imagery, and natural adversarial examples.
| Benchmark | Year | Classes | Focus |
|---|---|---|---|
| ImageNet-1k | 2015 | 1000 | General object recognition |
| ImageNet-V2 | 2019 | 1000 | Distribution shift (resampled test set) |
| ImageNet-A | 2019 | 200 | Naturally adversarial images |
| ImageNet-R | 2020 | 200 | Artistic renditions, sketches, sculptures |
| ImageNet-Sketch | 2019 | 1000 | Black-and-white sketch domain |
| ObjectNet | 2019 | 313 | Crowdsourced novel viewpoints |
| CIFAR-100 | 2009 | 100 | Low-resolution objects |
| Oxford Flowers 102 | 2008 | 102 | Fine-grained flowers |
| Food-101 | 2014 | 101 | Dish recognition |
| Stanford Cars | 2013 | 196 | Vehicle make and model |
| Country211 | 2021 | 211 | Geolocation by country |
| EuroSAT | 2019 | 10 | Satellite land cover |
| RESISC45 | 2017 | 45 | Aerial remote sensing |
| ELEVATER | 2022 | 20 tasks | Classification suite |
Distribution-shift gaps across IN-V2, IN-A, IN-R, IN-Sketch, and ObjectNet are central to claims about CLIP-family generalization: supervised ImageNet classifiers can lose 40 points moving from IN-1k to these shifted sets, while CLIP-style models typically lose under 10 points.
Three trends shape the 2024-2025 wave of zero-shot vision. First, SigLIP 2 (February 2025) sets new state-of-the-art numbers at every size class while improving multilingual coverage, becoming the default backbone in many open-weight vision-language models including LLaVA-OneVision and Qwen2-VL successors. Second, AIMv2 (Apple, late 2024) shows that multimodal autoregressive pretraining can match or beat contrastive training. Third, data curation has become the dominant lever: MetaCLIP, DFN, and SigLIP 2's online curation all show that quality and balance outweigh sheer scale beyond a point.
Large vision-language models such as Qwen2-VL, InternVL2, and the Apple Intelligence stack typically use a SigLIP or AIMv2 vision tower paired with a language model that ingests visual tokens. Synthetic captions and bootstrapped data are now standard: the BLIP captioner-and-filter cycle, the DFN learned filter, and SigLIP 2's online curation all clean and re-caption a noisy web pool before the main contrastive run consumes it.
Zero-shot classifiers are deployed wherever the label set is open or changes frequently. Retail systems index catalog images against text descriptions written by merchandisers. Content moderation pipelines score user-uploaded images against custom policy categories ("violent imagery", "hate symbol") that operators update in plain text. Medical triage prototypes flag images with rare findings whose labeled training data is scarce, though clinical deployment requires careful validation. Image search engines combine a CLIP text tower for query encoding with cosine similarity retrieval over precomputed image embeddings, supporting natural-language search such as "a black labrador sleeping on a red couch". Dataset curation tools such as those used for LAION-5B and the DFN pipelines use CLIP scores to filter and deduplicate web pairs. Automated tagging pipelines that previously required a custom CNN per taxonomy now use a single CLIP or SigLIP backbone and rotate the prompt set.
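The image-search pattern described above (encode the query once, then rank precomputed image embeddings by cosine similarity) can be sketched as below. The file names and vectors are hypothetical, and a production system would typically use an approximate nearest-neighbor index instead of the exhaustive scan shown here.

```python
import heapq
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k(query_embedding, image_index, k=2):
    """Rank images by cosine similarity to the query embedding.
    `image_index` maps image id -> unit-normalized embedding."""
    q = normalize(query_embedding)
    scored = (
        (sum(a * b for a, b in zip(q, emb)), image_id)
        for image_id, emb in image_index.items()
    )
    best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [(image_id, score) for score, image_id in best]

# Precomputed (hypothetical) image embeddings, normalized once at indexing time.
image_index = {
    "labrador_couch.jpg": normalize([0.9, 0.4, 0.1]),
    "beach.jpg": normalize([0.1, 0.2, 0.9]),
    "cat_sofa.jpg": normalize([0.5, 0.8, 0.1]),
}

# Hypothetical text-tower output for "a black labrador sleeping on a red couch".
query_embedding = [1.0, 0.5, 0.0]

results = top_k(query_embedding, image_index, k=2)
```

Only the query passes through the text tower at search time; the image side is a lookup over stored vectors, which is what keeps this pattern cheap at catalog scale.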
Despite their flexibility, zero-shot classifiers have persistent weaknesses. They are sensitive to prompt wording: switching from "a photo of a {class}" to "a {class}" can move accuracy by several points, and class-name ambiguity often requires hand-crafted context. Probability calibration is poor, complicating threshold-based filtering. Fine-grained categories such as bird species, plant cultivars, and specific car trims remain difficult because web captions rarely use exact scientific or model names. Counting and spatial relations are weak across the CLIP family, since the contrastive objective rewards holistic image-text matching rather than compositional structure.
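One reason thresholds are hard to set can be seen in a small sketch: the probability assigned to the top label depends heavily on the scale applied to the cosine similarities before the softmax. The similarity values and the two scales (10 and 100) below are illustrative assumptions, not measurements from any particular model.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_probabilities(similarities, logit_scale):
    """Turn cosine similarities into label probabilities by scaling
    them and applying a softmax over the candidate labels."""
    return softmax([logit_scale * s for s in similarities])

# Hypothetical cosine similarities between one image and three label prompts.
similarities = [0.30, 0.25, 0.20]

soft = label_probabilities(similarities, logit_scale=10.0)
sharp = label_probabilities(similarities, logit_scale=100.0)
```

The ranking of labels is identical in both cases, but the top probability moves from roughly one half to nearly one, so a fixed cutoff such as 0.9 should be validated on held-out data for the specific model and label set before it is used for filtering.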
Web-scale training data inherits its biases. CLIP audits have documented disparate misclassification rates by gender and race, including higher misclassification of photographs of Black individuals into non-human categories and gendered associations with occupations and crime. SigLIP 2 explicitly added de-biasing steps to its data mix. Extending CLIP-style models to more languages can improve retrieval while expanding the surface area for harmful stereotypes. Evaluation reliability is itself an open problem: small changes in class taxonomy, prompt template, or preprocessing can shift reported numbers by several points. Strongly supervised baselines still win on narrow tasks with clean labels such as medical imaging panels and industrial defect detection; zero-shot classifiers excel at breadth rather than as drop-in replacements in high-stakes settings.