Zero-Shot Classification Models
Last reviewed
May 11, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 2,500 words
Zero-shot classification models are machine learning systems that assign input text to one or more candidate categories without having seen labeled training examples for those specific categories. Instead of per-class supervised data, these models use auxiliary information about the labels themselves: the natural language form of a category name, an entailment relation between the text and a hypothesis built from the class name, or a shared embedding space for inputs and labels. Within natural language processing, zero-shot classification covers topic labeling, sentiment categorization, intent detection, customer support routing, news triage, and content moderation.
This article focuses on zero-shot classification for text. Models that perform similar generalization for images are covered in Zero-Shot Image Classification Models. The related setting in which a handful of labeled examples are provided per class is discussed at Few-Shot Learning.
See also: Natural Language Processing Models and Tasks
In standard supervised classification a model is trained on a fixed label set Y and learns a direct mapping from text x to a probability distribution over y in Y. Zero-shot classification removes the assumption that the test label set was observed during training: the model receives a new label set Z at inference time and must produce a score for each z in Z using only the natural language form of the labels and optional descriptions.
Three settings sit on a spectrum. Supervised classification has many labeled examples per class. Few-Shot Learning provides between one and roughly fifty examples per class, often through in-context prompting. Zero-shot classification provides zero per-class examples, with the model relying on prior knowledge from pretraining plus the label text itself. A variant called generalized zero-shot classification mixes seen and unseen classes at test time and is harder than pure zero-shot, because models tend to be biased toward classes encountered during fine-tuning.
The term zero-shot learning was popularized by computer vision research in the late 2000s. Hugo Larochelle and collaborators introduced zero-data learning in 2008, showing that a network trained on a task description could solve unseen variants. Mark Palatucci, Dean Pomerleau, Geoffrey Hinton, and Tom Mitchell formalized the idea for classification in their 2009 paper "Zero-shot Learning with Semantic Output Codes," which decoded fMRI signals into words that had no training examples by routing predictions through a semantic feature space. Later vision work, including Yongqin Xian and collaborators' "Zero-Shot Learning, the Good, the Bad and the Ugly," defined the generalized zero-shot setup and benchmarked attribute-based methods.
Text classification followed a similar arc. Early systems combined word embeddings such as Word2vec and GloVe with class-label vectors to score documents by cosine similarity. Pretrained transformer encoders made it practical to recast classification as another language task, leading to two breakthroughs that define modern text zero-shot classification: the entailment reformulation of Yin, Hay, and Roth in 2019, and the in-context prompting capability of GPT-3 in 2020.
Contemporary zero-shot text classification falls into three broad approaches: natural language inference, prompted generative language models, and embedding similarity.
The natural language inference approach treats classification as an entailment problem. The input text is the premise, and for each candidate label the system constructs a short hypothesis sentence using a template such as "This text is about politics" or "This example expresses anger." A pretrained NLI classifier scores the probability that the premise entails the hypothesis, and the label with the highest entailment probability is returned. This formulation was introduced for text by Wenpeng Yin, Jamaal Hay, and Dan Roth in their 2019 EMNLP paper "Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach," arXiv 1909.00161. They showed that an NLI model fine-tuned on MultiNLI could classify topics, emotions, and situations on datasets it had never been trained on, often matching more complex methods of the time.
The NLI approach is widespread on the Hugging Face Hub through the zero-shot-classification pipeline. The most downloaded such model, facebook/bart-large-mnli, uses the BART encoder-decoder fine-tuned on MultiNLI and produces three logits per pair for entailment, neutral, and contradiction. For single-label problems the entailment scores across labels are softmaxed together; for multi-label problems each label is scored independently.
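The difference between single-label and multi-label scoring can be sketched in plain Python. The logits below are made-up values rather than model output, and the class order (contradiction, neutral, entailment) is an assumption that should be checked against the configuration of whichever checkpoint is used. In this sketch the multi-label score compares entailment against contradiction and ignores the neutral logit.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical NLI logits per candidate label, assumed to be ordered as
# (contradiction, neutral, entailment). Real values would come from the model.
CONTRADICTION, ENTAILMENT = 0, 2
logits_by_label = {
    "politics": (-1.2, 0.1, 2.3),
    "sports": (2.0, 0.3, -1.5),
    "economy": (-0.4, 0.2, 1.1),
}

def single_label_scores(logits_by_label):
    """Softmax the entailment logits across labels, so the scores sum to 1."""
    labels = list(logits_by_label)
    probs = softmax([logits_by_label[label][ENTAILMENT] for label in labels])
    return dict(zip(labels, probs))

def multi_label_scores(logits_by_label):
    """Score each label independently: entailment vs. contradiction only."""
    return {
        label: softmax([lg[CONTRADICTION], lg[ENTAILMENT]])[1]
        for label, lg in logits_by_label.items()
    }

single = single_label_scores(logits_by_label)
multi = multi_label_scores(logits_by_label)
print(max(single, key=single.get))
print({label: round(score, 3) for label, score in multi.items()})
```

In the single-label case the labels compete for probability mass, while in the multi-label case several labels can score highly at once, which is why a threshold is needed to turn the scores into decisions.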
The second paradigm uses a large language model such as GPT-3, GPT-4, Claude, or Llama 3 to perform classification directly through a natural language prompt. A typical prompt provides the input text, a list of candidate labels with optional definitions, and an instruction to return the most appropriate label. The seminal paper is Tom Brown and collaborators' "Language Models are Few-Shot Learners," arXiv 2005.14165, which introduced GPT-3 and showed that very large language models can solve a wide range of classification and reasoning tasks with no gradient updates. Instruction-tuned models such as Flan-T5 by Hyung Won Chung and collaborators, arXiv 2210.11416, made zero-shot prompting reliable on smaller open models by training on thousands of task descriptions.
Prompted classification is the most flexible paradigm because the model can read multi-sentence class definitions and handle hierarchical taxonomies. It is also the most expensive per call and the most sensitive to prompt phrasing.
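A minimal sketch of the prompt-and-parse loop follows. It makes no model call. The function names `build_prompt` and `parse_label`, the prompt wording, and the example labels are all hypothetical, and a real deployment would send the prompt to whichever LLM client is in use.

```python
def build_prompt(text, label_definitions):
    """Assemble a zero-shot classification prompt from labels and definitions."""
    lines = ["Classify the text into exactly one of the following categories.", ""]
    for label, definition in label_definitions.items():
        lines.append(f"- {label}: {definition}")
    lines += ["", f"Text: {text}", "", "Answer with the category name only."]
    return "\n".join(lines)

def parse_label(response, labels):
    """Map a free-text reply to a known label, or None if it is ambiguous."""
    reply = response.strip().lower()
    matches = [label for label in labels if label.lower() in reply]
    return matches[0] if len(matches) == 1 else None

# Hypothetical label set for a support-routing task.
label_definitions = {
    "billing": "Questions about charges, invoices, or refunds.",
    "technical": "Problems using the product or service.",
    "other": "Anything that fits neither category above.",
}

prompt = build_prompt("I was charged twice this month.", label_definitions)
print(prompt)
print(parse_label(" Billing.\n", label_definitions))
```

Returning `None` for replies that match no label or more than one gives the caller a clear signal to retry or route the item to review, which helps with the sensitivity to phrasing noted above.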
The third paradigm computes a sentence embedding for the input and a separate embedding for each candidate label, then assigns the label whose embedding has the highest cosine similarity with the input. Sentence-BERT, introduced by Nils Reimers and Iryna Gurevych in their 2019 EMNLP paper "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks," arXiv 1908.10084, gave a fast way to produce semantically meaningful sentence vectors at scale. Modern embedding APIs from OpenAI, Cohere, and open models such as BAAI BGE are commonly used in this setting.
Embedding similarity is the cheapest paradigm at inference time and is often used as a baseline or as a first-stage retriever. Its accuracy on abstract or contrastive labels tends to be lower than NLI or LLM methods, because the model has no explicit signal that labels should be treated as decision boundaries rather than just topics.
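The scoring step of the embedding paradigm reduces to a cosine similarity and an argmax, as the sketch below illustrates. The three-dimensional vectors are toy values chosen by hand. In practice they would come from a sentence embedding model, and label embeddings would usually be computed once and cached.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def classify_by_similarity(text_embedding, label_embeddings):
    """Return the label whose embedding is closest to the text, plus all scores."""
    scores = {
        label: cosine(text_embedding, vector)
        for label, vector in label_embeddings.items()
    }
    return max(scores, key=scores.get), scores

# Toy vectors standing in for real sentence embeddings (illustrative only).
label_embeddings = {
    "sports": [0.9, 0.1, 0.0],
    "finance": [0.1, 0.9, 0.2],
    "weather": [0.0, 0.2, 0.9],
}
text_embedding = [0.2, 0.8, 0.1]

best, scores = classify_by_similarity(text_embedding, label_embeddings)
print(best)
```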
The following table lists widely used models.
| Model | Year | Developer | Size | Approach | Key trait |
|---|---|---|---|---|---|
| BART large MNLI (facebook/bart-large-mnli) | 2019 | Meta AI | 0.4B | NLI entailment | Default Hugging Face zero-shot pipeline, 2.9M+ monthly downloads |
| RoBERTa large MNLI | 2019 | Meta AI | 0.355B | NLI entailment | Strong NLI baseline in early benchmarks |
| DeBERTa v3 large MNLI Fever ANLI Ling WANLI | 2022 | Moritz Laurer | 0.435B | NLI entailment | Trained on 885k NLI pairs; top NLI score on Hugging Face Hub in 2022 |
| TARS | 2020 | Halder, Akbik et al. | 0.110B | Universal binary classifier | COLING 2020; shipped in the Flair NLP library |
| GPT-3 | 2020 | OpenAI | 175B | Prompted LLM | First wide demonstration of zero-shot prompting at scale |
| GPT-4 | 2023 | OpenAI | undisclosed | Prompted LLM | High-accuracy reference classifier |
| Claude 3 family | 2024 | Anthropic | undisclosed | Prompted LLM | Long context suits multi-paragraph documents |
| Flan T5 large and XXL | 2022 | Google Research | 0.78B and 11B | Instruction-tuned LLM | Open-weight, strong on classification and reasoning |
| Llama 3 instruction tuned | 2024 | Meta AI | 8B and 70B | Prompted LLM | Open-weight self-hosted option |
| Sentence-BERT all-mpnet-base-v2 | 2020 | UKP Lab | 0.110B | Embedding similarity | Common baseline for embedding-based classification |
Most text zero-shot evaluation reuses topic, sentiment, and intent corpora from the supervised literature, splitting the label set so that some classes are held out. The most cited benchmark in the NLI lineage is the Yin et al. 2019 suite, which evaluates on topic, emotion, and situation datasets with explicit zero-shot splits.
| Dataset | Domain | Classes | Test size | Typical use |
|---|---|---|---|---|
| Yahoo Answers Topics | Q&A forum | 10 | 60,000 | Topic zero-shot benchmark in Yin et al. 2019 |
| AG News | Newswire | 4 | 7,600 | News topic classification |
| DBpedia 14 | Wikipedia abstracts | 14 | 70,000 | Fine-grained topic classification |
| Emotion (CARER) | Twitter messages | 6 | 2,000 | Emotion classification |
| ISEAR | Self-report responses | 7 | 1,533 | Cross-cultural emotion classification |
| 20 Newsgroups | Newsgroup posts | 20 | 7,532 | Long-running topic baseline |
| BoolQ | Wikipedia passages | 2 | 3,270 | Yes-or-no probe |
| RTE | News and Wikipedia | 2 | 3,000 | Textual entailment in SuperGLUE |
| SciCite | Citation contexts | 3 | 1,861 | Citation intent classification |
| Topical-Chat | Grounded dialogue | 8 | varies | Conversation topic detection |
| MultiNLI | Multi-genre pairs | 3 | 19,648 | Training data for NLI entailment models |
| ANLI | Adversarial NLI | 3 | 3,200 | Stress test for entailment classifiers |
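
The held-out-label idea behind these benchmarks can be sketched as a simple split over the label set. This is a generic illustration, not the exact protocol of Yin et al. or of any dataset in the table. The function names, label names, and example texts are hypothetical.

```python
import random

def split_labels(labels, n_unseen, seed=0):
    """Randomly hold out n_unseen labels; the rest are treated as seen."""
    rng = random.Random(seed)
    shuffled = sorted(labels)
    rng.shuffle(shuffled)
    return set(shuffled[n_unseen:]), set(shuffled[:n_unseen])

def partition_examples(examples, seen):
    """Examples with seen labels may be used for tuning; the rest are test-only."""
    tuning = [(text, y) for text, y in examples if y in seen]
    test_unseen = [(text, y) for text, y in examples if y not in seen]
    return tuning, test_unseen

labels = ["sports", "health", "finance", "science", "travel"]
examples = [
    ("The match went to extra time.", "sports"),
    ("Rates were raised again.", "finance"),
    ("A new exoplanet was found.", "science"),
    ("Flights to Lisbon are cheap.", "travel"),
    ("Sleep improves recovery.", "health"),
]

seen, unseen = split_labels(labels, n_unseen=2)
tuning, test_unseen = partition_examples(examples, seen)
print(sorted(seen), sorted(unseen))
```

Fixing the seed keeps the split reproducible, which matters because results can vary depending on which labels happen to be held out.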
Holistic benchmarks such as MMLU, HELM, and BIG-bench provide indirect signals of LLM zero-shot competence through many multiple choice problems with unseen labels.
The three paradigms have distinct cost and accuracy profiles. NLI-based classifiers are the smallest and fastest: a 0.4 billion parameter BART or DeBERTa NLI model can score thousands of premise-hypothesis pairs per second on a single GPU, which suits batch labeling and topic-style problems. Entailment probabilities can be thresholded for selective prediction, although they are often poorly calibrated, so the threshold should be tuned on held-out data.
Prompted LLM classifiers offer the strongest accuracy on hard categories, including those that require multi-sentence definitions, contextual reasoning, or knowledge that a 400 million parameter encoder does not contain. On simple problems an LLM costs one to three orders of magnitude more compute than an NLI model for comparable accuracy, so the extra cost pays off mainly on complex categories where a small NLI model fails outright.
Embedding similarity is the cheapest and is often used as a fast first stage. For 10,000 candidate labels and one input, dot products against precomputed label embeddings are far cheaper than 10,000 NLI forward passes. Accuracy generally lags the other paradigms when embeddings are used alone, especially on abstract or contrastive labels, but pure embedding scoring is competitive on concrete domains and pairs well with NLI or LLM reranking. A practical guideline is to start with NLI for straightforward labels, escalate to an LLM for difficult decisions, and use embeddings to scale to taxonomies with thousands of candidate labels.
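One way to combine the three paradigms is a cascade: embeddings shortlist the labels, NLI scores the shortlist, and an LLM is consulted only when NLI confidence is low. The sketch below takes the three stages as plain callables. The stub implementations, the shortlist size, and the 0.8 confidence threshold are illustrative assumptions rather than recommended values.

```python
def cascade_classify(text, labels, embed_rank, nli_score, llm_classify,
                     shortlist_size=5, confidence=0.8):
    """Embedding shortlist, then NLI scoring, then LLM fallback if unsure."""
    shortlist = embed_rank(text, labels)[:shortlist_size]
    scores = {label: nli_score(text, label) for label in shortlist}
    best = max(scores, key=scores.get)
    if scores[best] >= confidence:
        return best, "nli"
    return llm_classify(text, shortlist), "llm"

# Toy stand-ins for the real models (hypothetical, for illustration only).
def toy_embed_rank(text, labels):
    words = set(text.lower().split())
    # Labels that literally appear in the text are ranked first.
    return sorted(labels, key=lambda label: label.lower() not in words)

def toy_nli_score(text, label):
    return 0.95 if label.lower() in text.lower() else 0.30

def toy_llm_classify(text, shortlist):
    return shortlist[0]

labels = ["shipping", "refund", "login"]
result_easy = cascade_classify(
    "I want a refund for my order", labels,
    toy_embed_rank, toy_nli_score, toy_llm_classify, shortlist_size=2)
result_hard = cascade_classify(
    "Nothing works since yesterday", labels,
    toy_embed_rank, toy_nli_score, toy_llm_classify, shortlist_size=2)
print(result_easy, result_hard)
```

Passing the stages in as functions keeps the routing logic independent of any particular model or API, so each stage can be swapped or tuned separately.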
The hypothesis template chosen for an NLI classifier has a measurable effect on accuracy. The default in the Hugging Face zero-shot-classification pipeline is "This example is {}.", where each candidate label is substituted for the {} placeholder. Yin and collaborators reported that small phrasing changes can move topic classification accuracy by several points on Yahoo Answers, and affective templates such as "This text expresses {emotion}" outperform generic ones. Templates may include definitional content, for example "This message is a complaint about a product or service," which performs closer to LLM prompts on nuanced categories.
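Comparing templates on a small validation set can be sketched as a short harness. The templates use Python `str.format` with a `{}` placeholder for the label. The `score_pair` function is a deliberately crude word-overlap stand-in for an NLI model's entailment score, and the validation examples are invented, so the harness is meaningful only once a real scorer is plugged in.

```python
def template_accuracy(template, validation, labels, score_pair):
    """Fraction of validation items whose top-scoring hypothesis is correct."""
    correct = 0
    for text, gold in validation:
        hypotheses = {label: template.format(label) for label in labels}
        predicted = max(labels, key=lambda label: score_pair(text, hypotheses[label]))
        correct += predicted == gold
    return correct / len(validation)

def toy_score_pair(premise, hypothesis):
    """Word-overlap placeholder for an entailment score (not a real NLI model)."""
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().rstrip(".").split())
    return len(premise_words & hypothesis_words)

labels = ["joy", "anger"]
validation = [
    ("I feel so much joy today", "joy"),
    ("this fills me with anger", "anger"),
]
templates = ["This example is {}.", "This text expresses {}."]

results = {
    template: template_accuracy(template, validation, labels, toy_score_pair)
    for template in templates
}
for template, accuracy in results.items():
    print(f"{template!r}: {accuracy:.2f}")
```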
For LLM prompts the design choices include the wording of the instruction, the order in which labels are listed, and whether definitions are included. Multi-label classification adds another decision: in the NLI formulation, multi-label inference runs each label hypothesis independently and applies an entailment threshold, while single-label inference normalizes scores across labels with a softmax.
In the generalized variant, the test label set Z includes both labels seen during fine-tuning and labels that were never seen. This is harder than pure zero-shot because classifiers tend to assign higher confidence to seen labels and starve unseen labels of probability mass. The text counterpart appears most often in intent detection, where a chatbot must recognize both trained intents and emerging intents discovered from logs. Approaches include rescaling NLI logits for unseen labels and combining NLI scores with a retrieval signal from labeled support sets.
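One simple form of rescaling adds a constant bonus to the logits of unseen labels before taking the argmax. In the sketch below the intent names, the logits, and the `unseen_bonus` value are hypothetical. In practice the bonus would be tuned on validation data that contains both seen and unseen intents.

```python
def rescale_unseen(logits, seen_labels, unseen_bonus=1.0):
    """Add a constant to unseen-label logits to offset bias toward seen labels."""
    return {
        label: value + (0.0 if label in seen_labels else unseen_bonus)
        for label, value in logits.items()
    }

# Hypothetical entailment logits for one user utterance.
logits = {"check_balance": 2.1, "reset_password": 1.4, "report_fraud": 1.6}
seen_labels = {"check_balance", "reset_password"}  # intents used in fine-tuning

before = max(logits, key=logits.get)
adjusted = rescale_unseen(logits, seen_labels, unseen_bonus=1.0)
after = max(adjusted, key=adjusted.get)
print(before, after)
```

A larger bonus recovers more unseen intents at the cost of more false alarms on seen ones, so the value trades accuracy on one group against the other.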
Zero-shot classification is a default option in settings where labeled data is missing, expensive, or rapidly changing.
Zero-shot classification has well-documented failure modes. Hypothesis sensitivity means accuracy varies with template wording, so practitioners must test multiple templates on a small validation set before trusting a deployment. Calibration is often poor, especially for multi-label settings where decisions depend on absolute scores rather than relative ranks. Performance degrades on abstract, technical, or domain-specific categories that have little support in the NLI or LLM pretraining data.
Language coverage is uneven; most public NLI checkpoints are trained on English corpora such as MultiNLI and ANLI. Pretrained NLI models also inherit dataset artifacts such as hypothesis-only biases. LLM classifiers can be expensive at high request volumes, and outputs may drift across model versions, which complicates reproducibility. Evaluation itself is challenging because conventional metrics assume fixed label sets and matched train-test distributions, neither of which holds in zero-shot settings.
Zero-shot classification is rarely competitive with a well-resourced supervised classifier when sufficient labeled data exists. The standard workflow is to use zero-shot to bootstrap an annotator, then move to a fine-tuned encoder once data has accumulated.