Zero-Shot Classification Models
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 4,208 words
Zero-shot classification models are machine learning systems that assign input text to one or more categories from a candidate set without having seen labeled training examples for those specific categories. Instead of per-class supervised data, these models use auxiliary information about the labels themselves: the natural language form of a category name, an entailment relation between text and a class-name hypothesis, or a shared embedding space for input and labels. Within natural language processing, zero-shot classification covers topic labeling, sentiment categorization, intent detection, customer support routing, news triage, and content moderation.
This article focuses on zero-shot classification for text. Models that perform similar generalization for images are covered in Zero-Shot Image Classification Models. The related setting in which a handful of labeled examples is provided per class is discussed under Few-Shot Learning. See also Text Classification Models for supervised and semi-supervised approaches to the same tasks.
See also: Natural Language Processing Models and Tasks
In standard supervised classification, a model is trained on a fixed label set Y and learns a direct mapping from text x to a probability distribution over the labels y in Y. Zero-shot classification removes the assumption that the test label set was observed during training: the model receives a new label set Z at inference time and must produce a score for each z in Z using only the natural language form of the labels and optional descriptions.
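The decision rule in this setting can be written compactly. The notation below is introduced here for illustration and is not taken from the cited papers: $d(z)$ stands for the text of label $z$, optionally extended with a description, and $s_\theta(x, d(z))$ stands for any pretrained scoring function, such as an entailment logit or an embedding similarity. The softmax on the right applies to the single-label case only.

```latex
\hat{z} = \operatorname*{arg\,max}_{z \in Z} \; s_\theta\bigl(x, d(z)\bigr),
\qquad
p(z \mid x) = \frac{\exp s_\theta\bigl(x, d(z)\bigr)}{\sum_{z' \in Z} \exp s_\theta\bigl(x, d(z')\bigr)}
```

The parameters $\theta$ are fixed before $Z$ is known, which is what separates this rule from the supervised mapping over $Y$.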
Three settings sit on a spectrum. Supervised classification has many labeled examples per class. Few-Shot Learning provides between one and roughly fifty examples per class, often through in-context prompting. Zero-shot classification provides zero per-class examples, with the model relying on prior knowledge from pretraining plus the label text itself. A variant called generalized zero-shot classification mixes seen and unseen classes at test time and is harder than pure zero-shot, because models tend to be biased toward classes encountered during fine-tuning.
The term zero-shot learning was popularized by computer vision research in the late 2000s. Hugo Larochelle and collaborators introduced zero-data learning in 2008, showing that a network trained on a task description could solve unseen variants. Mark Palatucci, Dean Pomerleau, Geoffrey Hinton, and Tom Mitchell formalized the idea for classification in their 2009 paper "Zero-shot Learning with Semantic Output Codes," which decoded fMRI signals into words that had no training examples by routing predictions through a semantic feature space. Later vision work, including Yongqin Xian and collaborators' "Zero-Shot Learning, the Good, the Bad and the Ugly," defined the generalized zero-shot setup and benchmarked attribute-based methods.
Text classification followed a similar arc. Early systems combined word embeddings such as Word2vec and GloVe with class-label vectors to score documents by cosine similarity. Pretrained transformer encoders made it practical to recast classification as another language task, leading to two breakthroughs that define modern text zero-shot classification: the entailment reformulation of Yin, Hay, and Roth in 2019, and the in-context prompting capability of GPT-3 in 2020.
The period from 2021 to 2023 saw rapid consolidation. The BART-large-MNLI model became the dominant open default for practitioners through its integration into the Hugging Face Transformers zero-shot-classification pipeline. Moritz Laurer's 2022 series of DeBERTa-v3 checkpoints fine-tuned on stacked NLI corpora pushed entailment-based accuracy further. Instruction-tuned models such as Flan-T5 demonstrated that training on task-description templates generalized reliably to unseen classification tasks. In 2026, the BTZSC benchmark (arXiv:2603.11991, accepted at ICLR 2026) systematically compared 38 models across 22 datasets and found that modern reranker-class models had overtaken traditional NLI cross-encoders, while instruction-tuned LLMs at 4 to 12 billion parameters had narrowed the gap while still trailing.
Contemporary zero-shot text classification falls into three broad approaches: natural language inference, prompted generative language models, and embedding similarity.
The natural language inference approach treats classification as an entailment problem. The input text is the premise, and for each candidate label the system constructs a short hypothesis sentence using a template such as "This text is about politics" or "This example expresses anger." A pretrained NLI classifier scores the probability that the premise entails the hypothesis, and the label with the highest entailment probability is returned. This formulation was introduced for text by Wenpeng Yin, Jamaal Hay, and Dan Roth in their 2019 EMNLP paper "Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach" (arXiv:1909.00161). They showed that an NLI model fine-tuned on MultiNLI could classify topics, emotions, and situations on datasets it had never been trained on, often matching more complex methods of the time.
The mechanics of the entailment scoring step are worth unpacking. An NLI model produces three logits per premise-hypothesis pair: entailment, neutral, and contradiction. For single-label classification the entailment scores across all candidate labels are passed through a softmax together, turning them into a probability distribution over labels. For multi-label classification each label is scored independently and an entailment threshold is applied, because labels are not mutually exclusive and a shared softmax would suppress co-occurring categories.
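The two scoring modes can be sketched in plain Python. The logits below are made up for illustration, and the 0.5 threshold is a hypothetical value. The multi-label function normalizes each label's entailment logit against its contradiction logit and ignores neutral, which mirrors what the Hugging Face pipeline does in multi-label mode; other implementations may normalize differently.

```python
import math

def single_label_probs(entailment_logits):
    """Softmax the entailment logits across all candidate labels."""
    peak = max(entailment_logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - peak) for label, v in entailment_logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def multi_label_probs(nli_logits):
    """Score each label on its own: entailment vs. contradiction, neutral ignored."""
    probs = {}
    for label, (entail, _neutral, contra) in nli_logits.items():
        # Two-way softmax written as a sigmoid of the logit difference.
        probs[label] = 1.0 / (1.0 + math.exp(contra - entail))
    return probs

# Made-up logits for one premise, ordered (entailment, neutral, contradiction).
nli_logits = {
    "technology": (3.1, 0.2, -2.0),
    "finance": (1.9, 0.5, -1.0),
    "politics": (-2.2, 0.1, 2.5),
}

single = single_label_probs({label: triple[0] for label, triple in nli_logits.items()})
multi = multi_label_probs(nli_logits)
selected = [label for label, p in multi.items() if p >= 0.5]  # hypothetical threshold

print(single)    # one distribution over labels, sums to 1
print(selected)  # every label that clears the threshold on its own
```

In the single-label output the two plausible labels compete for the same probability mass, while in the multi-label output both can score high at once.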
The NLI approach is widely available on the Hugging Face Hub through the zero-shot-classification pipeline. The most downloaded such model, facebook/bart-large-mnli, is the BART encoder-decoder (a bidirectional encoder paired with an autoregressive decoder) fine-tuned on MultiNLI. The fine-tuning stage adapts the architecture to the three-class inference task of predicting entailment, neutral, or contradiction for each premise-hypothesis pair.
DeBERTa-v3-large-mnli-fever-anli-ling-wanli, released by Moritz Laurer in 2022, extended this approach by stacking five NLI corpora: MultiNLI, Fever-NLI, Adversarial NLI (ANLI), LingNLI, and WANLI, totaling 885,242 hypothesis-premise pairs. The resulting model was the top-ranked NLI model on the Hugging Face Hub as of June 2022 and achieved state-of-the-art scores on the ANLI adversarial benchmark. At roughly 435 million parameters and a throughput of approximately 980 text pairs per second on an A100 GPU, it offered the strongest available accuracy-throughput balance for NLI-based zero-shot classification at the time.
The second paradigm uses a large language model such as GPT-3, GPT-4, Claude, or Llama 3 to perform classification directly through a natural language prompt. A typical prompt provides the input text, a list of candidate labels with optional definitions, and an instruction to return the most appropriate label. The seminal paper is Tom Brown and collaborators' "Language Models are Few-Shot Learners" (arXiv:2005.14165), which introduced GPT-3 and showed that very large language models can solve a wide range of classification and reasoning tasks with no gradient updates. Instruction-tuned models such as Flan-T5 by Hyung Won Chung and collaborators (arXiv:2210.11416) made zero-shot prompting reliable on smaller open models by training on thousands of task descriptions.
Prompted classification is the most flexible paradigm because the model can read multi-sentence class definitions and handle hierarchical taxonomies. It is also the most expensive per call and the most sensitive to prompt phrasing. A 2023 IBM study published at EMNLP Findings ("Zero-shot Topical Text Classification with LLMs: an Experimental Study") evaluated LLMs including Flan-T5-XXL across 23 topical classification datasets and found that task-specific fine-tuning on top of instruction-tuned bases improved accuracy further, suggesting that pure prompting leaves accuracy on the table for well-defined topic taxonomies. A 2024 study by Bucher and Martino (arXiv:2406.08660) found that fine-tuned smaller models still outperform zero-shot LLMs by roughly 10 to 25 percentage points on fine-grained intent and topic classification, though the gap narrows on coarse-grained tasks where label definitions are unambiguous.
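A minimal sketch of the prompt-and-parse loop is shown below. The function names, prompt wording, and example labels are all hypothetical, and the step that sends the prompt to a model is deliberately left out because it depends on the client library in use. The parser is strict on purpose: a reply that cannot be mapped to exactly one candidate label returns None, so that free-text answers are not silently accepted.

```python
def build_prompt(text, labels, definitions=None):
    """Assemble a single-label zero-shot classification prompt."""
    definitions = definitions or {}
    lines = ["Classify the text into exactly one of the following categories.", ""]
    for label in labels:
        definition = definitions.get(label)
        lines.append(f"- {label}: {definition}" if definition else f"- {label}")
    lines += ["", f"Text: {text}", "", "Answer with the category name only."]
    return "\n".join(lines)

def parse_label(response, labels):
    """Map a free-text model reply back onto the candidate set, or return None."""
    cleaned = response.strip().strip(".\"'").lower()
    for label in labels:
        if cleaned == label.lower():
            return label
    # Fall back to a unique substring match; ambiguous replies yield None.
    hits = [label for label in labels if label.lower() in cleaned]
    return hits[0] if len(hits) == 1 else None

labels = ["billing", "shipping"]
prompt = build_prompt(
    "My card was charged twice.",
    labels,
    definitions={"billing": "charges, refunds, and invoices"},
)
print(prompt)
```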
The third paradigm computes a sentence embedding for the input and a separate embedding for each candidate label, then assigns the label whose embedding has the highest cosine similarity with the input. Sentence-BERT, introduced by Nils Reimers and Iryna Gurevych in their 2019 EMNLP paper "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks" (arXiv:1908.10084), gave a fast way to produce semantically meaningful sentence vectors at scale. Modern embedding APIs from OpenAI, Cohere, and open models such as BAAI BGE are commonly used in this setting.
Embedding similarity is the cheapest paradigm at inference time and is often used as a baseline or as a first-stage retriever. Its accuracy on abstract or contrastive labels tends to be lower than NLI or LLM methods, because the model has no explicit signal that labels should be treated as decision boundaries rather than just topics. The 2026 BTZSC benchmark found that strong embedding models offer the best accuracy-to-latency trade-off across its 22 datasets, making them a practical first choice for high-volume pipelines where latency budgets are tight. Reranker-class models, which apply a cross-attention scoring step similar to NLI cross-encoders but trained explicitly for ranking rather than entailment, achieved the highest overall accuracy in the same benchmark.
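The embedding-similarity rule itself is short enough to sketch directly. The three-dimensional vectors below are toy values standing in for real sentence embeddings, which would come from an encoder such as Sentence-BERT and have hundreds of dimensions; the function names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify_by_embedding(text_vec, label_vecs):
    """Return the most similar label and the full score table."""
    scores = {label: cosine(text_vec, vec) for label, vec in label_vecs.items()}
    return max(scores, key=scores.get), scores

# Toy vectors; in practice these are produced by a sentence-embedding model.
label_vecs = {
    "sports": [0.9, 0.1, 0.0],
    "cooking": [0.1, 0.9, 0.1],
    "travel": [0.0, 0.2, 0.9],
}
text_vec = [0.8, 0.3, 0.1]

best, scores = classify_by_embedding(text_vec, label_vecs)
print(best, scores)
```

Because label embeddings do not depend on the input, they can be computed once and cached, which is where the latency advantage comes from.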
The following table lists widely used models.
| Model | Year | Developer | Size | Approach | Key trait |
|---|---|---|---|---|---|
| BART large MNLI (facebook/bart-large-mnli) | 2019 | Meta AI | 0.4B | NLI entailment | Default Hugging Face zero-shot pipeline, 2.9M+ monthly downloads |
| RoBERTa large MNLI | 2019 | Meta AI | 0.355B | NLI entailment | Strong NLI baseline in early benchmarks |
| DeBERTa v3 large MNLI Fever ANLI Ling WANLI | 2022 | Moritz Laurer | 0.435B | NLI entailment | Trained on 885k NLI pairs; top NLI score on Hugging Face Hub in 2022; ~980 pairs/sec on A100 |
| mDeBERTa v3 base XNLI multilingual NLI 2mil7 | 2022 | Moritz Laurer | 0.185B | Multilingual NLI entailment | Supports 100 languages via cross-lingual transfer; trained on XNLI and multilingual-NLI-26lang-2mil7 |
| TARS | 2020 | Halder, Akbik et al. | 0.110B | Universal binary classifier | COLING 2020; shipped in the Flair NLP library |
| GPT-3 | 2020 | OpenAI | 175B | Prompted LLM | First wide demonstration of zero-shot prompting at scale |
| Flan T5 large and XXL | 2022 | Google Research | 0.78B and 11B | Instruction-tuned LLM | Open-weight, strong on classification and reasoning; outperforms GPT-3 5-shot at 3B scale |
| GPT-4 | 2023 | OpenAI | undisclosed | Prompted LLM | High-accuracy reference classifier |
| Claude 3 family | 2024 | Anthropic | undisclosed | Prompted LLM | Long context suits multi-paragraph documents |
| Llama 3 instruction tuned | 2024 | Meta AI | 8B and 70B | Prompted LLM | Open-weight self-hosted option |
| Sentence-BERT all-mpnet-base-v2 | 2020 | UKP Lab | 0.110B | Embedding similarity | Common baseline for embedding-based classification |
Because the NLI-based approach is the most widely deployed, it is worth tracing the full inference path.
The three-logit output matters for scoring. Only the entailment logit is treated as evidence for a label, because only entailment captures the claim that the text "is about" the candidate topic. The neutral logit captures indeterminate cases and the contradiction logit captures explicit refutation, so neither counts as positive evidence for topic or sentiment classification. In the Hugging Face pipeline's multi-label mode, the contradiction logit serves only as the normalizing counterpart to entailment, and the neutral logit is discarded.
This pipeline requires only the model weights and a Python environment; no fine-tuning, labeled data, or task-specific configuration is needed beyond the template. The Hugging Face Transformers library ships the pipeline, which can be instantiated and called in a few lines:
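```python
from transformers import pipeline

# Load the NLI model behind the zero-shot pipeline (weights download on first use).
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier("Apple just announced a new MacBook Pro.", ["technology", "finance", "politics"])
print(result["labels"][0], result["scores"][0])  # top-ranked label and its score
```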
The hypothesis template chosen for an NLI classifier has a measurable effect on accuracy. The default in the Hugging Face zero-shot-classification pipeline is "This example is {label}." Yin and collaborators reported that small phrasing changes can move topic classification accuracy by several points on Yahoo Answers, and affective templates such as "This text expresses {emotion}" outperform generic ones. Templates may include definitional content, for example "This message is a complaint about a product or service," which performs closer to LLM prompts on nuanced categories.
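Template expansion is a plain string operation, sketched below. The helper name and the `overrides` argument are hypothetical; they illustrate how a generic template and hand-written definitional hypotheses can be mixed in one label set. The named `{label}` placeholder follows the notation used in this article, whereas the Transformers pipeline's `hypothesis_template` argument uses a bare `{}` placeholder.

```python
DEFAULT_TEMPLATE = "This example is {label}."

def build_pairs(premise, labels, template=DEFAULT_TEMPLATE, overrides=None):
    """Return one (premise, hypothesis) pair per candidate label.

    `overrides` maps a label to a complete hand-written hypothesis that
    replaces the templated one, for example a definitional sentence.
    """
    overrides = overrides or {}
    return [
        (premise, overrides.get(label, template.format(label=label)))
        for label in labels
    ]

pairs = build_pairs(
    "The blender broke after two days.",
    ["complaint", "question"],
    template="This message is a {label}.",
    overrides={"complaint": "This message is a complaint about a product or service."},
)
for premise, hypothesis in pairs:
    print(hypothesis)
```

Keeping the pair construction separate from scoring makes it easy to run the same validation set through several candidate templates and compare accuracy.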
Template selection research has identified several practical patterns:
For LLM prompts, the design choices include the wording of the instruction, the order in which labels are listed, and whether definitions are included. Research on prompt engineering for classification has found that label order can shift predictions by several points on borderline examples. A 2026 study (arXiv:2602.04297) specifically examined "prompt underspecification" and found that ambiguous instructions, such as prompts that do not clarify whether to choose the single best label or all applicable labels, inflate variance across model versions and sizes.
Multi-label classification adds another decision: in the NLI formulation, multi-label inference runs each label hypothesis independently and applies an entailment threshold, while single-label inference normalizes scores across labels with a softmax. In LLM prompting, multi-label classification requires an explicit instruction such as "return all categories that apply, separated by commas."
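On the LLM side, a multi-label reply has to be split and validated before use. The sketch below assumes the model was asked for a comma-separated list; the function name and example labels are hypothetical. Anything the model returns that is not in the candidate set is dropped rather than trusted.

```python
MULTI_LABEL_INSTRUCTION = "Return all categories that apply, separated by commas."

def parse_multi_label(reply, labels):
    """Keep only known labels from a comma-separated reply, in candidate order."""
    mentioned = {part.strip(" .\n\"'").lower() for part in reply.split(",")}
    return [label for label in labels if label.lower() in mentioned]

labels = ["billing", "shipping", "account"]
print(parse_multi_label("Shipping, billing.", labels))
```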
Most text zero-shot evaluation reuses topic, sentiment, and intent corpora from the supervised literature, splitting the label set so that some classes are held out. The most cited benchmark in the NLI lineage is the Yin et al. 2019 suite, which evaluates on topic, emotion, and situation datasets with explicit zero-shot splits. IBM's TTC23, introduced in 2023, expanded coverage to 23 topical datasets and served as the primary benchmark for the EMNLP 2023 study comparing LLM zero-shot methods. The 2026 BTZSC benchmark (arXiv:2603.11991) is the most comprehensive unified evaluation, covering 38 models across 22 datasets spanning sentiment, topic, intent, and emotion classification.
| Dataset | Domain | Classes | Test size | Typical use |
|---|---|---|---|---|
| Yahoo Answers Topics | Q&A forum | 10 | 60,000 | Topic zero-shot benchmark in Yin et al. 2019 |
| AG News | Newswire | 4 | 7,600 | News topic classification |
| DBpedia 14 | Wikipedia abstracts | 14 | 70,000 | Fine-grained topic classification |
| Emotion (CARER) | Twitter messages | 6 | 2,000 | Emotion classification |
| ISEAR | Self-report responses | 7 | 1,533 | Cross-cultural emotion classification |
| 20 Newsgroups | Newsgroup posts | 20 | 7,532 | Long-running topic baseline |
| BoolQ | Wikipedia passages | 2 | 3,270 | Yes-or-no probe |
| RTE | News and Wikipedia | 2 | 3,000 | Textual entailment in SuperGLUE |
| SciCite | Citation contexts | 3 | 1,861 | Citation intent classification |
| Topical-Chat | Grounded dialogue | 8 | varies | Conversation topic detection |
| MultiNLI | Multi-genre pairs | 3 | 19,648 | Training data for NLI entailment models |
| ANLI | Adversarial NLI | 3 | 3,200 | Stress test for entailment classifiers |
| TTC23 | 23 topic domains | varies | varies | IBM 2023 topical LLM study |
| BANKING77 | Banking intents | 77 | 3,080 | Fine-grained intent; large label set |
Holistic benchmarks such as MMLU, HELM, and BIG-bench provide indirect signals of LLM zero-shot competence through many multiple-choice problems with unseen labels.
The three paradigms have distinct cost and accuracy profiles. NLI-based classifiers are the smallest and fastest: a 0.4 billion parameter BART or DeBERTa NLI model can score on the order of a thousand premise-hypothesis pairs per second on a single GPU, which suits batch labeling and topic-style problems. Entailment probabilities can be thresholded for selective prediction, although thresholds should be tuned on validation data because raw scores are often poorly calibrated.
Prompted LLM classifiers offer the strongest accuracy on hard categories, including those that require multi-sentence definitions, contextual reasoning, or knowledge that a 400 million parameter encoder does not contain. Their compute cost is one to three orders of magnitude higher than NLI for equivalent accuracy on simple problems, but the trade-off reverses on complex categories where a small NLI model fails outright. Zero-shot prompting still trails supervised training: the 2024 study by Bucher and Martino (arXiv:2406.08660) measured a gap of roughly 10 to 25 percentage points in favor of fine-tuned smaller models over zero-shot LLMs across diverse classification benchmarks, with the widest gaps on fine-grained intent sets and the narrowest on simple topic sets.
Embedding similarity is the cheapest and is often used as a fast first stage. For 10,000 candidate labels and one input, dot products against precomputed label embeddings are faster than 10,000 NLI calls. Accuracy lags the other paradigms on standalone topic problems, but pure embedding scoring is competitive on concrete domains and pairs well with NLI or LLM reranking. The BTZSC 2026 benchmark identified a fourth architecture class, rerankers, which apply cross-attention scoring at inference time but are trained explicitly for ranking rather than NLI; these achieved the highest average accuracy across the 22 datasets.
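The two-stage pattern, a cheap embedding shortlist followed by a slower reranking model, can be sketched as follows. The toy vectors, the `rerank` callable, and the stub that stands in for an NLI or LLM scorer are all assumptions made for illustration; only the control flow is the point.

```python
import heapq

def shortlist(text_vec, label_vecs, k):
    """Return the k labels with the largest dot product against the input vector."""
    def dot(label):
        return sum(a * b for a, b in zip(text_vec, label_vecs[label]))
    return heapq.nlargest(k, label_vecs, key=dot)

def retrieve_then_rerank(text, text_vec, label_vecs, rerank, k=2):
    """Shortlist with embeddings, then let a slower scorer pick among the survivors."""
    candidates = shortlist(text_vec, label_vecs, k)
    scores = rerank(text, candidates)  # dict: label -> score from the slower model
    return max(scores, key=scores.get), candidates

# Toy precomputed label vectors; a real taxonomy could hold thousands of entries.
label_vecs = {
    "refund": [1.0, 0.0],
    "chargeback": [0.8, 0.6],
    "shipping delay": [0.0, 1.0],
}

def stub_rerank(text, candidates):
    """Stand-in for an NLI or LLM scorer; it simply prefers 'chargeback'."""
    return {c: (1.0 if c == "chargeback" else 0.5) for c in candidates}

winner, candidates = retrieve_then_rerank(
    "My bank reversed the payment.", [0.9, 0.2], label_vecs, stub_rerank
)
print(winner, candidates)
```

The slower model is called on k labels instead of the full taxonomy, so its cost stays constant as the label set grows.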
A practical guideline is to start with NLI for cheap labels, escalate to an LLM for difficult decisions, and use embeddings to scale to taxonomies with thousands of candidate labels.
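The escalation part of this guideline amounts to a confidence gate. In the sketch below, `nli_scorer` and `llm_classifier` are hypothetical callables supplied by the caller, the stubs are placeholders for real models, and the 0.7 cutoff is an arbitrary example value that would need tuning on validation data.

```python
def cascade(text, labels, nli_scorer, llm_classifier, min_confidence=0.7):
    """Accept the cheap NLI answer when it is confident, otherwise escalate."""
    scores = nli_scorer(text, labels)  # dict: label -> probability
    best = max(scores, key=scores.get)
    if scores[best] >= min_confidence:
        return best, "nli"
    return llm_classifier(text, labels), "llm"

def stub_nli(text, labels):
    """Placeholder scorer: confident only when the text mentions a refund."""
    if "refund" in text:
        return {"billing": 0.9, "other": 0.1}
    return {"billing": 0.55, "other": 0.45}

def stub_llm(text, labels):
    """Placeholder for a prompted LLM call."""
    return "other"

print(cascade("I want a refund", ["billing", "other"], stub_nli, stub_llm))
print(cascade("Something odd happened", ["billing", "other"], stub_nli, stub_llm))
```

Returning the route alongside the label makes it possible to track what fraction of traffic reaches the expensive model.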
| Paradigm | Parameter scale | Latency (relative) | Typical accuracy on easy topics | Typical accuracy on fine-grained intents | Supports multi-label | Cost at scale |
|---|---|---|---|---|---|---|
| NLI cross-encoder | 200M to 500M | Fast | High | Moderate | Yes (threshold per label) | Low |
| Embedding similarity | 100M to 500M | Very fast | Moderate | Low to moderate | Yes (threshold) | Very low |
| Reranker | 300M to 1B | Moderate | Very high | High | Yes | Low to moderate |
| Instruction-tuned LLM (4-12B) | 4B to 12B | Slow | High | High | Yes (with prompt) | Moderate to high |
| Large LLM (70B+) | 70B+ | Very slow | Very high | Very high | Yes (with prompt) | High |
In the generalized variant, the test label set Z includes both labels seen during fine-tuning and labels that were never seen. This is harder than pure zero-shot because classifiers tend to assign higher confidence to seen labels and starve unseen labels of probability mass. The text counterpart appears most often in intent detection, where a chatbot must recognize both trained intents and emerging intents discovered from logs. Approaches include rescaling NLI logits for unseen labels and combining NLI scores with a retrieval signal from labeled support sets.
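One simple form of rescaling is to subtract a fixed penalty from the logits of seen labels, which is equivalent to boosting unseen ones. The sketch below is an illustration under that assumption rather than a method taken from a specific cited source; the intent names, logits, and penalty are invented, and the penalty would normally be tuned on held-out data.

```python
def penalize_seen(logits, seen_labels, penalty):
    """Lower the score of every label that was present during fine-tuning."""
    return {
        label: (value - penalty if label in seen_labels else value)
        for label, value in logits.items()
    }

# Invented entailment logits for one utterance.
logits = {"check_balance": 2.0, "transfer_money": 0.5, "report_scam": 1.7}
seen = {"check_balance", "transfer_money"}

before = max(logits, key=logits.get)
adjusted = penalize_seen(logits, seen, penalty=0.5)
after = max(adjusted, key=adjusted.get)
print(before, after)
```

Here the emerging intent wins only after the adjustment, which shows how a modest bias toward seen labels can hide an unseen label that scores almost as well.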
NLI-based zero-shot classification extends naturally to multilingual settings through cross-lingual NLI models. Moritz Laurer's mDeBERTa-v3-base-xnli-multilingual-nli-2mil7, released in 2022, was trained on more than 2.7 million hypothesis-premise pairs across 27 languages and supports zero-shot classification in all 100 languages covered by the mDeBERTa-v3 pretraining corpus. It achieves 87.1% accuracy on English MNLI and maintains strong performance on held-out languages through cross-lingual transfer, typically above 80% accuracy on the XNLI evaluation set.
The multilingual pipeline is operationally identical to the monolingual one: the practitioner writes label hypotheses in English, the model scores them against input text in any supported language, and the entailment probabilities are returned. This makes multilingual classification practical for organizations that do not have NLI training data in their target languages, as long as the languages are covered by the multilingual pretraining corpus.
LLM-based zero-shot classification is also strong multilingually for well-resourced languages, because large instruction-tuned models such as GPT-4 and Claude 3 are trained on multilingual corpora. Performance degrades on low-resource languages for both NLI and LLM approaches, because both depend on the pretraining distribution.
Zero-shot classification is a default option in settings where labeled data is missing, expensive, or rapidly changing.
Zero-shot classification has well-documented failure modes. Hypothesis sensitivity means accuracy varies with template wording, so practitioners must test multiple templates on a small validation set before trusting a deployment. Calibration is often poor, especially for multi-label settings where decisions depend on absolute scores rather than relative ranks. Performance degrades on abstract, technical, or domain-specific categories that have little support in the NLI or LLM pretraining data.
Label phrasing ambiguity is a specific failure mode distinct from template sensitivity. When two candidate labels overlap in meaning ("complaint" and "negative feedback") the model may split probability mass between them unpredictably. Explicit disambiguation through definitional templates or LLM prompts with definitions mitigates this, at the cost of requiring practitioners to write those definitions.
Language coverage is uneven; most public NLI checkpoints are trained on English corpora such as MultiNLI and ANLI. Pretrained NLI models also inherit dataset artifacts such as hypothesis-only biases. LLM classifiers can be expensive at high request volumes, and outputs may drift across model versions, which complicates reproducibility. Evaluation itself is challenging because conventional metrics assume fixed label sets and matched train-test distributions, neither of which holds in zero-shot settings.
Zero-shot classification is rarely competitive with a well-resourced supervised classifier when sufficient labeled data exists. The 2024 study by Bucher and Martino found that fine-tuned small LLMs outperformed zero-shot generative models by a large margin across diverse classification benchmarks, reinforcing the standard workflow: use zero-shot to bootstrap an annotator, then move to a fine-tuned encoder once data has accumulated. The BTZSC 2026 benchmark similarly found that NLI cross-encoders show diminishing returns with scale, suggesting that simply making the NLI model larger does not close the gap with supervised approaches.