# Zero-Shot Classification Models

> Source: https://aiwiki.ai/wiki/zero-shot_classification_models
> Updated: 2026-07-16
> Categories: AI Models, Natural Language Processing
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Zero-shot classification models** are machine learning systems that assign input text to a set of candidate categories without having seen labeled training examples for those specific categories. Instead of per-class supervised data, these models use auxiliary information about the labels themselves: the natural language form of a category name, an entailment relation between text and a class-name hypothesis, or a shared embedding space for input and labels. Within [natural language processing](/wiki/natural_language_processing), zero-shot classification covers topic labeling, sentiment categorization, [intent detection](/wiki/intent_detection), customer support routing, news triage, and content moderation.

This article focuses on zero-shot classification for text. Models that perform similar generalization for images are covered in [Zero-Shot Image Classification Models](/wiki/zero-shot_image_classification_models). The related setting in which a handful of labeled examples are provided per class is discussed at [Few-Shot Learning](/wiki/few-shot_learning). See also [Text Classification Models](/wiki/text_classification_models) for supervised and semi-supervised approaches to the same tasks.

*See also: [Natural Language Processing Models](/wiki/natural_language_processing_models) and Tasks*

## Definition and contrast with related settings

In standard supervised classification a model is trained on a fixed label set Y and learns a direct mapping from text x to a probability distribution over y in Y. Zero-shot classification removes the assumption that the test label set was observed during training: the model receives a new label set Z at inference time and must produce a score for each z in Z using only the natural language form of the labels and optional descriptions.[1]

Three settings sit on a spectrum. Supervised classification has many labeled examples per class. [Few-Shot Learning](/wiki/few-shot_learning) provides between one and roughly fifty examples per class, often through in-context prompting.[4] Zero-shot classification provides zero per-class examples, with the model relying on prior knowledge from pretraining plus the label text itself. A variant called generalized zero-shot classification mixes seen and unseen classes at test time and is harder than pure zero-shot, because models tend to be biased toward classes encountered during fine-tuning.[9]

## Historical context

The term zero-shot learning was popularized by computer vision research in the late 2000s. Hugo Larochelle and collaborators introduced zero-data learning in 2008, showing that a network trained on a task description could solve unseen variants.[3] Mark Palatucci, Dean Pomerleau, Geoffrey Hinton, and Tom Mitchell formalized the idea for classification in their 2009 paper "Zero-shot Learning with Semantic Output Codes," which decoded fMRI signals into words that had no training examples by routing predictions through a semantic feature space.[2] Later vision work, including Yongqin Xian and collaborators' "Zero-Shot Learning: The Good, the Bad and the Ugly," defined the generalized zero-shot setup and benchmarked attribute-based methods.[9]

Text classification followed a similar arc. Early systems combined [word embeddings](/wiki/word_embedding) such as [Word2vec](/wiki/word2vec) and [GloVe](/wiki/glove) with class-label vectors to score documents by cosine similarity. Pretrained transformer encoders made it practical to recast classification as another language task, leading to two breakthroughs that define modern text zero-shot classification: the entailment reformulation of Yin, Hay, and Roth in 2019,[1] and the in-context prompting capability of [GPT-3](/wiki/gpt-3) in 2020.[4]

The period from 2021 to 2023 saw rapid consolidation. The [BART](/wiki/bart)-large-MNLI model became the dominant open default for practitioners through its integration into the Hugging Face Transformers zero-shot-classification pipeline.[11][12] Moritz Laurer's 2022 series of [DeBERTa](/wiki/deberta)-v3 checkpoints fine-tuned on stacked NLI corpora pushed entailment-based accuracy further.[8] Instruction-tuned models such as Flan-T5 demonstrated that training on task-description templates generalized reliably to unseen classification tasks.[5] In 2026, the BTZSC benchmark (arXiv:2603.11991, accepted at ICLR 2026) systematically compared 38 models across 22 datasets and found that modern reranker-class models had overtaken traditional NLI cross-encoders, while instruction-tuned LLMs at 4 to 12 billion parameters had narrowed the gap from below.[16]

## Three main paradigms

Contemporary zero-shot text classification falls into three broad approaches: [natural language inference](/wiki/natural_language_inference), prompted generative language models, and embedding similarity.

### NLI-based entailment

The natural language inference approach treats classification as an entailment problem. The input text is the premise, and for each candidate label the system constructs a short hypothesis sentence using a template such as "This text is about politics" or "This example expresses anger." A pretrained NLI classifier scores the probability that the premise entails the hypothesis, and the label with the highest entailment probability is returned. This formulation was introduced for text by Wenpeng Yin, Jamaal Hay, and Dan Roth in their 2019 EMNLP paper "Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach," arXiv 1909.00161.[1] They showed that an NLI model fine-tuned on MultiNLI could classify topics, emotions, and situations on datasets it had never been trained on, often matching more complex methods of the time.[1]

The mechanics of the entailment scoring step are worth unpacking. An NLI model produces three logits per premise-hypothesis pair: entailment, neutral, and contradiction. For single-label classification the entailment scores across all candidate labels are passed through a softmax together, turning them into a probability distribution over labels.[11][13] For multi-label classification each label is scored independently and an entailment threshold is applied, because labels are not mutually exclusive and a shared softmax would suppress co-occurring categories.[11][13]
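The two normalization modes can be sketched in a few lines. The example below is a minimal illustration that uses made-up logits in place of a real NLI model; the function names and the 0.5 threshold are assumptions for the sketch, and the multi-label branch normalizes entailment against contradiction for each label, which is one common way to obtain an independent per-label probability.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def single_label_scores(logits_by_label):
    """Softmax the entailment logits across all labels (scores sum to 1)."""
    labels = list(logits_by_label)
    entail = [logits_by_label[label]["entailment"] for label in labels]
    return dict(zip(labels, softmax(entail)))

def multi_label_scores(logits_by_label):
    """Score each label on its own: entailment vs. contradiction, neutral dropped."""
    scores = {}
    for label, logits in logits_by_label.items():
        _, p_entail = softmax([logits["contradiction"], logits["entailment"]])
        scores[label] = p_entail
    return scores

# Made-up logits for one input text; a real NLI model would produce these.
toy_logits = {
    "politics": {"entailment": 2.0, "neutral": 0.1, "contradiction": -1.0},
    "economy": {"entailment": 1.5, "neutral": 0.3, "contradiction": -0.5},
    "sports": {"entailment": -2.0, "neutral": 0.2, "contradiction": 3.0},
}

single = single_label_scores(toy_logits)
multi = multi_label_scores(toy_logits)
THRESHOLD = 0.5  # assumed cut-off; tune on validation data
selected = [label for label, p in multi.items() if p >= THRESHOLD]
```

In this toy data "economy" loses to "politics" under the shared softmax yet clears the per-label threshold comfortably, which is the suppression effect that motivates independent scoring for co-occurring categories.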

The NLI approach is widely available on the Hugging Face Hub through the zero-shot-classification pipeline.[11] The most downloaded such model, facebook/bart-large-mnli, is [BART](/wiki/bart) fine-tuned on MultiNLI.[12] BART pairs a bidirectional encoder with an autoregressive decoder,[10] and the MNLI fine-tuning stage adapts this architecture to the three-class inference task, so the checkpoint plugs directly into the single-label and multi-label scoring schemes described above.[12]

DeBERTa-v3-large-mnli-fever-anli-ling-wanli, released by Moritz Laurer in 2022, extended this approach by stacking five NLI corpora: MultiNLI, Fever-NLI, Adversarial NLI (ANLI), LingNLI, and WANLI, totaling 885,242 hypothesis-premise pairs.[8] The result was the top-ranked NLI model on the Hugging Face Hub as of June 2022 and achieved state-of-the-art scores on the ANLI adversarial benchmark.[8] At roughly 435 million parameters and throughput of approximately 980 text pairs per second on an A100 GPU, it offered the strongest available accuracy-throughput balance for NLI-based zero-shot classification at the time.[8]

### Prompted generative language models

The second paradigm uses a [large language model](/wiki/large_language_model) such as [GPT-3](/wiki/gpt-3), [GPT-4](/wiki/gpt-4), [Claude](/wiki/claude), or [Llama 3](/wiki/llama_3) to perform classification directly through a natural language prompt. A typical prompt provides the input text, a list of candidate labels with optional definitions, and an instruction to return the most appropriate label. The seminal paper is Tom Brown and collaborators' "Language Models are Few-Shot Learners," arXiv 2005.14165, which introduced GPT-3 and showed that very large language models can solve a wide range of classification and reasoning tasks with no gradient updates.[4] Instruction-tuned models such as Flan-T5 by Hyung Won Chung and collaborators, arXiv 2210.11416, made zero-shot prompting reliable on smaller open models by training on thousands of task descriptions.[5]
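A prompt of the kind described above might be assembled as follows. This is a hedged sketch: `build_prompt` and `parse_label` are hypothetical helpers, the instruction wording is only one of many reasonable phrasings, and the completion string stands in for whatever a real LLM client would return.

```python
def build_prompt(text, labels, definitions=None):
    """Assemble a zero-shot classification prompt (wording is illustrative)."""
    definitions = definitions or {}
    lines = ["Classify the text into exactly one of the categories below.", "", "Categories:"]
    for label in labels:
        note = definitions.get(label)
        lines.append(f"- {label}: {note}" if note else f"- {label}")
    lines += ["", f"Text: {text}", "", "Answer with the category name only."]
    return "\n".join(lines)

def parse_label(completion, labels):
    """Map a raw model completion back to a known label, or None if unmatched."""
    cleaned = completion.strip().strip(".\"'").lower()
    for label in labels:
        if cleaned == label.lower():
            return label
    return None

labels = ["billing", "technical support", "account change"]
prompt = build_prompt(
    "I was charged twice this month.",
    labels,
    definitions={"billing": "charges, invoices, and refunds"},
)
# The completion below is a stand-in for whatever the chosen LLM returns.
predicted = parse_label(" Billing.\n", labels)
```

Returning `None` for unmatched completions keeps off-list answers visible to the caller instead of silently coercing them to a label.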

Prompted classification is the most flexible paradigm because the model can read multi-sentence class definitions and handle hierarchical taxonomies. It is also the most expensive per call and the most sensitive to prompt phrasing. A 2023 IBM study published in Findings of EMNLP ("Zero-shot Topical Text Classification with LLMs: an Experimental Study") evaluated LLMs including Flan-T5-XXL across 23 topical classification datasets and found that additional task-specific fine-tuning of instruction-tuned models improved accuracy further, suggesting that pure prompting leaves accuracy on the table for well-defined topic taxonomies.[15] A 2024 study (arXiv:2406.08660) found that fine-tuned smaller models still outperform zero-shot LLMs by roughly 10 to 25 percentage points on fine-grained intent and topic classification, though the gap narrows on coarse-grained tasks where label definitions are unambiguous.[14]

### Embedding similarity

The third paradigm computes a [sentence embedding](/wiki/sentence-bert) for the input and a separate embedding for each candidate label, then assigns the label whose embedding has the highest [cosine similarity](/wiki/cosine_similarity) with the input. Sentence-BERT, introduced by Nils Reimers and Iryna Gurevych in their 2019 EMNLP paper "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks," arXiv 1908.10084, gave a fast way to produce semantically meaningful sentence vectors at scale.[7] Modern embedding APIs from [OpenAI](/wiki/openai), Cohere, and open models such as BAAI BGE are commonly used in this setting.
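The scoring rule itself is small enough to show directly. The sketch below uses toy 3-dimensional vectors as stand-ins for real sentence embeddings; in practice both the text and the label strings would be encoded by the same embedding model, and the function names here are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify_by_similarity(text_vec, label_vecs):
    """Return the label whose embedding is closest to the text embedding."""
    scores = {label: cosine(text_vec, vec) for label, vec in label_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy vectors standing in for real sentence embeddings.
label_vecs = {
    "technology": [0.9, 0.1, 0.0],
    "finance": [0.1, 0.9, 0.1],
    "politics": [0.0, 0.2, 0.9],
}
text_vec = [0.8, 0.3, 0.1]
best_label, sims = classify_by_similarity(text_vec, label_vecs)
```

Because label embeddings do not depend on the input, they can be computed once and reused for every document.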

Embedding similarity is the cheapest paradigm at inference time and is often used as a baseline or as a first-stage retriever. Its accuracy on abstract or contrastive labels tends to be lower than NLI or LLM methods, because the model has no explicit signal that labels should be treated as decision boundaries rather than just topics. The 2026 BTZSC benchmark found that strong embedding models offer the best accuracy-to-latency trade-off across its 22 datasets, making them a practical first choice for high-volume pipelines where latency budgets are tight.[16] Reranker-class models, which apply a cross-attention scoring step similar to NLI cross-encoders but trained explicitly for ranking rather than entailment, achieved the highest overall accuracy in the same benchmark.[16]

## Notable models

The following table lists widely used models.

| Model | Year | Developer | Size | Approach | Key trait |
|-------|------|-----------|------|----------|-----------|
| [BART](/wiki/bart) large MNLI (facebook/bart-large-mnli) | 2019 | Meta AI | 0.4B | NLI entailment | Default Hugging Face zero-shot pipeline, 2.9M+ monthly downloads[12] |
| [RoBERTa](/wiki/roberta) large MNLI | 2019 | Meta AI | 0.355B | NLI entailment | Strong NLI baseline in early benchmarks |
| [DeBERTa](/wiki/deberta) v3 large MNLI Fever ANLI Ling WANLI | 2022 | Moritz Laurer | 0.435B | NLI entailment | Trained on 885k NLI pairs; top NLI score on Hugging Face Hub in 2022; ~980 pairs/sec on A100[8] |
| mDeBERTa v3 base MNLI XNLI | 2022 | Moritz Laurer | 0.185B | Multilingual NLI entailment | Supports 100 languages via cross-lingual transfer; trained on XNLI and multilingual-NLI-26lang-2mil7[18] |
| TARS | 2020 | Halder, Akbik et al. | 0.110B | Universal binary classifier | COLING 2020; shipped in the Flair NLP library[6] |
| [GPT-3](/wiki/gpt-3) | 2020 | OpenAI | 175B | Prompted LLM | First wide demonstration of zero-shot prompting at scale[4] |
| Flan T5 large and XXL | 2022 | Google Research | 0.78B and 11B | Instruction-tuned LLM | Open-weight, strong on classification and reasoning; outperforms GPT-3 5-shot at 3B scale[5] |
| [GPT-4](/wiki/gpt-4) | 2023 | OpenAI | undisclosed | Prompted LLM | High-accuracy reference classifier |
| [Claude](/wiki/claude) 3 family | 2024 | Anthropic | undisclosed | Prompted LLM | Long context suits multi-paragraph documents |
| [Llama 3](/wiki/llama_3) instruction tuned | 2024 | Meta AI | 8B and 70B | Prompted LLM | Open-weight self-hosted option |
| [Sentence-BERT](/wiki/sentence-bert) all-mpnet-base-v2 | 2020 | UKP Lab | 0.110B | Embedding similarity | Common baseline for embedding-based classification |

## How the NLI entailment pipeline works

Because the NLI-based approach is the most widely deployed, it is worth tracing the full inference path.

1. The practitioner selects a set of candidate label strings, for example "politics," "sports," "technology," and "health."
2. A hypothesis template is applied to each label. The Hugging Face default is "This example is {label}."[11] Domain-specific templates such as "This text discusses {label}" or "The topic of this passage is {label}" can be substituted.
3. Each (input-text, hypothesis) pair is passed through the NLI model, which produces entailment, neutral, and contradiction logits.[12]
4. For single-label classification, the entailment logits for all candidate labels are collected into a vector and normalized with a softmax. For multi-label classification, each entailment probability is compared against a threshold independently.[11][13]
5. The label with the highest normalized entailment probability is returned as the prediction.
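
The single-label path through these steps can be sketched with a pluggable scorer. In the example below, `nli_logits` is a hypothetical callable standing in for any NLI model, and `toy_nli_logits` is a crude word-overlap stub used only so the sketch runs without model weights; it is not a real entailment model.

```python
import math

def zero_shot_classify(text, labels, nli_logits, template="This example is {}."):
    """Steps 2 to 5: build hypotheses, score pairs, softmax, rank labels.

    `nli_logits(premise, hypothesis)` is a hypothetical callable returning
    (entailment, neutral, contradiction) logits from any NLI model.
    """
    hypotheses = [template.format(label) for label in labels]
    entail = [nli_logits(text, hyp)[0] for hyp in hypotheses]
    peak = max(entail)
    exps = [math.exp(e - peak) for e in entail]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)

def toy_nli_logits(premise, hypothesis):
    """Stand-in scorer: rewards word overlap; NOT a real NLI model."""
    label_word = hypothesis.rstrip(".").split()[-1].lower()
    hit = label_word in premise.lower()
    return (3.0, 0.0, -3.0) if hit else (-1.0, 0.5, 1.0)

ranked = zero_shot_classify(
    "New sports stadium opens downtown.",
    ["politics", "sports", "technology", "health"],
    toy_nli_logits,
)
top_label, top_prob = ranked[0]
```

Swapping the stub for a real model only requires a function with the same signature, which keeps the template logic testable on its own.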

The three-logit output is important. Only the entailment side of the inference captures the claim that the text "is about" the candidate topic, so the full three-way distribution is not used as the label score.[13] The neutral logit, which captures indeterminate cases, is discarded. The contradiction logit captures explicit refutation; it plays no role in single-label scoring, although the Hugging Face pipeline's multi-label mode uses it as the per-label counterweight when converting each entailment logit into an independent probability.[11]

This pipeline requires only the model weights and a Python environment; no fine-tuning, labeled data, or task-specific configuration is needed beyond the template. The Hugging Face Transformers library ships the pipeline as a two-line instantiation:[11]

```python
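from transformers import pipeline

# Load an NLI checkpoint behind the zero-shot-classification pipeline.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
# The result is a dict with "sequence", "labels" (sorted by score), and "scores".
result = classifier("Apple just announced a new MacBook Pro.", ["technology", "finance", "politics"])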
```

## Hypothesis templates and prompt design

The hypothesis template chosen for an NLI classifier has a measurable effect on accuracy. The default in the Hugging Face zero-shot-classification pipeline is "This example is {label}."[11] Yin and collaborators reported that small phrasing changes can move topic classification accuracy by several points on Yahoo Answers, and affective templates such as "This text expresses {emotion}" outperform generic ones.[1] Templates may include definitional content, for example "This message is a complaint about a product or service," which performs closer to LLM prompts on nuanced categories.

Template selection research has identified several practical patterns:

- **Generic templates** such as "This example is {label}" work well across diverse topic sets but underperform on emotion or intent tasks.[1]
- **Domain-matched templates** such as "The intent of this utterance is {label}" can improve intent detection, plausibly because the hypothesis wording is closer to what the label actually asserts about the input.
- **Definitional templates** that expand the label name into a short description ("This text discusses electoral politics or government policy") behave similarly to providing class definitions in an LLM prompt, effectively bridging the two paradigms.
- **Ensemble templates** that average entailment scores across multiple phrasings of the same label reduce variance from template choice at the cost of additional inference calls.
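
Template ensembling, the last pattern in the list, reduces to averaging per-label scores. The sketch below assumes a hypothetical `entail_prob(premise, hypothesis)` callable that returns an entailment probability; canned numbers stand in for real model outputs.

```python
def ensemble_entailment(text, labels, templates, entail_prob):
    """Average per-label entailment probabilities over several templates.

    `entail_prob(premise, hypothesis)` is a hypothetical callable returning a
    probability in [0, 1]; each extra template costs one more call per label.
    """
    averaged = {}
    for label in labels:
        probs = [entail_prob(text, t.format(label)) for t in templates]
        averaged[label] = sum(probs) / len(probs)
    return averaged

templates = ["This example is {}.", "This text is about {}."]

# Canned probabilities standing in for real model outputs.
fake_outputs = {
    "This example is sports.": 0.9,
    "This text is about sports.": 0.7,
    "This example is politics.": 0.2,
    "This text is about politics.": 0.4,
}

averaged = ensemble_entailment(
    "The match went to extra time.",
    ["sports", "politics"],
    templates,
    lambda premise, hypothesis: fake_outputs[hypothesis],
)
```

With two templates and two labels this makes four scoring calls, so the cost grows as the product of template count and label count.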

For LLM prompts the design choices include the wording of the instruction, the order in which labels are listed, and whether definitions are included. Research on [prompt engineering](/wiki/prompt_engineering) for classification has found that label order can shift predictions by several points on borderline examples.[17] A 2026 study (arXiv:2602.04297) specifically examined "prompt underspecification" and found that ambiguous instructions, such as prompts that do not clarify whether to choose the single best label or all applicable labels, inflate variance across model versions and sizes.[17]

Multi-label classification adds another decision: in the NLI formulation, multi-label inference runs each label hypothesis independently and applies an entailment threshold, while single-label inference normalizes scores across labels with a softmax.[11] In LLM prompting, multi-label classification requires an explicit instruction such as "return all categories that apply, separated by commas."
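On the LLM side, a comma-separated answer still has to be mapped back to the label set. The helper below is a hypothetical sketch of that step; the completion string is a stand-in for a real model response, and dropping unknown labels is a design assumption rather than a required behavior.

```python
def parse_multi_label(completion, labels):
    """Split a comma-separated completion and keep only known labels, in order."""
    known = {label.lower(): label for label in labels}
    picked = []
    for part in completion.split(","):
        key = part.strip().strip(".").lower()
        if key in known and known[key] not in picked:
            picked.append(known[key])
    return picked

labels = ["harassment", "spam", "self-harm"]
# Stand-in completion; a real LLM response would be passed in here.
picked = parse_multi_label("Spam, harassment, phishing.", labels)
```

Here "phishing" is discarded because it is not in the candidate set; logging such off-list answers can be a useful signal that the taxonomy is missing a category.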

## Datasets and benchmarks

Most text zero-shot evaluation reuses topic, sentiment, and intent corpora from the supervised literature, splitting the label set so that some classes are held out. The most cited benchmark in the NLI lineage is the Yin et al. 2019 suite, which evaluates on topic, emotion, and situation datasets with explicit zero-shot splits.[1] IBM's TTC23, introduced in 2023, expanded coverage to 23 topical datasets and served as the primary benchmark for the EMNLP 2023 study comparing LLM zero-shot methods.[15] The 2026 BTZSC benchmark (arXiv:2603.11991) is the most comprehensive unified evaluation, covering 38 models across 22 datasets spanning sentiment, topic, intent, and emotion classification.[16]

| Dataset | Domain | Classes | Test size | Typical use |
|---------|--------|---------|-----------|-------------|
| Yahoo Answers Topics | Q&A forum | 10 | 60,000 | Topic zero-shot benchmark in Yin et al. 2019[1] |
| AG News | Newswire | 4 | 7,600 | News topic classification |
| DBpedia 14 | Wikipedia abstracts | 14 | 70,000 | Fine-grained topic classification |
| Emotion (CARER) | Twitter | 6 | 2,000 | Emotion classification |
| ISEAR | Self-report responses | 7 | 1,533 | Cross-cultural emotion classification |
| 20 Newsgroups | Newsgroup posts | 20 | 7,532 | Long-running topic baseline |
| BoolQ | Wikipedia passages | 2 | 3,270 | Yes-or-no probe |
| RTE | News and Wikipedia | 2 | 3,000 | Textual entailment in [SuperGLUE](/wiki/superglue) |
| SciCite | Citation contexts | 3 | 1,861 | Citation intent classification |
| Topical-Chat | Grounded dialogue | 8 | varies | Conversation topic detection |
| MultiNLI | Multi-genre pairs | 3 | 19,648 | Training data for NLI entailment models[1] |
| ANLI | Adversarial NLI | 3 | 3,200 | Stress test for entailment classifiers[8] |
| TTC23 | 23 topic domains | varies | varies | IBM 2023 topical LLM study[15] |
| BANKING77 | Banking intents | 77 | 3,080 | Fine-grained intent; large label set |

Holistic benchmarks such as MMLU, HELM, and BIG-bench provide indirect signals of LLM zero-shot competence through many multiple choice problems with unseen labels.

## Comparison of methods

The three paradigms have distinct cost and accuracy profiles. NLI-based classifiers are small and fast: a 0.4 billion parameter BART or DeBERTa NLI model can score on the order of a thousand premise-hypothesis pairs per second on a single GPU, which suits batch labeling and topic-style problems.[8] Entailment probabilities can be thresholded for selective prediction, although the threshold should be checked on validation data because calibration is often poor.

Prompted LLM classifiers offer the strongest accuracy on hard categories, including those that require multi-sentence definitions, contextual reasoning, or knowledge that a 400 million parameter encoder does not contain. Their compute cost is one to three orders of magnitude higher than NLI, which is hard to justify on simple problems where both reach similar accuracy, but it pays off on complex categories where a small NLI model fails outright. The 2024 study by Bucher and Martini (arXiv:2406.08660) measured a gap of roughly 10 to 25 percentage points in favor of fine-tuned encoders over zero-shot LLMs across diverse classification benchmarks, with the widest gaps on fine-grained intent sets and the narrowest on simple topic sets.[14]

Embedding similarity is the cheapest and is often used as a fast first stage. For 10,000 candidate labels and one input, dot products against precomputed label embeddings are faster than 10,000 NLI calls. Accuracy lags the other paradigms on standalone topic problems, but pure embedding scoring is competitive on concrete domains and pairs well with NLI or LLM reranking. The BTZSC 2026 benchmark identified a fourth architecture class, rerankers, which apply cross-attention scoring at inference time but are trained explicitly for ranking rather than NLI; these achieved the highest average accuracy across the 22 datasets.[16]

A practical guideline is to start with NLI for cheap labels, escalate to an LLM for difficult decisions, and use embeddings to scale to taxonomies with thousands of candidate labels.
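
One way to wire that guideline together is a cascade: embeddings shortlist candidates, NLI scores the shortlist, and an LLM is consulted only when the NLI scores are close. The sketch below is an assumption-laden illustration; all three callables, the shortlist size `k`, and the `margin` rule are hypothetical and would need tuning on real data.

```python
def cascade_classify(text, labels, embed_topk, nli_score, llm_label, k=5, margin=0.2):
    """Embeddings shortlist -> NLI scoring -> LLM only when NLI is unsure.

    All three callables are hypothetical stand-ins:
      embed_topk(text, labels, k) -> list of up to k candidate labels
      nli_score(text, label)      -> entailment probability in [0, 1]
      llm_label(text, candidates) -> one label chosen by an LLM
    """
    shortlist = embed_topk(text, labels, k)
    scored = sorted(((nli_score(text, lab), lab) for lab in shortlist), reverse=True)
    # Accept the NLI answer when it clearly beats the runner-up.
    if len(scored) == 1 or scored[0][0] - scored[1][0] >= margin:
        return scored[0][1], "nli"
    # Otherwise escalate the top two candidates to the LLM.
    return llm_label(text, [lab for _, lab in scored[:2]]), "llm"

# Toy stand-ins so the sketch runs without any model.
labels = [f"label_{i}" for i in range(1000)]

def toy_topk(text, labels, k):
    return labels[:k]

def confident_nli(text, label):
    return 0.9 if label == "label_3" else 0.1

def unsure_nli(text, label):
    return 0.5

def toy_llm(text, candidates):
    return candidates[0]

easy = cascade_classify("some text", labels, toy_topk, confident_nli, toy_llm)
hard = cascade_classify("some text", labels, toy_topk, unsure_nli, toy_llm)
```

The second return value records which stage made the decision, which makes it straightforward to measure how often the expensive LLM path is taken.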

| Paradigm | Parameter scale | Latency (relative) | Typical accuracy on easy topics | Typical accuracy on fine-grained intents | Supports multi-label | Cost at scale |
|---------|----------|-------|------------|-----------|------------|-----|
| NLI cross-encoder | 200M to 500M | Fast | High | Moderate | Yes (threshold per label) | Low |
| Embedding similarity | 100M to 500M | Very fast | Moderate | Low to moderate | Yes (threshold) | Very low |
| Reranker | 300M to 1B | Moderate | Very high | High | Yes | Low to moderate |
| Instruction-tuned LLM (4-12B) | 4B to 12B | Slow | High | High | Yes (with prompt) | Moderate to high |
| Large LLM (70B+) | 70B+ | Very slow | Very high | Very high | Yes (with prompt) | High |

## Generalized zero-shot classification

In the generalized variant, the test label set Z includes both labels seen during fine-tuning and labels that were never seen. This is harder than pure zero-shot because classifiers tend to assign higher confidence to seen labels and starve unseen labels of probability mass.[9] The text counterpart appears most often in intent detection, where a chatbot must recognize both trained intents and emerging intents discovered from logs. Approaches include rescaling NLI logits for unseen labels and combining NLI scores with a retrieval signal from labeled support sets.
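
The logit-rescaling idea can be illustrated with a constant offset. This is a minimal sketch under stated assumptions: the additive `boost` is one simple form of rescaling, its value of 1.0 is arbitrary, and the intent names and logits are invented for the example.

```python
import math

def rescale_for_unseen(logits, seen_labels, boost=1.0):
    """Add a constant to the entailment logit of every unseen label."""
    return {
        label: value + (0.0 if label in seen_labels else boost)
        for label, value in logits.items()
    }

def softmax_dict(logits):
    """Softmax over a dict of label -> logit."""
    peak = max(logits.values())
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Invented entailment logits; "dispute_charge" is the unseen intent.
logits = {"check_balance": 2.0, "reset_password": 1.2, "dispute_charge": 1.6}
seen = {"check_balance", "reset_password"}

before = softmax_dict(logits)
after = softmax_dict(rescale_for_unseen(logits, seen, boost=1.0))
```

In this toy case the boost flips the prediction from a seen intent to the unseen one; in practice the offset would be chosen on a validation set that contains both seen and unseen classes.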

## Multilingual zero-shot classification

NLI-based zero-shot classification extends naturally to multilingual settings through cross-lingual NLI models. Moritz Laurer's mDeBERTa-v3-base-xnli-multilingual-nli-2mil7, released in 2022, was trained on more than 2.7 million hypothesis-premise pairs across 27 languages and supports zero-shot classification in all 100 languages covered by the mDeBERTa-v3 pretraining corpus.[18] It achieves 87.1% accuracy on English MNLI and maintains strong performance on held-out languages through cross-lingual transfer, typically above 80% accuracy on the XNLI evaluation set.[18]

The multilingual pipeline is operationally identical to the monolingual one: the practitioner writes label hypotheses in English, the model scores them against input text in any supported language, and the entailment probabilities are returned.[18] This makes multilingual classification practical for organizations that do not have NLI training data in their target languages, as long as the languages are covered by the multilingual pretraining corpus.

LLM-based zero-shot classification is also strong multilingually for well-resourced languages, because large instruction-tuned models such as GPT-4 and Claude 3 are trained on multilingual corpora. Performance degrades on low-resource languages for both NLI and LLM approaches, because both depend on the pretraining distribution.

## Applications

Zero-shot classification is a default option in settings where labeled data is missing, expensive, or rapidly changing.

* Rapid prototyping. Engineers stand up a working classifier within minutes by listing labels and feeding them to a Hugging Face pipeline.[11]
* Customer support routing. Companies route tickets to billing, technical support, or account-change teams without collecting labeled tickets first.
* [Intent detection](/wiki/intent_detection) and discovery. Chatbots recognize emerging intents that were not part of the training distribution.
* Content moderation. Platforms flag policy categories such as harassment, self-harm, or spam, and update the label set as policies evolve. Google's ads moderation system uses cross-modal co-embeddings for zero-shot image policy classification, with the label hypotheses written as text descriptions of policy violations.
* News and document triage. Newsrooms classify incoming articles into shifting taxonomies of topics, geographies, and events.
* Compliance review. Legal teams classify contracts and communications against checklists of clause types and risk categories.
* Survey and feedback analysis. Open-ended responses are tagged with sentiment, topic, and actionability.
* Multilingual screening. Multilingual NLI checkpoints such as mDeBERTa classify text across languages using English label names.[18]
* Dynamic taxonomy management. Platforms that continuously add or retire product categories, moderation policies, or ontology nodes can update the label set at inference time with no retraining cycle.

## Limitations

Zero-shot classification has well-documented failure modes. Hypothesis sensitivity means accuracy varies with template wording, so practitioners must test multiple templates on a small validation set before trusting a deployment.[1][13] Calibration is often poor, especially for multi-label settings where decisions depend on absolute scores rather than relative ranks. Performance degrades on abstract, technical, or domain-specific categories that have little support in the NLI or LLM pretraining data.

Label phrasing ambiguity is a specific failure mode distinct from template sensitivity. When two candidate labels overlap in meaning ("complaint" and "negative feedback") the model may split probability mass between them unpredictably. Explicit disambiguation through definitional templates or LLM prompts with definitions mitigates this, at the cost of requiring practitioners to write those definitions.

Language coverage is uneven; most public NLI checkpoints are trained on English corpora such as MultiNLI and ANLI.[8] Pretrained NLI models also inherit dataset artifacts such as hypothesis-only biases. LLM classifiers can be expensive at high request volumes, and outputs may drift across model versions, which complicates reproducibility.[17] Evaluation itself is challenging because conventional metrics assume fixed label sets and matched train-test distributions, neither of which holds in zero-shot settings.[1]

Zero-shot classification is rarely competitive with a well-resourced supervised classifier when sufficient labeled data exists. The 2024 study by Bucher and Martini found that fine-tuned small LLMs outperformed zero-shot generative models by a large margin across diverse classification benchmarks, reinforcing the standard workflow: use zero-shot predictions to bootstrap annotation, then move to a fine-tuned encoder once data has accumulated.[14] The BTZSC 2026 benchmark similarly found that NLI cross-encoders show diminishing returns with scale, suggesting that simply making the NLI model larger does not close the gap with supervised approaches.[16]

## See also

- [Zero-Shot Learning](/wiki/zero-shot_learning)
- [Natural Language Inference](/wiki/natural_language_inference)
- [Text Classification Models](/wiki/text_classification_models)
- [Few-Shot Learning](/wiki/few-shot_learning)
- [Prompt Engineering](/wiki/prompt_engineering)
- [BART](/wiki/bart)
- [DeBERTa](/wiki/deberta)
- [Large Language Model](/wiki/large_language_model)
- [Sentence-BERT](/wiki/sentence-bert)
- [Intent Detection](/wiki/intent_detection)
- [Zero-Shot Image Classification Models](/wiki/zero-shot_image_classification_models)

## References

[1] Yin, Wenpeng, Jamaal Hay, and Dan Roth. "Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach." EMNLP-IJCNLP 2019, arXiv:1909.00161. https://arxiv.org/abs/1909.00161
[2] Palatucci, Mark, Dean Pomerleau, Geoffrey Hinton, and Tom Mitchell. "Zero-shot Learning with Semantic Output Codes." NeurIPS 2009. https://proceedings.neurips.cc/paper/2009/hash/1543843a4723ed2ab08e18053ae6dc5b-Abstract.html
[3] Larochelle, Hugo, Dumitru Erhan, and Yoshua Bengio. "Zero-data Learning of New Tasks." AAAI 2008. https://www.aaai.org/Papers/AAAI/2008/AAAI08-103.pdf
[4] Brown, Tom B., et al. "Language Models are Few-Shot Learners." NeurIPS 2020, arXiv:2005.14165. https://arxiv.org/abs/2005.14165
[5] Chung, Hyung Won, et al. "Scaling Instruction-Finetuned Language Models." 2022, arXiv:2210.11416. https://arxiv.org/abs/2210.11416
[6] Halder, Kishaloy, Alan Akbik, Josip Krapac, and Roland Vollgraf. "Task-Aware Representation of Sentences for Generic Text Classification." COLING 2020. https://aclanthology.org/2020.coling-main.285/
[7] Reimers, Nils, and Iryna Gurevych. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP-IJCNLP 2019, arXiv:1908.10084. https://arxiv.org/abs/1908.10084
[8] Laurer, Moritz, et al. "Less Annotating, More Classifying: Addressing the Data Scarcity Issue of Supervised Machine Learning with Deep Transfer Learning and BERT-NLI." 2022. https://huggingface.co/MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli
[9] Xian, Yongqin, Bernt Schiele, and Zeynep Akata. "Zero-Shot Learning: The Good, the Bad and the Ugly." CVPR 2017. https://openaccess.thecvf.com/content_cvpr_2017/papers/Xian_Zero-Shot_Learning_-_CVPR_2017_paper.pdf
[10] Lewis, Mike, et al. "BART: Denoising Sequence-to-Sequence Pre-training." 2019, arXiv:1910.13461. https://arxiv.org/abs/1910.13461
[11] Hugging Face Transformers. "Zero-Shot Classification Pipeline." https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.ZeroShotClassificationPipeline
[12] Hugging Face model card, facebook/bart-large-mnli. https://huggingface.co/facebook/bart-large-mnli
[13] Davison, Joe. "Zero-Shot Learning in Modern NLP." Hugging Face blog, 2020. https://joeddav.github.io/blog/2020/05/29/ZSL.html
[14] Bucher, Martin Juan José, and Marco Martini. "Fine-Tuned 'Small' LLMs (Still) Significantly Outperform Zero-Shot Generative AI Models in Text Classification." 2024, arXiv:2406.08660. https://arxiv.org/abs/2406.08660
[15] Gretz, Shai, et al. "Zero-shot Topical Text Classification with LLMs: an Experimental Study." Findings of EMNLP 2023. https://aclanthology.org/2023.findings-emnlp.647/
[16] Aarab, Ilias, et al. "BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs." arXiv:2603.11991. ICLR 2026. https://arxiv.org/abs/2603.11991
[17] "Revisiting Prompt Sensitivity in Large Language Models for Text Classification." 2026, arXiv:2602.04297. https://arxiv.org/abs/2602.04297
[18] Laurer, Moritz. mDeBERTa-v3-base-xnli-multilingual-nli-2mil7. Hugging Face model card. 2022. https://huggingface.co/MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7