# Natural language inference (NLI)

> Source: https://aiwiki.ai/wiki/natural_language_inference
> Updated: 2026-06-23
> Categories: AI Benchmarks, Natural Language Processing, Reasoning Models
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Natural language inference (NLI)**, also known as **recognising textual entailment (RTE)**, is the [natural language processing](/wiki/natural_language_processing) task of deciding whether a hypothesis sentence is *entailed* by, *contradicts*, or is *neutral* toward a premise sentence. Given a pair of natural-language sentences (premise *P*, hypothesis *H*), an NLI system outputs one of three labels: *entailment* (H follows from P), *contradiction* (H is incompatible with P), or *neutral* (the premise gives no clear evidence either way). The task was crystallised in its modern form by Dagan, Glickman and Magnini in the 2006 PASCAL Recognising Textual Entailment Challenge [3] and was scaled into the deep-learning era by the Stanford Natural Language Inference (SNLI) corpus, a set of 570,152 human-written English sentence pairs released by Bowman, Angeli, Potts and Manning in 2015 [1].

NLI sits at the centre of natural language understanding evaluation. Whereas surface-level tasks such as part-of-speech tagging or named-entity recognition can be solved with mostly local features, deciding whether one sentence entails another forces a model to handle synonymy, paraphrase, lexical entailment relations, negation, quantifiers, coreference, and a healthy amount of world knowledge. That is why NLI shows up as a sub-task in nearly every general language understanding benchmark, from the [GLUE benchmark](/wiki/glue_benchmark) and [SuperGLUE](/wiki/superglue) through to multilingual suites like XNLI and XTREME. Three of GLUE's nine tasks are explicitly NLI (MNLI, RTE, WNLI), and a fourth (QNLI) was reformulated as entailment [8].

## What is natural language inference?

The formal version of the task is straightforward. The model receives a pair `(premise, hypothesis)` and outputs a label from `{entailment, contradiction, neutral}`. SNLI, MultiNLI, ANLI and XNLI all use this three-way label set. Some earlier and more specialised datasets use a different label space: RTE-1 through RTE-3 and SciTail collapse it to two classes (entailment vs. not-entailment, or entails vs. neutral), while FEVER uses its own three-way scheme (supports vs. refutes vs. not-enough-info). The classic three-way examples from the SNLI paper illustrate what each label means in practice.

| Label | Premise | Hypothesis | Why |
|---|---|---|---|
| Entailment | A man inspects the uniform of a figure. | The man is inspecting his uniform. | Inspecting a uniform that belongs to a figure he is associated with implies inspecting his uniform in the everyday reading of the sentence. |
| Contradiction | A man is playing guitar. | The man is sleeping. | Playing guitar and sleeping cannot both be true of the same man at the same time. |
| Neutral | A boy is jumping on a skateboard in the middle of a red bridge. | The boy does a skateboarding trick. | The premise is consistent with a trick but does not state that one is being performed. |

The definition of entailment used in NLI corpora is loose by design. It is the everyday, common-sense reading rather than strict logical entailment. Bowman and colleagues describe SNLI labelling instructions in terms of a likely description of the same scene, which means models are expected to make the kind of pragmatic inferences a fluent reader would make, not the airtight deductions a theorem prover would [1]. Estimated human accuracy on SNLI, measured as the average agreement of five annotators with the consensus gold label, is 87.7% on the test set [1].
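The consensus-label convention can be made concrete with a short sketch. It assumes five annotator labels per pair and treats a label as gold when at least three annotators chose it; the function names and the toy annotations are invented for illustration and are not taken from the SNLI release.

```python
from collections import Counter

LABELS = ("entailment", "contradiction", "neutral")


def gold_label(annotations, min_votes=3):
    """Return the label chosen by at least `min_votes` annotators, else None."""
    label, votes = Counter(annotations).most_common(1)[0]
    return label if votes >= min_votes else None


def annotator_agreement(pairs):
    """Fraction of individual labels that match the consensus gold label.

    Pairs without a consensus label are skipped.
    """
    matches = total = 0
    for annotations in pairs:
        gold = gold_label(annotations)
        if gold is None:
            continue
        matches += sum(a == gold for a in annotations)
        total += len(annotations)
    return matches / total if total else 0.0


# Toy annotations (invented for illustration).
toy = [
    ["entailment"] * 5,
    ["neutral", "neutral", "neutral", "entailment", "contradiction"],
    ["neutral", "neutral", "entailment", "entailment", "contradiction"],  # no consensus
]
print(annotator_agreement(toy))
```

On real data the same computation would run over the per-annotator label columns of the validated pairs.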

## Why does NLI matter?

NLI is a strong proxy for general language understanding because so many other tasks reduce to it. Question answering can be cast as: does the passage entail the candidate answer when slotted into a question template? Fact verification asks whether a knowledge base supports or refutes a claim, which is the same labelling problem with a different name. Summarisation evaluation increasingly uses NLI to check whether a summary is entailed by its source document, surfacing hallucinations. Retrieval-augmented generation (RAG) pipelines use NLI to filter or score retrieved evidence. Yin, Hay and Roth's 2019 paper "Benchmarking Zero-shot Text Classification" showed that you can even cast topic classification, emotion classification and many other label-prediction problems as entailment by templating the candidate label into a hypothesis like "This text is about politics." [18] That formulation underlies the popular `zero-shot-classification` pipeline in the Hugging Face `transformers` library, which loads a [BART](/wiki/bart) model fine-tuned on MultiNLI by default [21].

In short, if a model can do NLI well, it can be repurposed for a long list of downstream classification problems with no further training.

## When was NLI created? A short history

The lineage of NLI starts before the deep-learning era and stretches back to formal-semantics work in computational linguistics.

**FraCaS test suite (Cooper et al. 1996).** The European FraCaS project (Framework for Computational Semantics) produced a small, hand-crafted suite of 346 inference problems, each consisting of one or more premises followed by a yes/no/unknown question [4]. The examples were designed to probe specific phenomena like generalised quantifiers, plurals, anaphora and tense. FraCaS predates RTE by a decade and remains a useful diagnostic for compositional inference, especially for symbolic systems.

**PASCAL RTE Challenges (Dagan, Glickman & Magnini 2006).** The PASCAL Network of Excellence ran a series of Recognising Textual Entailment challenges starting with RTE-1 in 2005. Each challenge released a few hundred premise-hypothesis pairs drawn from real applications such as information extraction, question answering and summarisation, with binary entailment labels [3]. Seventeen teams submitted to RTE-1, and the series ran through RTE-7. The PASCAL RTE work established "is hypothesis H entailed by text T?" as a competitive benchmark and gave the field its name (RTE), which is still used interchangeably with NLI.

**SNLI (Bowman, Angeli, Potts & Manning 2015).** SNLI was the breakthrough that made data-hungry neural models viable for entailment. The authors collected 570,152 English sentence pairs by showing crowd workers a Flickr30k image caption (the premise) and asking them to write three new captions: one entailed, one contradicted, one neutral [1]. The result was a dataset roughly two orders of magnitude larger than anything that had come before, and 98% of the validation pairs had at least three of five independent raters agreeing on the label [1]. SNLI was published at EMNLP 2015 (arXiv:1508.05326).

**MultiNLI (Williams, Nangia & Bowman 2018).** MultiNLI extended SNLI's recipe across ten genres of written and spoken English, including fiction, letters, telephone speech, travel guides, government documents and the 9/11 Commission report [2]. The result was 433k pairs with two evaluation sets: "matched" (drawn from genres seen at training time) and "mismatched" (drawn from held-out genres). MultiNLI was published at NAACL 2018 (arXiv:1704.05426).

**Specialised datasets.** SciTail (Khot, Sabharwal & Clark, AAAI 2018) was built from real science-exam questions and web sentences, producing 27k pairs labelled `entails` or `neutral` [5]. MedNLI (Romanov & Shivade, EMNLP 2018) drew premises from the MIMIC-III clinical-notes corpus and used physician annotators to label 14k pairs in the medical domain [6].

**Adversarial NLI (Nie, Williams, Dinan, Bansal, Weston & Kiela, ACL 2020).** ANLI was collected via a human-and-model-in-the-loop procedure. Annotators wrote hypotheses that fooled a current state-of-the-art NLI model, then those examples became the training set for a stronger model, which was then attacked again. Three rounds (R1, R2, R3) gave a dataset of 162,865 examples that progressively defeats successive generations of models [7]. ANLI is built specifically to push past the saturation point of SNLI and MultiNLI.

**XNLI (Conneau et al. 2018).** XNLI translated MultiNLI's development and test sets into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu [10]. The training set is still English MultiNLI, so XNLI tests cross-lingual transfer rather than monolingual modelling. It became the standard benchmark for multilingual sentence encoders.

## Standard datasets

| Dataset | Year | Size | Source domain | Labels | Notable |
|---|---|---|---|---|---|
| FraCaS | 1996 | 346 problems | Hand-crafted, formal semantics | yes/no/unknown | Tests compositional phenomena (quantifiers, plurals, anaphora) |
| RTE-1 to RTE-7 | 2005 to 2011 | A few hundred to a few thousand pairs per round | News, IE, QA outputs | Binary (entailment / not) | Gave RTE its name; sub-task in [GLUE benchmark](/wiki/glue_benchmark) and [SuperGLUE](/wiki/superglue) |
| SNLI | 2015 | 570,152 pairs | Flickr30k captions | entailment, contradiction, neutral | First dataset large enough for deep neural training |
| MultiNLI | 2018 | 433k pairs | 10 genres of written and spoken English | entailment, contradiction, neutral | Matched and mismatched test sets for cross-genre evaluation |
| SciTail | 2018 | 27k pairs (10k entails, 17k neutral) | Science exam questions and web text | entails, neutral | First NLI dataset built entirely from naturally occurring sentences |
| MedNLI | 2018 | 14k pairs | MIMIC-III clinical notes | entailment, contradiction, neutral | Annotated by physicians; clinical-domain probe |
| FEVER | 2018 | 185k claims | Wikipedia, with evidence sentences | Supports, Refutes, NotEnoughInfo | Couples claim verification with retrieval of evidence |
| ANLI | 2020 | 162,865 pairs across R1, R2, R3 | Wikipedia and other corpora | entailment, contradiction, neutral | Iteratively built to defeat current models |
| XNLI | 2018 | 7.5k dev/test pairs in each of 14 languages | MultiNLI translated | entailment, contradiction, neutral | Standard cross-lingual NLI benchmark |
| QNLI (in [GLUE benchmark](/wiki/glue_benchmark)) | 2018 | ~110k pairs | Derived from SQuAD | entailment, not_entailment | Question / sentence pairs reformulated as entailment |
| WNLI (in [GLUE benchmark](/wiki/glue_benchmark)) | 2018 | 634 train / 71 dev pairs | Winograd Schema Challenge | entailment, not_entailment | Tiny but coreference-focused; notoriously hard to beat the majority-class baseline |
| CB (in [SuperGLUE](/wiki/superglue)) | 2019 | 250 train / 56 val pairs | CommitmentBank (embedded clauses) | entailment, contradiction, neutral | Probes inference over embedded clauses with reporting verbs |

## How have NLI systems evolved?

The history of NLI methods tracks the broader history of NLP. Early systems leaned on lexical and syntactic alignment with hand-crafted features. The arrival of SNLI made end-to-end neural models practical. Pre-trained [transformer](/wiki/bert)-based models then collapsed most of the gap to human performance on the in-domain corpora. Large language models (LLMs) handle in-distribution NLI in zero shot, although adversarial benchmarks remain difficult.

| Era | Representative methods | What they did | SNLI test accuracy (approx.) |
|---|---|---|---|
| Pre-deep-learning (RTE era) | Lexical overlap, alignment, hand-crafted features (e.g., MacCartney & Manning's NatLog) | Compute alignment between hypothesis and premise tokens, with rule-based handling of negation and quantifier monotonicity | Not evaluated on SNLI; binary accuracies on RTE-1/2/3 were mostly in the high 50s to low 60s, with the best RTE-3 systems reaching about 80% |
| Early neural | LSTM-with-attention (Rocktäschel et al. 2016), LSTM-based reading models (Cheng et al. 2016) | Encode premise and hypothesis with LSTMs, attend across them | Around 83% on SNLI [16] |
| Decomposable Attention (Parikh, Täckström, Das & Uszkoreit, EMNLP 2016) | Cross-attention between premise and hypothesis, no LSTM | Showed that attention alone, without word-order modelling, can do NLI | 86.3% on SNLI [14] |
| ESIM (Chen, Zhu, Ling, Wei, Jiang & Inkpen, ACL 2017) | BiLSTMs with explicit local inference and composition layers | Strong sequence model that beat decomposable attention | 88.0% on SNLI for ESIM alone; 88.6% when ensembled with a syntactic tree-LSTM variant [15] |
| Pre-trained transformers (since 2018) | [BERT](/wiki/bert), [RoBERTa](/wiki/roberta), [DeBERTa](/wiki/deberta), ALBERT, ELECTRA | Fine-tune a self-supervised transformer on the NLI training set | Low 90s on SNLI; roughly 84 to 92% on MultiNLI matched, depending on model family and size |
| Large LLMs (since 2020) | GPT-3, GPT-4, Claude, Gemini | Zero- or few-shot prompting; no task-specific fine-tuning required for in-distribution NLI | Strong on SNLI/MNLI in zero shot; mixed on ANLI |

The transition from feature-engineered systems to neural networks happened almost entirely on SNLI's back. By the time MultiNLI was released, ESIM and decomposable attention were the standard baselines. By late 2018, [BERT](/wiki/bert) had reached roughly 86 to 87% on MultiNLI matched [11], and successor models such as [RoBERTa](/wiki/roberta) (Liu et al. 2019) [12] and [DeBERTa](/wiki/deberta) (He et al. 2021) [13] pushed past 90%.

## How accurate are NLI systems?

Numbers in NLI papers vary by version of the test set and by whether the model is fine-tuned or evaluated zero-shot. The figures below come from the corresponding original papers and benchmark leaderboards as last published.

| Benchmark | Human accuracy | Best reported model accuracy | Notes |
|---|---|---|---|
| SNLI test | 87.7% (Bowman et al. 2015) [1] | 93.1% (EFL + RoBERTa-large, Wang et al. 2021) | Results have been close to saturation for several years |
| MultiNLI matched | ~92% (Williams, Nangia & Bowman 2018) [2] | ~91 to 92% for DeBERTa-v3-large; the largest models on the GLUE leaderboard score in the low 90s | Mismatched scores are typically within 1 point |
| ANLI R1 | ~92% on R1 according to Nie et al. 2020 [7] | ~73.8% (RoBERTa-large) [7] | Fine-tuned models trained on SNLI+MNLI+FEVER+ANLI |
| ANLI R2 | similar human range | ~48.9% (RoBERTa-large) [7] | R2 is harder by construction |
| ANLI R3 | similar human range | ~44.4% (RoBERTa-large) [7] | Still well below human accuracy |
| GLUE MNLI | 92.0% human [8] | ~91 to 92% (DeBERTa-V3-large) on the matched set, with the largest leaderboard models only slightly higher | On the related [SuperGLUE](/wiki/superglue) leaderboard, the 1.5B-parameter DeBERTa was the first model reported to surpass the human baseline, in January 2021: 89.9 for a single model and 90.3 for an ensemble, versus 89.8 for humans [13] |
| FEVER (label only) | not reported by authors | 50.91% labelling without evidence; 31.87% labelling plus correct evidence at release [9] | Two-stage retrieve-then-classify pipelines have improved significantly since 2018 |

A caveat about LLMs: as Brown et al. (2020) reported, GPT-3 underperformed on ANLI compared to fine-tuned smaller transformers. Subsequent LLM technical reports (GPT-4, Claude 3, Gemini) have largely stopped reporting NLI metrics, partly because the in-distribution corpora are saturated and partly because adversarial corpora like ANLI are not the headline benchmark they once were.

## How is NLI included in major benchmark suites?

NLI is rarely evaluated in isolation any more. It is normally one component of a multi-task suite.

- **[GLUE benchmark](/wiki/glue_benchmark)** (Wang, Singh, Michael, Hill, Levy & Bowman, BlackboxNLP 2018; later ICLR 2019) bundles MNLI, RTE, QNLI and WNLI alongside other tasks like CoLA, SST-2, MRPC, STS-B and QQP [8]. Three of nine GLUE tasks are explicitly NLI, and a fourth (QNLI) was reformulated as entailment from SQuAD.
- **[SuperGLUE](/wiki/superglue)** (Wang, Pruksachatkun, Nangia, Singh, Michael, Hill, Levy & Bowman, NeurIPS 2019) was created when models started saturating GLUE. SuperGLUE keeps RTE and adds CommitmentBank (CB), which is also a small NLI corpus [19].
- **XTREME** (Hu et al. 2020) and **XGLUE** include XNLI for cross-lingual evaluation.
- **HELM** (Liang et al. 2022) and various LLM evaluation harnesses include ANLI as part of the reasoning suite.

## What is NLI used for beyond benchmarking?

NLI's three-way label is so general that it has been pressed into service across many parts of the NLP stack.

**Zero-shot text classification.** This is the most popular practical use of NLI today. The Hugging Face `zero-shot-classification` pipeline takes a piece of text and a list of candidate labels, builds hypotheses of the form `"This example is {label}."`, runs them through an NLI model (BART-large fine-tuned on MultiNLI by default), and ranks the labels by entailment probability [21]. Yin, Hay and Roth, who popularised the trick at EMNLP 2019, frame it as treating "the text to be labeled as the premise and label text strings as the hypothesis" [18].
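The mechanics can be sketched without any model at all. In the snippet below the NLI model is replaced by `toy_entailment_logit`, a hypothetical word-overlap scorer that exists only so the example runs; normalising the entailment logits with a softmax across candidate labels is one plausible way to obtain a ranking and is an assumption here, not a description of any particular library's internals.

```python
import math

HYPOTHESIS_TEMPLATE = "This example is {}."


def build_hypotheses(labels, template=HYPOTHESIS_TEMPLATE):
    """Turn each candidate label into a hypothesis sentence."""
    return [template.format(label) for label in labels]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(text, labels, entailment_logit):
    """Rank labels by the entailment score of (text, templated hypothesis)."""
    logits = [entailment_logit(text, h) for h in build_hypotheses(labels)]
    scores = softmax(logits)
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)


def _words(sentence):
    return {w.strip(".,!?").lower() for w in sentence.split()}


def toy_entailment_logit(premise, hypothesis):
    """Hypothetical stand-in for an NLI model: counts shared words."""
    return float(len(_words(premise) & _words(hypothesis)))


ranking = zero_shot_classify(
    "The election results reshaped national politics.",
    ["politics", "sports", "cooking"],
    toy_entailment_logit,
)
print(ranking[0][0])
```

Swapping `toy_entailment_logit` for a function that returns the entailment logit of a real NLI checkpoint leaves the rest of the flow unchanged.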

**Question answering as entailment.** Many extractive and multiple-choice QA setups can be reframed as: does the passage entail the candidate answer? QNLI, derived from SQuAD, is a clean instance of this idea.

**Fact verification.** FEVER and its successors (FEVEROUS, MultiFC) frame claim checking as a retrieval step followed by an NLI step over the retrieved evidence [9].

**Hallucination detection.** Take the model's output as the hypothesis and the source document as the premise; if an NLI model says `not entailed`, flag the output as a likely hallucination. This is the core idea behind summarisation-faithfulness metrics like SummaC (Laban et al. 2022), FactCC (Kryscinski et al. 2020) and QAFactEval (Fabbri et al. 2022).
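A minimal sketch of this check follows. It is not a reimplementation of SummaC, FactCC or QAFactEval; the sentence splitter is deliberately naive, and `toy_nli_label` is a hypothetical stand-in for a real NLI model so that the example runs on its own.

```python
import re


def split_sentences(text):
    """Naive splitter on ., ! and ? (an assumption; real systems use better ones)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def flag_unsupported(source, output, nli_label):
    """Return output sentences that the NLI function does not label as entailed."""
    return [s for s in split_sentences(output) if nli_label(source, s) != "entailment"]


def _words(sentence):
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def toy_nli_label(premise, hypothesis):
    """Hypothetical stand-in: 'entailment' only if every hypothesis word is in the premise."""
    return "entailment" if _words(hypothesis) <= _words(premise) else "neutral"


source = (
    "The company reported revenue of 10 million dollars in 2023. "
    "It opened two new offices."
)
summary = "The company reported revenue of 10 million dollars. It opened five new offices."
flagged = flag_unsupported(source, summary, toy_nli_label)
print(flagged)
```

Checking one output sentence at a time keeps each hypothesis short, which is usually closer to the sentence-pair setting NLI models are trained on.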

**Retrieval reranking.** Some retrieval pipelines use an NLI scorer to demote retrieved passages that contradict the query.

**Summarisation evaluation.** Beyond hallucination detection, NLI scores feed into evaluation suites that grade summaries on factual consistency rather than just ROUGE overlap.

## What are the known biases in NLI datasets?

The NLI community spent the second half of the 2010s discovering that big neural NLI models were exploiting shortcuts more than they were doing real inference.

**Annotation artefacts.** Gururangan et al.'s 2018 paper "Annotation Artifacts in Natural Language Inference Data" showed that a hypothesis-only classifier (one that never sees the premise) reaches well above the 33% chance baseline. The authors report: "we show that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI." [17] A parallel study by Poliak et al. (2018), "Hypothesis Only Baselines in Natural Language Inference," reached the same conclusion across a range of NLI datasets, arguing that ignoring the premise should be "a degenerate solution" yet often is not [22]. The reason is that crowd-workers writing entailing, neutral and contradicting hypotheses tend to use systematic patterns: contradictions often contain explicit negation; entailments hedge with general words like "some" or "animal"; neutrals add specific extra detail. Models pick up these surface patterns and look much smarter than they are.
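The hypothesis-only idea can be illustrated with a tiny bag-of-words classifier. The cited papers used different models; the naive Bayes classifier below is a swapped-in stand-in, and the six training hypotheses are invented to mimic the surface patterns described above.

```python
import math
from collections import Counter, defaultdict


def train_hypothesis_only(examples):
    """Fit per-label word counts from (hypothesis, label) pairs; premises are never used."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for hypothesis, label in examples:
        label_counts[label] += 1
        word_counts[label].update(hypothesis.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def predict_hypothesis_only(model, hypothesis):
    """Naive Bayes with add-one smoothing over hypothesis tokens."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for w in hypothesis.lower().split():
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy hypotheses (invented): negation, generic words, extra detail.
toy_train = [
    ("nobody is outside", "contradiction"),
    ("the man is not awake", "contradiction"),
    ("some people are outdoors", "entailment"),
    ("an animal is moving", "entailment"),
    ("a tall woman waits for her brother", "neutral"),
    ("the dog is chasing a red ball for fun", "neutral"),
]
model = train_hypothesis_only(toy_train)
print(predict_hypothesis_only(model, "the child is not playing"))
```

Because the classifier never sees a premise, any accuracy above chance on a real corpus can only come from regularities in how the hypotheses were written.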

**Syntactic heuristics.** McCoy, Pavlick and Linzen's 2019 paper "Right for the Wrong Reasons" introduced HANS (Heuristic Analysis for NLI Systems), a controlled diagnostic that probes three syntactic heuristics: lexical overlap, subsequence and constituent [20]. BERT trained on MultiNLI failed badly on HANS, in many cases scoring close to 0% on the non-entailment examples, showing that strong in-domain accuracy did not imply real syntactic understanding [20].
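Two of these heuristics are simple enough to write down directly. The sketch below implements lexical overlap and (contiguous) subsequence as stand-alone predictors and applies them to HANS-style sentence pairs whose correct label is non-entailment; the function names are illustrative.

```python
import re


def tokens(sentence):
    return re.findall(r"[a-z]+", sentence.lower())


def lexical_overlap_heuristic(premise, hypothesis):
    """Predict entailment whenever every hypothesis word also appears in the premise."""
    covered = set(tokens(hypothesis)) <= set(tokens(premise))
    return "entailment" if covered else "non-entailment"


def subsequence_heuristic(premise, hypothesis):
    """Predict entailment whenever the hypothesis is a contiguous run of premise words."""
    p, h = tokens(premise), tokens(hypothesis)
    found = any(p[i:i + len(h)] == h for i in range(len(p) - len(h) + 1))
    return "entailment" if found else "non-entailment"


# Both pairs are non-entailment, yet each heuristic predicts entailment.
print(lexical_overlap_heuristic("The doctor was paid by the actor.",
                                "The doctor paid the actor."))
print(subsequence_heuristic("The doctor near the actor danced.",
                            "The actor danced."))
```

A model that has internalised either shortcut will look accurate on ordinary test sets, where high overlap usually does coincide with entailment, and fail on pairs like these.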

**Adversarial collection.** ANLI was the community's response. By having human writers craft examples specifically to fool the current best model, Nie et al. produced a dataset where standard transformers struggle. Round 3 accuracies for fine-tuned RoBERTa-large variants sit around 44% [7], far below human accuracy of about 92%.

**Calibration and uncertainty.** NLI models are often over-confident. Recent work has explored confidence calibration, abstention, and the use of NLI as a soft-evidence scorer rather than a hard classifier, especially in safety-critical settings like medical entailment.
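As a rough illustration, the sketch below computes expected calibration error (a common calibration metric, chosen here as an example rather than one named by the sources above) and applies a simple confidence threshold for abstention. The bin count, the threshold and the toy numbers are all assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / n * abs(avg_conf - accuracy)
    return ece


def predict_or_abstain(probs, threshold=0.8):
    """Return the top label, or 'abstain' when its probability is below the threshold."""
    label = max(probs, key=probs.get)
    return label if probs[label] >= threshold else "abstain"


# Toy over-confident model: 95% confidence but only half the answers are right.
ece = expected_calibration_error([0.95] * 4, [True, True, False, False])
print(round(ece, 2))
print(predict_or_abstain({"entailment": 0.5, "neutral": 0.3, "contradiction": 0.2}))
```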

## Is NLI still relevant in the LLM era?

Three threads keep NLI alive in the LLM era.

First, large language models like GPT-4, Claude and Gemini do well on in-distribution NLI in zero shot, but they remain uneven on adversarial and specialised corpora. Analyses such as the 2024 paper "Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models" have argued that NLI is a useful diagnostic for what LLMs actually understand, even when standard NLI accuracy is no longer the headline number in model release papers.

Second, NLI is a workhorse for evaluation pipelines. Factual-consistency checks for summarisation, RAG hallucination detection, claim verification, and model-output verification all rely on NLI under the hood. Zero-shot classification in production systems still depends on NLI-fine-tuned encoder models such as `facebook/bart-large-mnli` because they are small, fast, and reliable [21].

Third, cross-lingual and multilingual NLI is an active area. XNLI is still the default benchmark for cross-lingual sentence understanding, and recent multilingual encoders like XLM-R, mDeBERTa-v3 and the multilingual variants of LaBSE and SBERT are evaluated on it routinely.

A quieter trend is the resurgence of logical and rule-based hybrids. Compositional generalisation work (for example, NaturalLogic-style monotonicity reasoners) has come back into fashion as people realise that neural NLI systems still struggle with quantifiers, negation scope and downward-entailing contexts. Hybrid systems combining symbolic monotonicity calculus with neural representations have been competitive on FraCaS-style benchmarks.

## Implementations

Most practitioners do not implement NLI from scratch. The dominant tooling is:

- **Hugging Face `transformers`.** The `pipeline("zero-shot-classification")` API uses `facebook/bart-large-mnli` by default and is the simplest way to apply NLI in a one-liner [21]. There are dozens of pre-fine-tuned NLI checkpoints on the Hub (`roberta-large-mnli`, `microsoft/deberta-v2-xxlarge-mnli`, `cross-encoder/nli-deberta-v3-large`, multilingual `MoritzLaurer/mDeBERTa-v3-base-mnli-xnli`, and more).
- **`sentence-transformers`.** Cross-encoder NLI models packaged as sentence-pair classifiers, useful when you want a single softmax over the three labels.
- **AllenNLP.** The original ESIM and decomposable-attention reference implementations live here, and AllenNLP's `pretrained` interface still serves them.
- **PyTorch and TensorFlow Hub.** Model files for older NLI baselines are still hosted, although they are mostly of historical interest now that transformer fine-tunes outperform them.
- **Datasets libraries.** `datasets.load_dataset("snli")`, `"multi_nli"`, `"anli"`, `"xnli"`, `"scitail"` and `"fever"` cover the major corpora with consistent splits; the canonical SNLI release and splits are distributed by the Stanford NLP Group [23].
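A short usage sketch for the first option is given below. It assumes `transformers` and `torch` are installed and that the named checkpoints can be downloaded; the two wrapper functions are illustrative names, and the heavy imports sit inside them so that defining the functions costs nothing.

```python
HYPOTHESIS_TEMPLATE = "This example is {}."


def classify_zero_shot(text, candidate_labels, model_name="facebook/bart-large-mnli"):
    """Rank candidate labels for `text` with the zero-shot-classification pipeline."""
    from transformers import pipeline  # lazy import; downloads the model on first use

    classifier = pipeline("zero-shot-classification", model=model_name)
    result = classifier(
        text,
        candidate_labels=candidate_labels,
        hypothesis_template=HYPOTHESIS_TEMPLATE,
    )
    return list(zip(result["labels"], result["scores"]))


def nli_probabilities(premise, hypothesis, model_name="roberta-large-mnli"):
    """Score one premise-hypothesis pair and return a {label: probability} dict."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    probs = torch.softmax(logits, dim=-1).tolist()
    # Label order differs between checkpoints, so read it from the model config.
    return {model.config.id2label[i]: p for i, p in enumerate(probs)}
```

For example, `classify_zero_shot("The match went to penalties.", ["sports", "politics"])` would return the labels paired with their scores, highest first. Reading label names from `id2label` avoids hard-coding an index order that varies across NLI checkpoints.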

A typical end-to-end flow looks like: load `multi_nli` from `datasets`, fine-tune `roberta-large` for three epochs with a batch size of 32 and a learning rate of 1e-5, evaluate on matched and mismatched dev sets, then optionally test transfer to ANLI and XNLI.
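That flow might look roughly like the sketch below, which assumes recent versions of `datasets`, `transformers`, `torch` and `numpy`. The hyperparameters are the ones quoted above; `fine_tune`, `output_dir` and `max_length` are illustrative choices, and the transfer tests on ANLI and XNLI are left out.

```python
HPARAMS = {
    "model_name": "roberta-large",
    "num_train_epochs": 3,
    "per_device_train_batch_size": 32,
    "learning_rate": 1e-5,
}


def fine_tune(output_dir="mnli-roberta-large", max_length=128):
    """Fine-tune on MultiNLI and return metrics for both dev sets (needs a GPU in practice)."""
    import numpy as np
    from datasets import load_dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        DataCollatorWithPadding,
        Trainer,
        TrainingArguments,
    )

    raw = load_dataset("multi_nli")
    tokenizer = AutoTokenizer.from_pretrained(HPARAMS["model_name"])

    def encode(batch):
        # Premise and hypothesis are encoded together as one sentence pair.
        return tokenizer(batch["premise"], batch["hypothesis"],
                         truncation=True, max_length=max_length)

    encoded = raw.map(encode, batched=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        HPARAMS["model_name"], num_labels=3
    )

    def accuracy(eval_pred):
        logits, labels = eval_pred
        return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=HPARAMS["num_train_epochs"],
        per_device_train_batch_size=HPARAMS["per_device_train_batch_size"],
        learning_rate=HPARAMS["learning_rate"],
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=encoded["train"],
        data_collator=DataCollatorWithPadding(tokenizer),
        compute_metrics=accuracy,
    )
    trainer.train()
    return {
        split: trainer.evaluate(eval_dataset=encoded[split])
        for split in ("validation_matched", "validation_mismatched")
    }
```

Calling `fine_tune()` starts the full training run. Note that `per_device_train_batch_size` is per device, so the effective batch size is larger than 32 on multi-GPU machines.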

## References

1. Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference. *Proceedings of EMNLP 2015*. https://aclanthology.org/D15-1075/ (arXiv:1508.05326)
2. Williams, A., Nangia, N., & Bowman, S. R. (2018). A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. *NAACL-HLT 2018*. https://aclanthology.org/N18-1101/ (arXiv:1704.05426)
3. Dagan, I., Glickman, O., & Magnini, B. (2006). The PASCAL Recognising Textual Entailment Challenge. In *Machine Learning Challenges (LNCS 3944)*. Springer. https://link.springer.com/chapter/10.1007/11736790_9
4. Cooper, R., Crouch, R., van Eijck, J., Fox, C., van Genabith, J., Jaspars, J., Kamp, H., Milward, D., Pinkal, M., Poesio, M., & Pulman, S. (1996). FraCaS: A Framework for Computational Semantics, deliverable D16. (FraCaS test suite.)
5. Khot, T., Sabharwal, A., & Clark, P. (2018). SciTaiL: A Textual Entailment Dataset from Science Question Answering. *AAAI 2018*. https://ojs.aaai.org/index.php/AAAI/article/view/12022
6. Romanov, A., & Shivade, C. (2018). Lessons from Natural Language Inference in the Clinical Domain. *EMNLP 2018*. https://aclanthology.org/D18-1187/ (MedNLI)
7. Nie, Y., Williams, A., Dinan, E., Bansal, M., Weston, J., & Kiela, D. (2020). Adversarial NLI: A New Benchmark for Natural Language Understanding. *ACL 2020*. https://aclanthology.org/2020.acl-main.441/ (arXiv:1910.14599)
8. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2018). GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. *EMNLP BlackboxNLP Workshop 2018*; later ICLR 2019. https://aclanthology.org/W18-5446/ (arXiv:1804.07461)
9. Thorne, J., Vlachos, A., Christodoulopoulos, C., & Mittal, A. (2018). FEVER: a Large-scale Dataset for Fact Extraction and VERification. *NAACL 2018*. https://aclanthology.org/N18-1074/
10. Conneau, A., Lample, G., Rinott, R., Williams, A., Bowman, S. R., Schwenk, H., & Stoyanov, V. (2018). XNLI: Evaluating Cross-lingual Sentence Representations. *EMNLP 2018*. https://aclanthology.org/D18-1269/ (arXiv:1809.05053)
11. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. *NAACL-HLT 2019*. https://aclanthology.org/N19-1423/
12. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692
13. He, P., Liu, X., Gao, J., & Chen, W. (2021). DeBERTa: Decoding-enhanced BERT with Disentangled Attention. *ICLR 2021*. (Updated as DeBERTaV3, 2023.) Microsoft Research, "Microsoft DeBERTa surpasses human performance on the SuperGLUE benchmark," January 2021. https://www.microsoft.com/en-us/research/blog/microsoft-deberta-surpasses-human-performance-on-the-superglue-benchmark/
14. Parikh, A., Täckström, O., Das, D., & Uszkoreit, J. (2016). A Decomposable Attention Model for Natural Language Inference. *EMNLP 2016*. https://aclanthology.org/D16-1244/
15. Chen, Q., Zhu, X., Ling, Z.-H., Wei, S., Jiang, H., & Inkpen, D. (2017). Enhanced LSTM for Natural Language Inference. *ACL 2017*. https://arxiv.org/abs/1609.06038 (ESIM)
16. Rocktäschel, T., Grefenstette, E., Hermann, K. M., Kočiský, T., & Blunsom, P. (2016). Reasoning about Entailment with Neural Attention. *ICLR 2016*. https://arxiv.org/abs/1509.06664
17. Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S. R., & Smith, N. A. (2018). Annotation Artifacts in Natural Language Inference Data. *NAACL 2018*. https://aclanthology.org/N18-2017/
18. Yin, W., Hay, J., & Roth, D. (2019). Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach. *EMNLP-IJCNLP 2019*. https://aclanthology.org/D19-1404/
19. Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2019). SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. *NeurIPS 2019*. https://arxiv.org/abs/1905.00537
20. McCoy, R. T., Pavlick, E., & Linzen, T. (2019). Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. *ACL 2019*. https://aclanthology.org/P19-1334/ (HANS)
21. Hugging Face. `facebook/bart-large-mnli` model card and `pipeline("zero-shot-classification")` documentation. https://huggingface.co/facebook/bart-large-mnli
22. Poliak, A., Naradowsky, J., Haldar, A., Rudinger, R., & Van Durme, B. (2018). Hypothesis Only Baselines in Natural Language Inference. *Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics (\*SEM 2018)*. https://aclanthology.org/S18-2023/ (arXiv:1805.01042)
23. Stanford NLP Group. The Stanford Natural Language Inference (SNLI) Corpus. https://nlp.stanford.edu/projects/snli/