Natural language inference (NLI)
Last reviewed
Apr 30, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,741 words
Natural language inference (NLI), also known as recognising textual entailment (RTE), is the task of determining whether a hypothesis sentence is entailed by, contradicted by, or neutral with respect to a premise sentence. Given a pair of natural-language sentences (premise P, hypothesis H), an NLI system assigns one of three labels: entailment (H follows from P), contradiction (H is incompatible with P), or neutral (the premise gives no clear evidence either way). The task was crystallised in its modern form by Dagan, Glickman & Magnini in the 2006 PASCAL Recognising Textual Entailment Challenge and was scaled into the deep-learning era by Bowman, Angeli, Potts & Manning's 2015 SNLI corpus.
NLI sits at the centre of natural language understanding evaluation. Whereas surface-level tasks such as part-of-speech tagging or named-entity recognition can be solved with mostly local features, deciding whether one sentence entails another forces a model to handle synonymy, paraphrase, lexical entailment relations, negation, quantifiers, coreference, and a healthy amount of world knowledge. That is why NLI shows up as a sub-task in nearly every general language understanding benchmark, from GLUE and SuperGLUE through to multilingual suites like XNLI and XTREME.
The formal version of the task is straightforward. The model receives a pair (premise, hypothesis) and outputs a label from {entailment, contradiction, neutral}. SNLI, MultiNLI, ANLI and XNLI all use this three-way label set. Some earlier and more specialised datasets use a different label space: RTE-1 through RTE-3 are binary (entailment vs. not-entailment), SciTail is binary (entails vs. neutral), and FEVER keeps three classes under different names (supports, refutes, not-enough-info). The SNLI-style examples below illustrate what each label means in practice.
| Label | Premise | Hypothesis | Why |
|---|---|---|---|
| Entailment | A soccer game with multiple males playing. | Some men are playing a sport. | Multiple males playing soccer are, on the everyday reading, men playing a sport. |
| Contradiction | A man is playing guitar. | The man is sleeping. | Playing guitar and sleeping cannot both be true of the same man at the same time. |
| Neutral | A boy is jumping on a skateboard in the middle of a red bridge. | The boy does a skateboarding trick. | The premise is consistent with a trick but does not state that one is being performed. |
The definition of entailment used in NLI corpora is loose by design. It is the everyday, common-sense reading rather than strict logical entailment. Bowman and colleagues framed the SNLI annotation task around photo captions: workers wrote one caption that is definitely true of the scene, one that might be true, and one that is definitely false. Models are therefore expected to make the kind of pragmatic inferences a fluent reader would make, not the airtight deductions a theorem prover would.
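The task definition above can be made concrete with a small data structure. The following is a minimal sketch, not a standard API: the names `NLILabel`, `NLIExample` and `to_binary` are hypothetical, and the binary collapse shown is the entailment vs. not-entailment scheme used by the two-class datasets mentioned earlier.

```python
from dataclasses import dataclass
from enum import Enum


class NLILabel(Enum):
    """The three-way label set used by SNLI, MultiNLI, ANLI and XNLI."""
    ENTAILMENT = "entailment"
    CONTRADICTION = "contradiction"
    NEUTRAL = "neutral"


@dataclass(frozen=True)
class NLIExample:
    """One premise-hypothesis pair with its gold label (hypothetical container)."""
    premise: str
    hypothesis: str
    label: NLILabel


def to_binary(label: NLILabel) -> str:
    """Collapse the three-way label into entailment vs. not_entailment."""
    return "entailment" if label is NLILabel.ENTAILMENT else "not_entailment"


example = NLIExample(
    premise="A man is playing guitar.",
    hypothesis="The man is sleeping.",
    label=NLILabel.CONTRADICTION,
)
binary_label = to_binary(example.label)
print(example.label.value, "->", binary_label)
```

Collapsing contradiction and neutral into one class loses information, which is one reason three-way and two-way accuracies are not directly comparable.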
NLI is a strong proxy for general language understanding because so many other tasks reduce to it. Question answering can be cast as: does the passage entail the candidate answer when slotted into a question template? Fact verification asks whether a knowledge base supports or refutes a claim, which is the same labelling problem with a different name. Summarisation evaluation increasingly uses NLI to check whether a summary is entailed by its source document, surfacing hallucinations. Retrieval-augmented generation (RAG) pipelines use NLI to filter or score retrieved evidence. Yin, Hay and Roth's 2019 paper "Benchmarking Zero-shot Text Classification" showed that you can even cast topic classification, emotion classification and many other label-prediction problems as entailment by templating the candidate label into a hypothesis like "This text is about politics." That formulation underlies the popular zero-shot-classification pipeline in the Hugging Face transformers library, which loads a BART model fine-tuned on MultiNLI by default.
In short, if a model can do NLI well, it can be repurposed for a long list of downstream classification problems, often with little or no further training.
The lineage of NLI starts before the deep-learning era and stretches back to formal-semantics work in computational linguistics.
FraCaS test suite (Cooper et al. 1996). The European FraCaS project (Framework for Computational Semantics) produced a small, hand-crafted suite of 346 inference problems, each consisting of one or more premises followed by a yes/no/unknown question. The examples were designed to probe specific phenomena like generalised quantifiers, plurals, anaphora and tense. FraCaS predates RTE by a decade and remains a useful diagnostic for compositional inference, especially for symbolic systems.
PASCAL RTE Challenges (Dagan, Glickman & Magnini 2006). The PASCAL Network of Excellence ran a series of Recognising Textual Entailment challenges starting with RTE-1 in 2005. Each challenge released a few hundred premise-hypothesis pairs drawn from real applications such as information extraction, question answering and summarisation, with binary entailment labels. Seventeen teams submitted to RTE-1, and the series ran through RTE-7. The PASCAL RTE work established "is hypothesis H entailed by text T?" as a competitive benchmark and gave the field its name (RTE), which is still used interchangeably with NLI.
SNLI (Bowman, Angeli, Potts & Manning 2015). SNLI was the breakthrough that made data-hungry neural models viable for entailment. The authors collected 570k English sentence pairs by showing crowd workers a Flickr 30k image caption (the premise) and asking them to write three new captions: one entailed, one contradicted, one neutral. The result was a dataset roughly two orders of magnitude larger than anything that had come before. SNLI was published at EMNLP 2015 (arXiv:1508.05326).
MultiNLI (Williams, Nangia & Bowman 2018). MultiNLI extended SNLI's recipe across ten genres of written and spoken English, including fiction, government documents, telephone speech, travel guides and the 9/11 Commission report. The result was 433k pairs, with the evaluation data split into two sets: "matched" (genres seen at training time) and "mismatched" (held-out genres). MultiNLI was published at NAACL 2018 (arXiv:1704.05426).
Specialised datasets. SciTail (Khot, Sabharwal & Clark, AAAI 2018) was built from real science-exam questions and web sentences, producing 27k pairs labelled entails or neutral. MedNLI (Romanov & Shivade, EMNLP 2018) drew premises from the MIMIC-III clinical-notes corpus and used physician annotators to label 14k pairs in the medical domain.
Adversarial NLI (Nie, Williams, Dinan, Bansal, Weston & Kiela, ACL 2020). ANLI was collected via a human-and-model-in-the-loop procedure. Annotators wrote hypotheses that fooled a current state-of-the-art NLI model, then those examples became training data for a stronger model, which was then attacked again. Three rounds (R1, R2, R3) gave a dataset of about 162k examples, with each round collected against stronger models than the last. ANLI is built specifically to push past the saturation point of SNLI and MultiNLI.
XNLI (Conneau et al. 2018). XNLI translated MultiNLI's development and test sets into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu. The training set is still English MultiNLI, so XNLI tests cross-lingual transfer rather than monolingual modelling. It became the standard benchmark for multilingual sentence encoders.
| Dataset | Year | Size | Source domain | Labels | Notable |
|---|---|---|---|---|---|
| FraCaS | 1996 | 346 problems | Hand-crafted, formal semantics | yes/no/unknown | Tests compositional phenomena (quantifiers, plurals, anaphora) |
| RTE-1 to RTE-7 | 2005 to 2011 | A few hundred to a few thousand pairs per round | News, IE, QA outputs | Binary (entailment / not) | Gave RTE its name; sub-task in GLUE and SuperGLUE |
| SNLI | 2015 | 570k pairs | Flickr 30k captions | entailment, contradiction, neutral | First dataset large enough for deep neural training |
| MultiNLI | 2018 | 433k pairs | 10 genres of written and spoken English | entailment, contradiction, neutral | Matched and mismatched test sets for cross-genre evaluation |
| SciTail | 2018 | 27k pairs (10k entails, 17k neutral) | Science exam questions and web text | entails, neutral | First NLI built entirely from naturally occurring sentences |
| MedNLI | 2018 | 14k pairs | MIMIC-III clinical notes | entailment, contradiction, neutral | Annotated by physicians; clinical-domain probe |
| FEVER | 2018 | 185k claims | Wikipedia, with evidence sentences | Supports, Refutes, NotEnoughInfo | Couples claim verification with retrieval of evidence |
| ANLI | 2020 | ~162k pairs across R1, R2, R3 | Wikipedia and other corpora | entailment, contradiction, neutral | Iteratively built to defeat current models |
| XNLI | 2018 | 7.5k dev/test pairs in each of 14 languages | MultiNLI translated | entailment, contradiction, neutral | Standard cross-lingual NLI benchmark |
| QNLI (in GLUE) | 2018 | ~110k pairs | Derived from SQuAD | entailment, not_entailment | Question / sentence pairs reformulated as entailment |
| WNLI (in GLUE) | 2018 | 634 train / 71 dev pairs | Winograd Schema Challenge | entailment, not_entailment | Tiny but coreference-focused; notoriously hard to beat the majority baseline |
| CB (in SuperGLUE) | 2019 | 250 train / 56 val pairs | CommitmentBank (embedded clauses) | entailment, contradiction, neutral | Probes inference over embedded clauses with reporting verbs |
The history of NLI methods tracks the broader history of NLP. Early systems leaned on lexical and syntactic alignment with hand-crafted features. The arrival of SNLI made end-to-end neural models practical. Pre-trained transformer-based models then collapsed most of the gap to human performance on the in-domain corpora. Large language models (LLMs) handle in-distribution NLI in zero shot, although adversarial benchmarks remain difficult.
| Era | Representative methods | What they did | SNLI test accuracy (approx.) |
|---|---|---|---|
| Pre-deep-learning (RTE era) | Lexical overlap, alignment, hand-crafted features (e.g., MacCartney & Manning's NatLog) | Compute alignment between hypothesis and premise tokens, with rule-based handling of negation and quantifier monotonicity | Not applicable (predates SNLI); top RTE-1 systems reached just under 60% binary accuracy, with later rounds scoring higher |
| Early neural | LSTM-with-attention (Rocktäschel et al. 2016), LSTM-based reading models (Cheng et al. 2016) | Encode premise and hypothesis with LSTMs, attend across them | About 83% to 86% |
| Attention without recurrence | Decomposable Attention (Parikh, Täckström, Das & Uszkoreit, EMNLP 2016) | Cross-attention between premise and hypothesis with no LSTM; showed that attention alone, without word-order modelling, can do NLI | 86.3% |
| Enhanced sequential inference | ESIM (Chen, Zhu, Ling, Wei, Jiang & Inkpen, ACL 2017) | BiLSTMs with explicit local inference and composition layers; a strong sequence model that beat decomposable attention | 88.0% (88.6% when ensembled with a syntactic tree-LSTM variant) |
| Pre-trained transformers (since 2018) | BERT, RoBERTa, DeBERTa, ALBERT, ELECTRA | Fine-tune a self-supervised transformer on the NLI training set | Low 90s on SNLI; on MultiNLI matched, from about 84% (BERT-base) to about 91% (large RoBERTa and DeBERTa models) |
| Large language models (since 2020) | GPT-3, GPT-4, Claude, Gemini | Zero- or few-shot prompting; no task-specific fine-tuning required for in-distribution NLI | No single figure; strong on SNLI/MNLI in zero shot, mixed on ANLI |
The transition from feature-engineered systems to neural networks happened almost entirely on SNLI's back. By the time MultiNLI was released, ESIM and decomposable attention were the standard baselines. By late 2018, BERT-large had reached roughly 87% on MultiNLI matched, and successor models such as RoBERTa (Liu et al. 2019) and DeBERTa (He et al. 2021) pushed past 90%.
Numbers in NLI papers vary by version of the test set and by whether the model is fine-tuned or evaluated zero-shot. The figures below come from the corresponding original papers and benchmark leaderboards as last published.
| Benchmark | Human accuracy | Best reported model accuracy | Notes |
|---|---|---|---|
| SNLI test | ~88% (Bowman et al. 2015) | 93.1% (EFL + RoBERTa-large, Wang et al. 2021) | Effectively saturated for several years |
| MultiNLI matched | ~92% (Williams, Nangia & Bowman 2018) | ~91 to 92% for DeBERTa-v3-large, which is roughly where the largest models on the GLUE leaderboard also sit | Mismatched scores are typically within 1 point |
| ANLI R1 | ~92% on R1 according to Nie et al. 2020 | Around 75% (InfoBERT, RoBERTa-large) | Fine-tuned models trained on SNLI+MNLI+FEVER+ANLI |
| ANLI R2 | similar human range | ~58% (ALBERT) | R2 is harder by construction |
| ANLI R3 | similar human range | ~53% (ALBERT) | Still well below human accuracy |
| GLUE MNLI | 92.0% human | About 91.8 to 91.9% (DeBERTa-V3-large) on the matched set, with the largest leaderboard models at about 92% | The 1.5B-parameter DeBERTa model (not a V3 variant) was reported by its authors as the first single model to surpass the human baseline on the SuperGLUE leaderboard |
| FEVER (label only) | not reported by authors | ~50.9% labelling without evidence; ~31.9% labelling plus correct evidence at release | Two-stage retrieve-then-classify pipelines have improved significantly since 2018 |
A caveat about LLMs: as Brown et al. (2020) reported, GPT-3 underperformed on ANLI compared to fine-tuned smaller transformers. Subsequent LLM technical reports (GPT-4, Claude 3, Gemini) have largely stopped reporting NLI metrics, partly because the in-distribution corpora are saturated and partly because adversarial corpora like ANLI are not the headline benchmark they once were.
NLI is rarely evaluated in isolation any more. It is normally one component of a multi-task suite.
NLI's three-way label is so general that it has been pressed into service across many parts of the NLP stack.
Zero-shot text classification. This is probably the most popular practical use of NLI today. The Hugging Face zero-shot-classification pipeline takes a piece of text and a list of candidate labels, builds hypotheses of the form "This example is {label}.", runs each (text, hypothesis) pair through an NLI model (BART-large fine-tuned on MultiNLI by default), and ranks the labels by entailment probability. The trick was popularised by Yin, Hay and Roth's 2019 EMNLP paper.
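The ranking procedure just described can be sketched in a few lines. This is an illustration under stated assumptions rather than the pipeline's actual implementation: `build_hypotheses`, `rank_labels` and `toy_entail_prob` are hypothetical names, and the toy scorer is a crude word-overlap stand-in so the example runs without a model. The `hf_rank_labels` function shows how the same ranking might be obtained from the real pipeline; it needs the `transformers` package and a model download, so it is defined but not called here.

```python
from typing import Callable, List, Tuple

# Hypothesis template described above; "{}" is replaced by each candidate label.
TEMPLATE = "This example is {}."


def build_hypotheses(labels: List[str], template: str = TEMPLATE) -> List[str]:
    """Turn each candidate label into a hypothesis sentence."""
    return [template.format(label) for label in labels]


def rank_labels(
    text: str,
    labels: List[str],
    entail_prob: Callable[[str, str], float],
    template: str = TEMPLATE,
) -> List[Tuple[str, float]]:
    """Score every (text, hypothesis) pair and sort labels by entailment score."""
    hypotheses = build_hypotheses(labels, template)
    scored = [(label, entail_prob(text, hyp)) for label, hyp in zip(labels, hypotheses)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def _words(sentence: str) -> set:
    return {w.strip(".,!?").lower() for w in sentence.split()}


def toy_entail_prob(premise: str, hypothesis: str) -> float:
    """Hypothetical stand-in for an NLI model: share of hypothesis words found in the premise."""
    hyp = _words(hypothesis)
    return len(hyp & _words(premise)) / len(hyp) if hyp else 0.0


def hf_rank_labels(text: str, labels: List[str]) -> List[Tuple[str, float]]:
    """Same ranking via the Hugging Face pipeline (requires transformers and a model download)."""
    from transformers import pipeline

    classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
    result = classifier(text, candidate_labels=labels, hypothesis_template=TEMPLATE)
    return list(zip(result["labels"], result["scores"]))


ranked = rank_labels(
    "The election reshaped national politics.",
    ["sports", "politics", "cooking"],
    toy_entail_prob,
)
print(ranked[0])
```

Keeping the scorer as a plain function argument makes it easy to swap the toy stand-in for a real NLI model without touching the ranking logic.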
Question answering as entailment. Many extractive and multiple-choice QA setups can be reframed as: does the passage entail the candidate answer? QNLI, derived from SQuAD, is a clean instance of this idea.
Fact verification. FEVER and its successors (FEVEROUS, MultiFC) frame claim checking as a retrieval step followed by an NLI step over the retrieved evidence.
Hallucination detection. Take the model's output as the hypothesis and the source document as the premise; if an NLI model does not predict entailment, flag the output as a likely hallucination. This is the core idea behind NLI-based summarisation-faithfulness metrics like SummaC (Laban et al. 2022), and it is closely related to learned or hybrid consistency checkers such as FactCC (Kryscinski et al. 2020) and QAFactEval (Fabbri et al. 2022).
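A minimal sketch of the premise/hypothesis framing above follows. It is not the algorithm of any of the cited metrics: the sentence splitter is naive, the 0.5 threshold is an arbitrary assumption, and `toy_entail_prob` is a hypothetical word-overlap stand-in for a real NLI model, used only so the example runs on its own.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def _words(sentence: str) -> set:
    return {w.strip(".,!?").lower() for w in sentence.split()}


def toy_entail_prob(premise: str, hypothesis: str) -> float:
    """Hypothetical stand-in for an NLI model: share of hypothesis words found in the premise."""
    hyp = _words(hypothesis)
    return len(hyp & _words(premise)) / len(hyp) if hyp else 0.0


def flag_unsupported(
    source: str,
    output: str,
    entail_prob: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed cut-off, would need tuning for a real model
) -> List[str]:
    """Return output sentences whose entailment score against the source is below the threshold."""
    return [
        sentence
        for sentence in split_sentences(output)
        if entail_prob(source, sentence) < threshold
    ]


source = "The company reported revenue of 10 million dollars in 2020. It opened two new offices."
summary = "The company reported revenue of 10 million dollars. The CEO resigned in March."
flagged = flag_unsupported(source, summary, toy_entail_prob)
print(flagged)
```

Scoring each output sentence separately localises the problem to a specific claim, whereas a single document-level score only says that something is unsupported.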
Retrieval reranking. Some retrieval pipelines use an NLI scorer to demote retrieved passages that contradict the query.
Summarisation evaluation. Beyond hallucination detection, NLI scores feed into evaluation suites that grade summaries on factual consistency rather than just ROUGE overlap.
The NLI community spent the second half of the 2010s discovering that big neural NLI models were exploiting shortcuts more than they were doing real inference.
Annotation artefacts. Gururangan et al.'s 2018 paper "Annotation Artifacts in Natural Language Inference Data" showed that a hypothesis-only classifier (one that never sees the premise) reaches about 67% accuracy on SNLI and 53% on MultiNLI, far above the 33% chance baseline. The reason is that crowd-workers writing entailing, neutral and contradicting hypotheses tend to use systematic patterns: contradictions often contain explicit negation; entailments hedge with general words like "some" or "animal"; neutrals add specific extra detail. Models pick up these surface patterns and look much smarter than they are.
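The hypothesis-only idea can be illustrated with a deliberately tiny classifier that never sees a premise. This is a sketch, not the model used by Gururangan et al.: the training sentences are made up for illustration, and the word-vote classifier (`train`, `predict`) is a hypothetical simplification.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Made-up toy hypotheses with labels; premises are deliberately absent.
TRAIN: List[Tuple[str, str]] = [
    ("Nobody is outside.", "contradiction"),
    ("The man is not eating.", "contradiction"),
    ("Some people are outdoors.", "entailment"),
    ("An animal is moving.", "entailment"),
    ("The tall man is eating a sad lunch alone.", "neutral"),
    ("The woman is waiting for her brother.", "neutral"),
]


def tokenize(text: str) -> List[str]:
    return [w.strip(".,!?").lower() for w in text.split()]


def train(examples: List[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count how often each hypothesis word appears under each label."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for hypothesis, label in examples:
        for word in tokenize(hypothesis):
            counts[word][label] += 1
    return counts


def predict(counts: Dict[str, Counter], hypothesis: str, default: str = "neutral") -> str:
    """Let every known word vote for the labels it co-occurred with; the premise is never used."""
    votes: Counter = Counter()
    for word in tokenize(hypothesis):
        votes.update(counts.get(word, {}))
    return votes.most_common(1)[0][0] if votes else default


counts = train(TRAIN)
print(predict(counts, "Nobody is sleeping."))
print(predict(counts, "Some animal is outdoors."))
```

If a classifier like this scores well above chance on a real corpus, the labels are leaking through the hypotheses alone, which is the effect the paper measured.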
Syntactic heuristics. McCoy, Pavlick and Linzen's 2019 paper "Right for the Wrong Reasons" introduced HANS (Heuristic Analysis for NLI Systems), a controlled diagnostic that probes three syntactic heuristics: lexical overlap, subsequence and constituent. BERT trained on MultiNLI failed badly on HANS, showing that strong in-domain accuracy did not imply real syntactic understanding.
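The lexical overlap heuristic is simple enough to write down directly. The sketch below assumes a bag-of-words reading of the heuristic (predict entailment whenever every hypothesis word occurs in the premise); the function name is hypothetical, and the example pair is the kind of case HANS uses, where the heuristic gives the wrong answer.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    return [w.strip(".,!?").lower() for w in text.split()]


def lexical_overlap_heuristic(premise: str, hypothesis: str) -> str:
    """Predict entailment whenever every hypothesis word also occurs in the premise."""
    premise_words = set(tokenize(premise))
    hypothesis_words = set(tokenize(hypothesis))
    return "entailment" if hypothesis_words <= premise_words else "non-entailment"


premise = "The doctor was paid by the actor."
hypothesis = "The doctor paid the actor."
gold = "non-entailment"  # the passive premise reverses who paid whom

prediction = lexical_overlap_heuristic(premise, hypothesis)
print(prediction, "(gold:", gold + ")")
```

A model that has quietly learned this shortcut will look accurate on crowd-sourced data, where high overlap usually does mean entailment, and fail on pairs like this one.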
Adversarial collection. ANLI was the community's response. By having human writers craft examples specifically to fool the current best model, Nie et al. produced a dataset where standard transformers struggle. Round 3 accuracies for fine-tuned RoBERTa-large variants sit around 50%, far below human accuracy of about 92%.
Calibration and uncertainty. NLI models are often over-confident. Recent work has explored confidence calibration, abstention, and the use of NLI as a soft-evidence scorer rather than a hard classifier, especially in safety-critical settings like medical entailment.
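One common way to quantify over-confidence is expected calibration error (ECE), which compares average confidence with accuracy inside confidence bins. The text above does not name a specific metric, so ECE is offered here as an illustrative choice; the equal-width binning and the default of 10 bins are assumptions.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average of |accuracy - mean confidence| over equal-width confidence bins."""
    total = len(confidences)
    if total == 0:
        return 0.0
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Four predictions made with 90% confidence, only half of them right.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False])
print(round(ece, 3))
```

A well-calibrated NLI model would have an ECE near zero, meaning that predictions made with 90% confidence are right about 90% of the time.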
Three threads keep NLI alive in the LLM era.
First, large language models like GPT-4, Claude and Gemini do well on in-distribution NLI in zero shot, but they remain uneven on adversarial and specialised corpora. Lost-in-Inference style analyses (e.g., the 2024 paper "Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models") have argued that NLI is a useful diagnostic for what LLMs actually understand, even when standard NLI accuracy is no longer the headline number in model release papers.
Second, NLI is a workhorse for evaluation pipelines. Factual-consistency checks for summarisation, RAG hallucination detection, claim verification, and model-output verification all rely on NLI under the hood. Zero-shot classification in production systems still depends on NLI-fine-tuned encoder models such as facebook/bart-large-mnli because they are small, fast, and reliable.
Third, cross-lingual and multilingual NLI is an active area. XNLI is still the default benchmark for cross-lingual sentence understanding, and recent multilingual encoders like XLM-R, mDeBERTa-v3 and the multilingual variants of LaBSE and SBERT are evaluated on it routinely.
A quieter trend is the resurgence of logical and rule-based hybrids. Compositional generalisation work (for example, NaturalLogic-style monotonicity reasoners) has come back into fashion as people realise that neural NLI systems still struggle with quantifiers, negation scope and downward-entailing contexts. Hybrid systems combining symbolic monotonicity calculus with neural representations have been competitive on FraCaS-style benchmarks.
Most practitioners do not implement NLI from scratch. The dominant tooling is:
transformers. The pipeline("zero-shot-classification") API uses facebook/bart-large-mnli by default and is the simplest way to apply NLI in a one-liner. There are dozens of pre-fine-tuned NLI checkpoints on the Hub (roberta-large-mnli, microsoft/deberta-v2-xxlarge-mnli, cross-encoder/nli-deberta-v3-large, multilingual MoritzLaurer/mDeBERTa-v3-base-mnli-xnli, and more).
sentence-transformers. Cross-encoder NLI models packaged as sentence-pair classifiers, useful when you want a single softmax over the three labels.
[…] pretrained interface still serves them.
datasets. load_dataset("snli"), "multi_nli", "anli", "xnli", "scitail" and "fever" cover the major corpora with consistent splits.
A typical end-to-end flow looks like: load multi_nli from datasets, fine-tune roberta-large for three epochs with a batch size of 32 and a learning rate of 1e-5, evaluate on matched and mismatched dev sets, then optionally test transfer to ANLI and XNLI.
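The end-to-end flow described above might look roughly like the following sketch. The hyperparameters come from the text; everything else is an assumption, including the output directory name, the single-device setup, and the MultiNLI training-set size of 392,702 pairs used for the step count. `run_finetuning` needs the `datasets` and `transformers` packages, a GPU and large downloads, so it is defined but not called here.

```python
import math

# Hyperparameters quoted in the text; the dictionary itself is a hypothetical convenience.
CONFIG = {
    "dataset": "multi_nli",
    "model": "roberta-large",
    "epochs": 3,
    "batch_size": 32,
    "learning_rate": 1e-5,
}

MNLI_TRAIN_SIZE = 392_702  # assumed size of the MultiNLI training split


def total_training_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps on one device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


def run_finetuning(config: dict = CONFIG) -> dict:
    """Fine-tune and evaluate on both dev sets (heavy: downloads data and model weights)."""
    from datasets import load_dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        DataCollatorWithPadding,
        Trainer,
        TrainingArguments,
    )

    data = load_dataset(config["dataset"])
    tokenizer = AutoTokenizer.from_pretrained(config["model"])

    def encode(batch):
        # Premise and hypothesis are encoded together as one sentence pair.
        return tokenizer(batch["premise"], batch["hypothesis"], truncation=True)

    data = data.map(encode, batched=True)
    model = AutoModelForSequenceClassification.from_pretrained(config["model"], num_labels=3)
    args = TrainingArguments(
        output_dir="nli-out",  # hypothetical output directory
        num_train_epochs=config["epochs"],
        per_device_train_batch_size=config["batch_size"],
        learning_rate=config["learning_rate"],
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=data["train"],
        data_collator=DataCollatorWithPadding(tokenizer),
    )
    trainer.train()
    return {
        "matched": trainer.evaluate(eval_dataset=data["validation_matched"]),
        "mismatched": trainer.evaluate(eval_dataset=data["validation_mismatched"]),
    }


steps = total_training_steps(MNLI_TRAIN_SIZE, CONFIG["batch_size"], CONFIG["epochs"])
print(steps)
```

The step count is useful for sizing learning-rate warm-up and checkpoint intervals before committing to a run of this size. By default the evaluation calls report loss only; an accuracy metric would need to be supplied separately.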