Natural Language Processing

Natural Language Inference

Determining entailment relationships between sentences (SNLI, MNLI).


Natural language inference (NLI) determines whether a hypothesis is entailed by, contradicts, or is neutral to a premise. It's both a standalone task and a critical building block for zero-shot classification, fact verification, and textual reasoning. DeBERTa-v3 holds the top spot on MNLI, while LLMs handle NLI implicitly in their broader reasoning.

History

2015

SNLI (Bowman et al.) provides 570K human-labeled premise-hypothesis pairs — the first large-scale NLI dataset

2017

MultiNLI (Williams et al.) extends NLI to 10 genres with 433K pairs; becomes a core GLUE task

2018

BERT achieves 86.7% on MNLI, establishing transformers as the NLI paradigm

2019

RoBERTa pushes MNLI to 90.2% with improved pretraining; adversarial NLI (ANLI) exposes remaining weaknesses

2020

DeBERTa introduces disentangled attention, pushing MNLI accuracy to 91.1%

2021

NLI is repurposed for zero-shot classification (Yin et al.) — entailment probability as label confidence

2022

ANLI remains unsolved at ~60% for GPT-3 scale models, showing adversarial robustness is still lacking

2024

GPT-4o achieves ~92% on MNLI zero-shot; DeBERTa-v3-large remains the fine-tuned SOTA at 91.9%

How Natural Language Inference Works

Natural Language Inference Pipeline
1

Input formatting

Premise and hypothesis are concatenated with a [SEP] token: '[CLS] premise [SEP] hypothesis [SEP]'

2

Joint encoding

The transformer processes both texts jointly, allowing cross-attention between premise and hypothesis tokens

3

Classification

The [CLS] representation is fed to a 3-way classifier: entailment, contradiction, or neutral

4

Probability calibration

Softmax outputs are calibrated to produce reliable confidence scores for downstream use (e.g., zero-shot classification)
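The four steps above can be sketched in plain Python. The `encode` function is a hypothetical stand-in for a real transformer with a 3-way classification head, and the temperature value is illustrative; this is a minimal sketch of the pipeline shape, not a working model:

```python
import math

LABELS = ["entailment", "neutral", "contradiction"]

def format_input(premise: str, hypothesis: str) -> str:
    """Step 1: concatenate premise and hypothesis with special tokens."""
    return f"[CLS] {premise} [SEP] {hypothesis} [SEP]"

def encode(pair: str) -> list:
    """Steps 2-3 stand-in: a real model would jointly encode the pair
    (cross-attention between premise and hypothesis tokens) and emit
    3-way logits from the [CLS] head. Hard-coded logits for illustration."""
    return [3.1, 0.2, -1.4]

def softmax(logits, temperature=1.0):
    """Step 4: convert logits to probabilities; a temperature fitted on a
    validation set (> 1 flattens overconfident outputs) is one simple
    post-hoc calibration method."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

pair = format_input("A dog runs in the park.", "An animal is outside.")
probs = softmax(encode(pair), temperature=1.5)
pred = LABELS[probs.index(max(probs))]
```

In a real system the entailment probability from step 4, not the argmax label, is what downstream consumers such as zero-shot classifiers use.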

Current Landscape

NLI in 2025 is a mature benchmark task where MNLI is effectively saturated (>91% accuracy, near the estimated human baseline). The real impact of NLI research is downstream: NLI-trained models power zero-shot classification (BART-MNLI), fact verification, and textual entailment checks in RAG pipelines. ANLI remains the hard benchmark, exposing that models still lack robust logical reasoning. The field has shifted focus from standalone NLI accuracy to using NLI as a reasoning primitive within larger systems.
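The zero-shot classification trick works by templating each candidate label into a hypothesis and keeping the label whose hypothesis the NLI model most strongly entails. A minimal sketch, where `nli_entailment_prob` is a hypothetical stub (word overlap in place of a real NLI model such as BART-MNLI) and the template string is an assumption:

```python
import string

def _words(s: str) -> set:
    """Lowercase, strip punctuation, split into a set of words."""
    return set(s.lower().translate(str.maketrans("", "", string.punctuation)).split())

def nli_entailment_prob(premise: str, hypothesis: str) -> float:
    """Hypothetical stub: a real NLI model would return
    P(entailment | premise, hypothesis). Word overlap for illustration only."""
    hyp = _words(hypothesis)
    return len(hyp & _words(premise)) / max(len(hyp), 1)

def zero_shot_classify(text: str, labels: list) -> str:
    """Score each candidate label via a templated hypothesis and pick
    the label with the highest entailment probability."""
    template = "This text is about {}."
    scores = {label: nli_entailment_prob(text, template.format(label))
              for label in labels}
    return max(scores, key=scores.get)
```

For example, `zero_shot_classify("Stocks and finance news moved markets today", ["finance", "sports"])` picks "finance" because its templated hypothesis scores higher. Swapping the stub for a fine-tuned NLI model yields the standard zero-shot pipeline.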

Key Challenges

Annotation artifacts: models exploit spurious correlations (e.g., 'not' signals contradiction) without genuine reasoning

Adversarial robustness: ANLI shows that human-written adversarial examples defeat most models

Fine-grained entailment: soft entailment ('mostly true') and graded similarity aren't captured by 3-class labels

Domain transfer: NLI models trained on general text degrade on scientific, legal, and medical premise-hypothesis pairs

Compositionality: multi-sentence premises with complex logical structure remain challenging

Quick Recommendations

Best fine-tuned NLI

DeBERTa-v3-large fine-tuned on MNLI + SNLI

91.9% on MNLI matched; best encoder model for NLI and zero-shot classification

Zero-shot NLI

GPT-4o or Claude 3.5 Sonnet

~92% on MNLI without fine-tuning; handles complex multi-sentence reasoning

Fact verification

DeBERTa + FEVER-trained classifier

NLI models fine-tuned on fact verification data detect unsupported claims

Lightweight NLI

MiniLM-L12 fine-tuned on MNLI

33M params with 87%+ accuracy; fast enough for real-time applications

What's Next

Expect NLI to be absorbed into general reasoning evaluation rather than tracked as a standalone task. The technique of using entailment as a building block for zero-shot classification, fact-checking, and claim verification will persist and deepen. Adversarial NLI (and harder versions like those in BIG-Bench) will continue to test whether models genuinely reason or merely pattern-match.

Related Tasks

Reading Comprehension

Understanding and answering questions about passages.

Text Ranking

Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020), which introduced late interaction, and then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer: can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.

Table Question Answering

Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.

Question Answering

Extractive and abstractive question answering is one of the oldest NLP benchmarks, from the original SQuAD (2016) to the adversarial complexity of Natural Questions and TriviaQA. Human parity on SQuAD 2.0 was claimed by ALBERT in 2020, effectively saturating the benchmark — but real-world QA over noisy documents, multi-hop reasoning (HotpotQA, MuSiQue), and long-context grounding remain far from solved. The paradigm has shifted from standalone QA models to retrieval-augmented generation (RAG), where the bottleneck moved from answer extraction to retrieval quality. Modern systems like Perplexity and Google's AI Overviews show that production QA is now an end-to-end pipeline problem, not a single-model benchmark.
