Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Natural language inference (NLI) determines whether a hypothesis is entailed by, contradicts, or is neutral to a premise. It's both a standalone task and a critical building block for zero-shot classification, fact verification, and textual reasoning. DeBERTa-v3 holds the top spot on MNLI, while LLMs handle NLI implicitly in their broader reasoning.
History
SNLI (Bowman et al.) provides 570K human-labeled premise-hypothesis pairs — the first large-scale NLI dataset
MultiNLI (Williams et al.) extends NLI to 10 genres with 433K pairs; becomes a core GLUE task
BERT achieves 86.7% on MNLI, establishing transformers as the NLI paradigm
RoBERTa pushes MNLI to 90.2% with improved pretraining; adversarial NLI (ANLI) exposes remaining weaknesses
DeBERTa introduces disentangled attention, pushing fine-tuned MNLI accuracy to 91.1%, within a point of the estimated human baseline (~92%)
NLI is repurposed for zero-shot classification (Yin et al.) — entailment probability as label confidence
ANLI remains unsolved at ~60% for GPT-3 scale models, showing adversarial robustness is still lacking
GPT-4o achieves ~92% on MNLI zero-shot; DeBERTa-v3-large remains the fine-tuned SOTA at 91.9%
How Natural Language Inference Works
Input formatting
Premise and hypothesis are concatenated with a [SEP] token: '[CLS] premise [SEP] hypothesis [SEP]'
Joint encoding
The transformer processes both texts jointly, allowing cross-attention between premise and hypothesis tokens
Classification
The [CLS] representation is fed to a 3-way classifier: entailment, contradiction, or neutral
Probability calibration
Softmax outputs are calibrated to produce reliable confidence scores for downstream use (e.g., zero-shot classification)
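The four steps above can be sketched in plain Python. This is a minimal illustration, not a real model: the logits passed to `classify` are hard-coded stand-ins for what an actual encoder (e.g., a DeBERTa MNLI checkpoint) would produce from the [CLS] vector, and the label order in `LABELS` is an assumption, since real checkpoints differ in how indices map to entailment/neutral/contradiction.

```python
import math

LABELS = ["entailment", "neutral", "contradiction"]  # assumed order; checkpoints vary

def format_input(premise: str, hypothesis: str) -> str:
    # Step 1: concatenate premise and hypothesis with special tokens
    return f"[CLS] {premise} [SEP] {hypothesis} [SEP]"

def softmax(logits):
    # Step 4: convert raw scores into probabilities that sum to 1
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(cls_logits):
    # Steps 2-3: a real encoder would jointly attend over both texts and
    # project the [CLS] vector to three logits; here the logits are given.
    probs = softmax(cls_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs

text = format_input("A dog runs in the park.", "An animal is outside.")
label, probs = classify([3.1, 0.4, -2.0])  # mock logits, not from a real model
```

In practice the same softmax outputs are often temperature-scaled before being used as confidence scores downstream, since raw fine-tuned models tend to be overconfident.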
Current Landscape
NLI in 2025 is a mature benchmark task where MNLI is effectively saturated (>91% accuracy, near the estimated human baseline). The real impact of NLI research is downstream: NLI-trained models power zero-shot classification (BART-MNLI), fact verification, and textual entailment checks in RAG pipelines. ANLI remains the hard benchmark, exposing that models still lack robust logical reasoning. The field has shifted focus from standalone NLI accuracy to using NLI as a reasoning primitive within larger systems.
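The zero-shot-classification trick (Yin et al.) mentioned above is simple to sketch: each candidate label is turned into a hypothesis via a template, and the NLI model's entailment score for that hypothesis becomes the label's logit. The `mock_entailment` scorer below is a hypothetical stand-in for a real MNLI-trained model such as `facebook/bart-large-mnli`; only the wiring around it is the point.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def zero_shot_classify(text, candidate_labels, entailment_score):
    # One NLI call per candidate label: premise = input text,
    # hypothesis = templated label. Entailment scores act as label logits.
    scores = [entailment_score(text, f"This text is about {label}.")
              for label in candidate_labels]
    probs = softmax(scores)
    return sorted(zip(candidate_labels, probs), key=lambda pair: -pair[1])

def mock_entailment(premise, hypothesis):
    # Hypothetical stand-in for a real NLI model's entailment logit.
    return 2.0 if "sports" in hypothesis and "match" in premise else -1.0

ranked = zero_shot_classify("The match went to extra time.",
                            ["sports", "politics", "cooking"], mock_entailment)
```

The hypothesis template ("This text is about {label}.") is itself a tunable choice; different templates can shift zero-shot accuracy by several points.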
Key Challenges
Annotation artifacts: models exploit spurious correlations (e.g., 'not' signals contradiction) without genuine reasoning
Adversarial robustness: ANLI shows that human-written adversarial examples defeat most models
Fine-grained entailment: soft entailment ('mostly true') and graded similarity aren't captured by 3-class labels
Domain transfer: NLI models trained on general text degrade on scientific, legal, and medical premise-hypothesis pairs
Compositionality: multi-sentence premises with complex logical structure remain challenging
Quick Recommendations
Best fine-tuned NLI
DeBERTa-v3-large fine-tuned on MNLI + SNLI
91.9% on MNLI matched; best encoder model for NLI and zero-shot classification
Zero-shot NLI
GPT-4o or Claude 3.5 Sonnet
~92% on MNLI without fine-tuning; handles complex multi-sentence reasoning
Fact verification
DeBERTa + FEVER-trained classifier
NLI models fine-tuned on fact verification data detect unsupported claims
Lightweight NLI
MiniLM-L12 fine-tuned on MNLI
33M params with 87%+ accuracy; fast enough for real-time applications
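For the fact-verification recommendation above, the NLI model acts as a claim checker: score the claim against each retrieved evidence sentence and aggregate the strongest verdicts. A minimal sketch, with `mock_nli` as a hypothetical stand-in for a FEVER-trained classifier and the 0.5 decision thresholds chosen arbitrarily for illustration:

```python
def verify_claim(claim, evidence_sentences, nli_probs):
    # nli_probs(premise, hypothesis) -> (p_entail, p_neutral, p_contradict).
    # Score the claim against every evidence sentence; keep the strongest signals.
    best_entail = max(nli_probs(ev, claim)[0] for ev in evidence_sentences)
    best_contra = max(nli_probs(ev, claim)[2] for ev in evidence_sentences)
    if best_entail >= 0.5 and best_entail > best_contra:
        return "SUPPORTED"
    if best_contra >= 0.5 and best_contra > best_entail:
        return "REFUTED"
    return "NOT ENOUGH INFO"

def mock_nli(premise, claim):
    # Hypothetical stand-in for a FEVER-trained NLI classifier.
    if "1889" in premise and "1889" in claim:
        return (0.92, 0.06, 0.02)
    if "1889" in premise and "1900" in claim:
        return (0.03, 0.07, 0.90)
    return (0.10, 0.80, 0.10)

evidence = ["The Eiffel Tower was completed in 1889."]
verdict = verify_claim("The Eiffel Tower was completed in 1889.",
                       evidence, mock_nli)
```

The three-way FEVER verdicts (supported / refuted / not enough info) map directly onto the NLI label set, which is why NLI checkpoints transfer so well to this task.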
What's Next
Expect NLI to be absorbed into general reasoning evaluation rather than tracked as a standalone task. The technique of using entailment as a building block for zero-shot classification, fact-checking, and claim verification will persist and deepen. Adversarial NLI (and harder versions like those in BIG-Bench) will continue to test whether models genuinely reason or merely pattern-match.
Benchmarks & SOTA
Related Tasks
Reading Comprehension
Understanding and answering questions about passages.
Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed first by ColBERT (2020), which introduced late interaction, and then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.
Table Question Answering
Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.
Question Answering
Extractive and abstractive question answering is one of the oldest NLP benchmarks, from the original SQuAD (2016) to the adversarial complexity of Natural Questions and TriviaQA. Human parity on SQuAD 2.0 was claimed by ALBERT in 2020, effectively saturating the benchmark — but real-world QA over noisy documents, multi-hop reasoning (HotpotQA, MuSiQue), and long-context grounding remain far from solved. The paradigm has shifted from standalone QA models to retrieval-augmented generation (RAG), where the bottleneck moved from answer extraction to retrieval quality. Modern systems like Perplexity and Google's AI Overviews show that production QA is now an end-to-end pipeline problem, not a single-model benchmark.