Reading Comprehension
Understanding and answering questions about passages.
Reading comprehension tests a model's ability to answer questions about a given passage — the quintessential NLU evaluation. SQuAD launched the modern era, but benchmarks now span multi-hop reasoning (HotpotQA), conversational QA (CoQA), and adversarial probing (AdversarialQA). LLMs have saturated most benchmarks, shifting focus to harder multi-document and reasoning-intensive tasks.
History
SQuAD 1.1 (Rajpurkar et al.) provides 100K+ extractive QA pairs; becomes the most cited NLP dataset
BiDAF (Seo et al.) introduces bidirectional attention flow for passage comprehension
BERT surpasses human F1 (93.2) on SQuAD 1.1, demonstrating that extractive RC is 'solved' for simple cases
SQuAD 2.0 adds unanswerable questions — models must learn to abstain when no answer exists in the passage
HotpotQA and MultiRC require multi-hop reasoning across multiple paragraphs
CoQA and QuAC introduce conversational reading comprehension with follow-up questions
UnifiedQA (Khashabi et al.) trains a single T5 model across 20+ RC datasets, showing that a unified text-to-text format can replace format-specific models
GPT-4 achieves near-perfect scores on SQuAD, CoQA, and NarrativeQA; focus shifts to harder benchmarks
DROP (discrete reasoning over paragraphs) and IIRC (incomplete information) remain challenging for LLMs
How Reading Comprehension Works
Input encoding
Question and passage are concatenated into a single sequence and encoded by the transformer; self-attention over the combined sequence lets question tokens attend to the relevant passage spans
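The packing step can be sketched as follows. This is a minimal illustration of the standard BERT-style input layout; token strings stand in for vocabulary ids, and `pack_inputs` is a hypothetical helper, not a library function.

```python
def pack_inputs(question_tokens, passage_tokens, cls="[CLS]", sep="[SEP]"):
    """Concatenate question and passage in the BERT-style layout.

    Segment ids (0 = question side, 1 = passage side) tell the encoder
    which text each token came from; self-attention then runs over the
    whole combined sequence.
    """
    tokens = [cls] + question_tokens + [sep] + passage_tokens + [sep]
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(passage_tokens) + 1)
    return tokens, segment_ids

tokens, segments = pack_inputs(["who", "won"], ["france", "won", "in", "2018"])
# tokens   → ['[CLS]', 'who', 'won', '[SEP]', 'france', 'won', 'in', '2018', '[SEP]']
# segments → [0, 0, 0, 0, 1, 1, 1, 1, 1]
```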
Span prediction (extractive)
Two linear heads predict start and end positions of the answer span within the passage
Answerability check
For SQuAD 2.0-style tasks, a separate head predicts whether the question is answerable from the given passage
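Instead of a dedicated head, a common alternative (used in the original BERT setup for SQuAD 2.0) scores a "null" span at the [CLS] position and abstains when the best real span does not beat it by a tuned margin. A sketch of that comparison:

```python
def is_answerable(start_logits, end_logits, best_span_score, threshold=0.0):
    """SQuAD 2.0-style abstention via a null span at position 0 ([CLS]).

    The model predicts "no answer" when the best real span fails to beat
    the null score by `threshold` (tuned on dev data in practice).
    """
    null_score = start_logits[0] + end_logits[0]
    return best_span_score - null_score > threshold
```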
Free-form generation (abstractive)
For generative RC, the model produces the answer token by token, grounded in the passage context
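The token-by-token loop can be sketched as below; `next_token_logits` is a hypothetical stand-in for the model's forward pass, conditioned on a prompt that contains the question and passage (which is what grounds the answer).

```python
def greedy_answer(next_token_logits, prompt_ids, eos_id, max_new_tokens=32):
    """Generate an answer greedily, one token at a time.

    next_token_logits(ids) returns a score per vocabulary item given the
    sequence so far; decoding stops at the end-of-sequence token.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)
        nxt = max(range(len(logits)), key=logits.__getitem__)
        if nxt == eos_id:
            break
        ids.append(nxt)
    return ids[len(prompt_ids):]  # answer tokens only, prompt stripped
```

Real systems add sampling, length penalties, and (increasingly) citation constraints, but the grounding comes from the prompt, not the decoding loop.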
Current Landscape
Reading comprehension in 2025 is a mature evaluation paradigm where standard benchmarks (SQuAD, CoQA, NarrativeQA) are effectively saturated by frontier LLMs. The task remains valuable as a component of more complex systems — RAG pipelines are essentially reading comprehension at scale. Research has shifted to harder variants: multi-hop reasoning (HotpotQA, MuSiQue), discrete reasoning (DROP), and adversarial robustness. The extractive RC paradigm (selecting spans) is being replaced by generative RC (free-form answers with citations).
Key Challenges
Multi-hop reasoning: questions requiring information from 2+ disconnected paragraphs remain much harder than single-hop
Shortcut exploitation: models often answer from passage-question lexical overlap rather than genuine comprehension
Free-form answer evaluation: comparing generated answers to references is error-prone (correct but differently worded answers score poorly)
Long-document comprehension: passages exceeding context windows require chunking strategies that may miss relevant spans
Conversational context: maintaining coreference and dialogue state across multi-turn QA is unsolved
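The free-form evaluation problem above is concrete in the standard SQuAD metric: token-level F1 forgives word order but still gives zero to a correct paraphrase with no lexical overlap. A minimal sketch (the official script additionally strips articles and punctuation before comparing):

```python
from collections import Counter

def token_f1(prediction, reference):
    """SQuAD-style token-level F1 between predicted and reference answers."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

token_f1("the 2018 world cup", "2018 world cup")  # ≈ 0.857: extra article is forgiven
token_f1("France's team", "the French side")      # → 0.0: correct paraphrase, no overlap
```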
Quick Recommendations
Best overall RC
GPT-4o or Claude 3.5 Sonnet
Near-perfect on SQuAD, CoQA; strong on multi-hop and reasoning-intensive benchmarks
Extractive RC (production)
DeBERTa-v3-large fine-tuned on SQuAD 2.0
93+ F1; no hallucination risk since answers are spans from the passage
Multi-dataset RC
UnifiedQA-v2 (T5-based)
Single model handles extractive, abstractive, multiple-choice, and yes/no QA formats
Conversational RC
GPT-4o with dialogue history in context
Handles coreference resolution and follow-up questions naturally
What's Next
The future of reading comprehension is its integration into agentic and multi-document reasoning systems. Standalone passage-level RC will give way to corpus-level QA where models must find, read, and synthesize across thousands of documents. Expect evaluation to shift from F1 on extracted spans to faithfulness and attribution quality in generated answers.
Related Tasks
Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.
Table Question Answering
Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.
Question Answering
Extractive and abstractive question answering is one of the oldest NLP benchmarks, from the original SQuAD (2016) to the adversarial complexity of Natural Questions and TriviaQA. Human parity on SQuAD 2.0 was claimed by ALBERT in 2020, effectively saturating the benchmark — but real-world QA over noisy documents, multi-hop reasoning (HotpotQA, MuSiQue), and long-context grounding remain far from solved. The paradigm has shifted from standalone QA models to retrieval-augmented generation (RAG), where the bottleneck moved from answer extraction to retrieval quality. Modern systems like Perplexity and Google's AI Overviews show that production QA is now an end-to-end pipeline problem, not a single-model benchmark.