Reading Comprehension
Understanding and answering questions about passages.
Reading comprehension tests a model's ability to answer questions about a given passage — the quintessential NLU evaluation. SQuAD launched the modern era, but benchmarks now span multi-hop reasoning (HotpotQA), conversational QA (CoQA), and adversarial probing (AdversarialQA). LLMs have saturated most benchmarks, shifting focus to harder multi-document and reasoning-intensive tasks.
History
SQuAD 1.1 (Rajpurkar et al.) provides 100K+ extractive QA pairs; becomes the most cited NLP dataset
BiDAF (Seo et al.) introduces bidirectional attention flow for passage comprehension
BERT surpasses human F1 (93.2) on SQuAD 1.1, demonstrating that extractive RC is 'solved' for simple cases
SQuAD 2.0 adds unanswerable questions — models must learn to abstain when no answer exists in the passage
HotpotQA and MultiRC require multi-hop reasoning across multiple paragraphs
CoQA and QuAC introduce conversational reading comprehension with follow-up questions
UnifiedQA (Khashabi et al.) trains a single T5 model across 20+ RC datasets, showing that a unified text-to-text format can replace format-specific models
GPT-4 achieves near-perfect scores on SQuAD, CoQA, and NarrativeQA; focus shifts to harder benchmarks
DROP (discrete reasoning over paragraphs) and IIRC (incomplete information) remain challenging for LLMs
How Reading Comprehension Works
Input encoding
Question and passage are concatenated into a single sequence and encoded by the transformer; self-attention over the combined sequence lets question tokens attend to the relevant passage spans
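The packing step can be sketched as follows. This is a minimal illustration of the standard BERT-style input layout; token strings stand in for vocabulary ids, and `pack_inputs` is a hypothetical helper, not a library function.

```python
def pack_inputs(question_tokens, passage_tokens, cls="[CLS]", sep="[SEP]"):
    """Concatenate question and passage in the BERT-style layout.

    Segment ids (0 = question side, 1 = passage side) tell the encoder
    which text each token came from; self-attention then runs over the
    whole combined sequence.
    """
    tokens = [cls] + question_tokens + [sep] + passage_tokens + [sep]
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(passage_tokens) + 1)
    return tokens, segment_ids

tokens, segments = pack_inputs(["who", "won"], ["france", "won", "in", "2018"])
# tokens   → ['[CLS]', 'who', 'won', '[SEP]', 'france', 'won', 'in', '2018', '[SEP]']
# segments → [0, 0, 0, 0, 1, 1, 1, 1, 1]
```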
Span prediction (extractive)
Two linear heads predict start and end positions of the answer span within the passage
Answerability check
For SQuAD 2.0-style tasks, a separate head predicts whether the question is answerable from the given passage
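Instead of a dedicated head, a common alternative (used in the original BERT setup for SQuAD 2.0) scores a "null" span at the [CLS] position and abstains when the best real span does not beat it by a tuned margin. A sketch of that comparison:

```python
def is_answerable(start_logits, end_logits, best_span_score, threshold=0.0):
    """SQuAD 2.0-style abstention via a null span at position 0 ([CLS]).

    The model predicts "no answer" when the best real span fails to beat
    the null score by `threshold` (tuned on dev data in practice).
    """
    null_score = start_logits[0] + end_logits[0]
    return best_span_score - null_score > threshold
```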
Free-form generation (abstractive)
For generative RC, the model produces the answer token by token, grounded in the passage context
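The token-by-token loop can be sketched as below; `next_token_logits` is a hypothetical stand-in for the model's forward pass, conditioned on a prompt that contains the question and passage (which is what grounds the answer).

```python
def greedy_answer(next_token_logits, prompt_ids, eos_id, max_new_tokens=32):
    """Generate an answer greedily, one token at a time.

    next_token_logits(ids) returns a score per vocabulary item given the
    sequence so far; decoding stops at the end-of-sequence token.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)
        nxt = max(range(len(logits)), key=logits.__getitem__)
        if nxt == eos_id:
            break
        ids.append(nxt)
    return ids[len(prompt_ids):]  # answer tokens only, prompt stripped
```

Real systems add sampling, length penalties, and (increasingly) citation constraints, but the grounding comes from the prompt, not the decoding loop.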
Current Landscape
Reading comprehension in 2025 is a mature evaluation paradigm where standard benchmarks (SQuAD, CoQA, NarrativeQA) are effectively saturated by frontier LLMs. The task remains valuable as a component of more complex systems — RAG pipelines are essentially reading comprehension at scale. Research has shifted to harder variants: multi-hop reasoning (HotpotQA, MuSiQue), discrete reasoning (DROP), and adversarial robustness. The extractive RC paradigm (selecting spans) is being replaced by generative RC (free-form answers with citations).
Key Challenges
Multi-hop reasoning: questions requiring information from 2+ disconnected paragraphs remain much harder than single-hop
Shortcut exploitation: models often answer from passage-question lexical overlap rather than genuine comprehension
Free-form answer evaluation: comparing generated answers to references is error-prone (correct but differently worded answers score poorly)
Long-document comprehension: passages exceeding context windows require chunking strategies that may miss relevant spans
Conversational context: maintaining coreference and dialogue state across multi-turn QA is unsolved
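The free-form evaluation problem above is concrete in the standard SQuAD metric: token-level F1 forgives word order but still gives zero to a correct paraphrase with no lexical overlap. A minimal sketch (the official script additionally strips articles and punctuation before comparing):

```python
from collections import Counter

def token_f1(prediction, reference):
    """SQuAD-style token-level F1 between predicted and reference answers."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

token_f1("the 2018 world cup", "2018 world cup")  # ≈ 0.857: extra article is forgiven
token_f1("France's team", "the French side")      # → 0.0: correct paraphrase, no overlap
```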
Quick Recommendations
Best overall RC
GPT-4o or Claude 3.5 Sonnet
Near-perfect on SQuAD, CoQA; strong on multi-hop and reasoning-intensive benchmarks
Extractive RC (production)
DeBERTa-v3-large fine-tuned on SQuAD 2.0
93+ F1; no hallucination risk since answers are spans from the passage
Multi-dataset RC
UnifiedQA-v2 (T5-based)
Single model handles extractive, abstractive, multiple-choice, and yes/no QA formats
Conversational RC
GPT-4o with dialogue history in context
Handles coreference resolution and follow-up questions naturally
What's Next
The future of reading comprehension is its integration into agentic and multi-document reasoning systems. Standalone passage-level RC will give way to corpus-level QA where models must find, read, and synthesize across thousands of documents. Expect evaluation to shift from F1 on extracted spans to faithfulness and attribution quality in generated answers.
Related Tasks
Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.
Table Question Answering
Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.
Question Answering
Extractive and abstractive question answering is one of the oldest NLP benchmarks, from the original SQuAD (2016) to the adversarial complexity of Natural Questions and TriviaQA. Human parity on SQuAD 2.0 was claimed by ALBERT in 2020, effectively saturating the benchmark — but real-world QA over noisy documents, multi-hop reasoning (HotpotQA, MuSiQue), and long-context grounding remain far from solved. The paradigm has shifted from standalone QA models to retrieval-augmented generation (RAG), where the bottleneck moved from answer extraction to retrieval quality. Modern systems like Perplexity and Google's AI Overviews show that production QA is now an end-to-end pipeline problem, not a single-model benchmark.