Natural Language Processing

Reading Comprehension

Understanding and answering questions about passages.


Reading comprehension tests a model's ability to answer questions about a given passage — the quintessential NLU evaluation. SQuAD launched the modern era, but benchmarks now span multi-hop reasoning (HotpotQA), conversational QA (CoQA), and adversarial probing (AdversarialQA). LLMs have saturated most benchmarks, shifting focus to harder multi-document and reasoning-intensive tasks.

History

2016

SQuAD 1.1 (Rajpurkar et al.) provides 100K+ extractive QA pairs; becomes one of the most cited NLP datasets

2017

BiDAF (Seo et al.) introduces bidirectional attention flow for passage comprehension

2018

BERT reaches 93.2 F1 on SQuAD 1.1, surpassing the human baseline (91.2) and demonstrating that extractive RC is 'solved' for simple cases

2018

SQuAD 2.0 adds unanswerable questions — models must learn to abstain when no answer exists in the passage

2018

HotpotQA and MultiRC require multi-hop reasoning across multiple paragraphs

2019

CoQA and QuAC introduce conversational reading comprehension with follow-up questions

2020

UnifiedQA (Khashabi et al.) trains a single T5 model across 20+ RC datasets, showing that unifying question formats lets one model generalize across extractive, abstractive, and multiple-choice QA

2023

GPT-4 achieves near-perfect scores on SQuAD, CoQA, and NarrativeQA; focus shifts to harder benchmarks

2024

DROP (discrete reasoning over paragraphs, introduced 2019) and IIRC (incomplete information) remain challenging even for frontier LLMs

How Reading Comprehension Works

Reading Comprehension Pipeline
1

Input encoding

Question and passage are concatenated and encoded by the transformer; self-attention across the joint sequence lets question tokens attend to the relevant passage spans

2

Span prediction (extractive)

Two linear heads predict start and end positions of the answer span within the passage

3

Answerability check

For SQuAD 2.0-style tasks, a separate head predicts whether the question is answerable from the given passage

4

Free-form generation (abstractive)

For generative RC, the model produces the answer token by token, grounded in the passage context
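The extractive path (steps 1-2) can be sketched in a few lines of PyTorch. The encoder below is a toy embedding layer standing in for a pretrained transformer such as BERT or DeBERTa, and all class and variable names are illustrative, not from any specific library:

```python
# Minimal sketch of extractive span prediction. A real system would replace
# the embedding layer with a pretrained transformer encoder.
import torch
import torch.nn as nn

class SpanPredictor(nn.Module):
    def __init__(self, hidden=64, vocab=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)  # stand-in for a transformer encoder
        self.qa_head = nn.Linear(hidden, 2)       # two logits per token: start, end

    def forward(self, input_ids):
        h = self.embed(input_ids)                 # (batch, seq, hidden)
        logits = self.qa_head(h)                  # (batch, seq, 2)
        start_logits, end_logits = logits.unbind(dim=-1)
        return start_logits, end_logits

torch.manual_seed(0)
model = SpanPredictor()
# Step 1: question and passage concatenated into one token sequence
input_ids = torch.randint(0, 1000, (1, 32))
# Step 2: the two heads score every token as a span start / span end
start_logits, end_logits = model(input_ids)
# Greedy decoding: best start, then best end at or after it
start = start_logits.argmax(dim=-1).item()
end = start + end_logits[0, start:].argmax().item()
print(start, end)  # predicted answer span indices within the sequence
```

An answerability head (step 3) would typically be a third classifier over the pooled sequence representation, predicting "no answer" when its score beats the best span score.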

Current Landscape

Reading comprehension in 2025 is a mature evaluation paradigm where standard benchmarks (SQuAD, CoQA, NarrativeQA) are effectively saturated by frontier LLMs. The task remains valuable as a component of more complex systems — RAG pipelines are essentially reading comprehension at scale. Research has shifted to harder variants: multi-hop reasoning (HotpotQA, MuSiQue), discrete reasoning (DROP), and adversarial robustness. The extractive RC paradigm (selecting spans) is being replaced by generative RC (free-form answers with citations).
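The "RAG is reading comprehension at scale" point can be made concrete with a toy retrieve-then-read loop. Word overlap stands in here for a real retriever (BM25 or dense embeddings); the corpus and question are invented examples:

```python
# Toy sketch of retrieval-augmented QA: score passages against the question,
# then hand only the best passage to the reader (the RC step).

def overlap_score(question: str, passage: str) -> int:
    # Plain word overlap as a stand-in retrieval score
    q = set(question.lower().split())
    p = set(passage.lower().split())
    return len(q & p)

corpus = [
    "SQuAD contains questions about Wikipedia articles.",
    "HotpotQA requires reasoning across multiple paragraphs.",
    "CoQA is a conversational question answering dataset.",
]
question = "Which dataset requires reasoning across multiple paragraphs?"

# Retrieval: pick the passage most lexically similar to the question
best = max(corpus, key=lambda p: overlap_score(question, p))
print(best)  # → "HotpotQA requires reasoning across multiple paragraphs."
```

Note that this retriever exhibits exactly the shortcut-exploitation failure mode listed below: it rewards lexical overlap, not comprehension, which is why dense retrievers and rerankers exist.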

Key Challenges

Multi-hop reasoning: questions requiring information from 2+ disconnected paragraphs remain much harder than single-hop

Shortcut exploitation: models often answer from passage-question lexical overlap rather than genuine comprehension

Free-form answer evaluation: comparing generated answers to references is error-prone (correct but differently worded answers score poorly)

Long-document comprehension: passages exceeding context windows require chunking strategies that may miss relevant spans

Conversational context: maintaining coreference and dialogue state across multi-turn QA is unsolved
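The free-form evaluation problem above is easiest to see in the standard SQuAD-style token F1. The sketch below is simplified (the official SQuAD script also strips articles and punctuation before tokenizing); it shows how a correct but differently worded answer scores poorly:

```python
# SQuAD-style token-level F1 between a predicted answer and a reference.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the Eiffel Tower", "Eiffel Tower"))  # → 0.8: same content, penalized for "the"
print(token_f1("it is located in Paris", "Paris"))   # → 0.33: correct answer, low score
```

This is why generative RC increasingly pairs token F1 with model-based or attribution-aware metrics.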

Quick Recommendations

Best overall RC

GPT-4o or Claude 3.5 Sonnet

Near-perfect on SQuAD, CoQA; strong on multi-hop and reasoning-intensive benchmarks

Extractive RC (production)

DeBERTa-v3-large fine-tuned on SQuAD 2.0

93+ F1; no hallucination risk since answers are spans from the passage

Multi-dataset RC

UnifiedQA-v2 (T5-based)

Single model handles extractive, abstractive, multiple-choice, and yes/no QA formats

Conversational RC

GPT-4o with dialogue history in context

Handles coreference resolution and follow-up questions naturally

What's Next

The future of reading comprehension is its integration into agentic and multi-document reasoning systems. Standalone passage-level RC will give way to corpus-level QA where models must find, read, and synthesize across thousands of documents. Expect evaluation to shift from F1 on extracted spans to faithfulness and attribution quality in generated answers.

Benchmarks & SOTA

No datasets indexed for this task yet.

Contribute on GitHub

Related Tasks

Natural Language Inference

Determining entailment relationships between sentences (SNLI, MNLI).

Text Ranking

Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.

Table Question Answering

Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.

Question Answering

Extractive and abstractive question answering is one of the oldest NLP benchmarks, from the original SQuAD (2016) to the adversarial complexity of Natural Questions and TriviaQA. Human parity on SQuAD 2.0 was claimed by ALBERT in 2020, effectively saturating the benchmark — but real-world QA over noisy documents, multi-hop reasoning (HotpotQA, MuSiQue), and long-context grounding remain far from solved. The paradigm has shifted from standalone QA models to retrieval-augmented generation (RAG), where the bottleneck moved from answer extraction to retrieval quality. Modern systems like Perplexity and Google's AI Overviews show that production QA is now an end-to-end pipeline problem, not a single-model benchmark.

Something wrong or missing?

Help keep Reading Comprehension benchmarks accurate. Report outdated results, missing benchmarks, or errors.
