Document Question Answering
Answer questions about document content including text, tables, and layouts. Essential for document AI.
How Document Question Answering Works
A technical deep-dive into Document Question Answering and RAG (Retrieval-Augmented Generation), and how to compose building blocks into a powerful document intelligence pipeline.
The Core Insight
Document QA is not a single model problem. It is a pipeline of composable building blocks.
Documents contain vast amounts of information, but finding specific answers is like searching for a needle in a haystack. Users want to ask questions in natural language and get precise, cited answers.
Break the problem into composable building blocks: process documents into text, chunk and embed that text, retrieve relevant passages, and generate answers with citations.
Document QA is not a single model but a pipeline. Each stage can be optimized independently. The magic happens when you compose the right building blocks together.
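The whole pipeline can be expressed as a handful of small functions composed in sequence. Below is a minimal sketch: the function names, the bag-of-words "embedding", and the placeholder generator are illustrative assumptions, not any specific library's API, chosen so the example runs with no external services.

```python
# Minimal pipeline sketch: process -> chunk -> embed -> retrieve -> generate.
# The embedding here is a toy bag-of-words vector; in practice you would swap in
# a real OCR step, embedding model, and LLM at each stage.
import math
from collections import Counter

def process_document(raw_text: str) -> str:
    # Stage 1: in a real system this is PDF parsing / OCR; here it is a pass-through.
    return raw_text

def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    # Stage 2: fixed-size chunks (in words) with overlap.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text: str) -> Counter:
    # Stage 3 (toy): bag-of-words "embedding"; replace with a real embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    # Stage 4a: nearest-neighbour search over chunk embeddings.
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def generate(question: str, context: list[str]) -> str:
    # Stage 4b: placeholder for an LLM call that answers from the retrieved context.
    return f"[LLM answer to {question!r} grounded in {len(context)} retrieved chunk(s)]"

text = process_document("The contract is payable within 30 days of invoice. Late payments accrue 2% interest per month.")
chunks = chunk(text, size=12, overlap=4)
print(generate("What are the payment terms?", retrieve("What are the payment terms?", chunks)))
```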
Extractive vs Abstractive QA
- Extractive QA: find and highlight the exact span in the document that answers the question (see the sketch below).
- Abstractive QA: generate a natural language answer based on retrieved context.
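Extractive QA is the easiest to try locally. A minimal sketch with the Hugging Face `question-answering` pipeline; the default checkpoint it downloads is a SQuAD-tuned model, and any extractive QA checkpoint can be substituted:

```python
from transformers import pipeline

# Extractive QA: the model returns a span (start/end offsets) from the context,
# not free-form text, so the answer is always a verbatim quote from the document.
qa = pipeline("question-answering")

result = qa(
    question="What are the payment terms?",
    context="Invoices are due within 30 days. Late payments accrue 2% monthly interest.",
)
print(result["answer"], result["score"], result["start"], result["end"])
```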
The RAG Pipeline
Retrieval-Augmented Generation: the standard architecture for document QA at scale.
[Pipeline diagram: Document QA walkthrough, from document to answer]
Building Blocks Composition
Document QA combines three core building blocks from CodeSOTA. Each is independently optimizable.
- Document to Structured (OCR): parse PDFs, extract tables, run OCR on images
- Text to Vector (Embedding): convert text chunks into dense embeddings for similarity search
- Text to Text (LLM): generate answers from retrieved context and user questions
How the Building Blocks Connect
Each building block can be swapped independently. Upgrade your embedding model without changing your LLM.
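Swappability follows from giving each block a narrow interface. Below is a sketch using Python `Protocol` classes; the class and method names are illustrative, not taken from any particular framework:

```python
from typing import Protocol
import math

class Embedder(Protocol):
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class Generator(Protocol):
    def generate(self, question: str, context: list[str]) -> str: ...

def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def answer(question: str, chunks: list[str], embedder: Embedder, generator: Generator, k: int = 5) -> str:
    # The pipeline depends only on the two interfaces above, so upgrading the
    # embedding model or swapping the LLM never touches this function.
    q_vec = embedder.embed([question])[0]
    chunk_vecs = embedder.embed(chunks)
    ranked = sorted(range(len(chunks)), key=lambda i: _cosine(q_vec, chunk_vecs[i]), reverse=True)
    return generator.generate(question, [chunks[i] for i in ranked[:k]])
```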
Chunking Strategies
How you split documents dramatically affects retrieval quality. There is no one-size-fits-all approach.
- Fixed-size chunking: `chunk_size=512, overlap=50`
- Sentence-based chunking: `max_sentences=10`
- Semantic chunking: `similarity_threshold=0.75`
- Document-structure chunking: `use_headers=True, max_section_size=1000`
- Hierarchical chunking: `levels=['paragraph', 'section', 'document']`

- Start with 512-1000 tokens per chunk. Too small loses context; too large dilutes relevance.
- Always use overlap (10-20%) to avoid splitting important context at boundaries (see the sketch after this list).
- Include metadata (section headers, page numbers) for better retrieval and citation.
- For structured documents, prefer document-structure chunking over fixed size.
- Test with your actual queries. The best strategy depends on your use case.
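A minimal sketch of fixed-size chunking with overlap and per-chunk metadata. Token counting is approximated by whitespace words here; a real system would use the tokenizer of its embedding model:

```python
def chunk_with_overlap(text: str, source: str, chunk_size: int = 512, overlap: int = 50) -> list[dict]:
    # Split on whitespace as a rough token proxy; swap in a real tokenizer in production.
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if not piece:
            break
        chunks.append({
            "text": " ".join(piece),
            "source": source,      # kept as metadata for citation
            "start_word": start,   # position metadata helps point back into the document
        })
        if start + chunk_size >= len(words):
            break
    return chunks

chunks = chunk_with_overlap("lorem ipsum " * 600, source="contract.pdf")
print(len(chunks), chunks[0]["start_word"], chunks[1]["start_word"])
```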
Retrieval Methods
Finding the right chunks is critical. The best systems combine multiple retrieval approaches.
- Dense retrieval: embed the query and documents, find nearest neighbors in vector space.
  `query_vector = embed(question); results = vector_db.search(query_vector, k=5)`
- Sparse retrieval (BM25): traditional keyword matching with TF-IDF weighting.
  `score = sum(IDF(term) * TF(term, doc)) for term in query`
- Hybrid search: combine dense and sparse scores for the best of both worlds (a fusion sketch appears after the vector database table below).
  `final_score = alpha * dense_score + (1 - alpha) * sparse_score`
- Reranking: retrieve candidates with a fast method, then rerank with a cross-encoder.
  `candidates = bm25.search(k=100); reranked = cross_encoder.rerank(query, candidates)`

Vector Databases for Production
| Database | Type | Strengths | Scale | Pricing |
|---|---|---|---|---|
| Pinecone | Managed | Fully managed, fast, metadata filtering | Billions of vectors | Pay per use |
| Weaviate | Open Source / Managed | GraphQL API, hybrid search, modules | Millions to billions | Self-host free, managed paid |
| Qdrant | Open Source / Managed | Rust performance, filtering, payloads | Millions to billions | Self-host free, cloud paid |
| Chroma | Open Source | Simple API, embedded mode, Python-native | Millions | Free |
| pgvector | Open Source | PostgreSQL extension, familiar SQL | Millions | Free (use existing Postgres) |
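Returning to hybrid search from the retrieval methods above: the fusion rule is a one-liner once both score sets exist, and the only subtlety is normalizing dense and sparse scores to a common range before mixing. A minimal sketch, with per-chunk scores assumed to be precomputed:

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    # Dense (cosine) and sparse (BM25) scores live on different scales,
    # so normalize each to [0, 1] before combining.
    lo, hi = min(scores.values()), max(scores.values())
    return {k: (v - lo) / (hi - lo) if hi > lo else 0.0 for k, v in scores.items()}

def hybrid_scores(dense: dict[str, float], sparse: dict[str, float], alpha: float = 0.5) -> dict[str, float]:
    d, s = min_max(dense), min_max(sparse)
    ids = set(d) | set(s)
    # final_score = alpha * dense_score + (1 - alpha) * sparse_score
    return {i: alpha * d.get(i, 0.0) + (1 - alpha) * s.get(i, 0.0) for i in ids}

dense = {"chunk_1": 0.82, "chunk_2": 0.76, "chunk_3": 0.40}
sparse = {"chunk_1": 3.1, "chunk_3": 7.8}
ranked = sorted(hybrid_scores(dense, sparse, alpha=0.6).items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```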
Approaches: End-to-End vs RAG vs Long-Context
Three fundamentally different ways to build document QA. Choose based on your documents and requirements.
- End-to-End: a single model reads the document image and answers directly.
- RAG: compose OCR + chunking + retrieval + LLM.
- Long-Context: feed the entire document directly to an LLM (Claude 200K, GPT-4 128K); sketched below.
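The long-context route needs the least machinery: read the document text and send it with the question in a single call. A minimal sketch using the OpenAI Python client; the model name and file path are placeholders, and any long-context chat model works the same way:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Long-context approach: no chunking or retrieval, the whole document is the context.
document_text = open("contract.txt", encoding="utf-8").read()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; pick any model whose context window fits the document
    messages=[
        {"role": "system", "content": "Answer strictly from the provided document. Quote the relevant passage."},
        {"role": "user", "content": f"Document:\n{document_text}\n\nQuestion: What are the payment terms?"},
    ],
)
print(response.choices[0].message.content)
```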
Models and Frameworks
| Model/Framework | Type | Architecture | Context | Strengths |
|---|---|---|---|---|
| LayoutLMv3 | End-to-End | Multimodal Transformer | 512 tokens | Understands document layout, tables, forms |
| Donut | End-to-End | Vision Encoder-Decoder | Image-based | OCR-free, reads directly from pixels |
| Pix2Struct | End-to-End | Vision-Language | 4096 patches | Charts, infographics, screenshots |
| LlamaIndex | RAG Framework | Pipeline Orchestration | Configurable | Full RAG pipeline, many integrations |
| LangChain | RAG Framework | Pipeline Orchestration | Configurable | Flexible chains, wide ecosystem |
| Haystack | RAG Framework | Pipeline Orchestration | Configurable | Production-ready, enterprise features |
Use End-to-End when (see the sketch after these lists):
- Documents are short (1-2 pages)
- Layout is important (forms, tables)
- You need fast single-doc extraction

Use RAG when:
- You have many or long documents
- You need citations/sources
- Questions span multiple docs

Use Long-Context when:
- A single doc fits under 100K tokens
- Simplicity matters most
- Budget allows API costs
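For the end-to-end route, a LayoutLM-family checkpoint can be tried in a few lines with the Hugging Face `document-question-answering` pipeline. A minimal sketch: the checkpoint name is one commonly used example, and the pipeline assumes Tesseract/pytesseract is installed for the OCR step:

```python
from transformers import pipeline

# End-to-end document QA: the model consumes the page image (plus OCR word boxes)
# and predicts the answer span directly, layout included.
doc_qa = pipeline(
    "document-question-answering",
    model="impira/layoutlm-document-qa",  # example checkpoint; swap in your own
)

result = doc_qa(image="invoice.png", question="What is the invoice total?")
print(result)  # e.g. a list of {"answer": ..., "score": ..., "start": ..., "end": ...}
```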
Benchmarks
Standard datasets for evaluating document QA systems.
| Benchmark | Type | Size | Description | Metric | SOTA |
|---|---|---|---|---|---|
| DocVQA | Document Visual QA | 50K QA pairs | Questions about scanned documents | ANLS | Donut: 84.1% |
| InfographicsVQA | Infographic QA | 30K QA pairs | Questions about infographics and charts | ANLS | Pix2Struct: 42.5% |
| DUDE | Multi-page Document QA | 5K documents | Questions requiring multi-page reasoning | ANLS | GPT-4V: 53.7% |
| Natural Questions | Open Domain QA | 307K QA pairs | Real Google search questions | EM / F1 | FiD + RAG: 51.4 / 57.1 |
| SQuAD 2.0 | Reading Comprehension | 150K QA pairs | Questions from Wikipedia paragraphs | EM / F1 | Human: 86.8 / 89.5 |
- ANLS: Average Normalized Levenshtein Similarity. Measures edit distance between prediction and ground truth, normalized by answer length.
- Exact Match (EM): binary metric, 1 if the prediction exactly matches the ground truth, 0 otherwise. Strict but clear.
- F1: token-level overlap between prediction and ground truth; the harmonic mean of precision and recall. All three are sketched below.
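All three metrics are short functions. A minimal sketch; the 0.5 ANLS cutoff follows the usual DocVQA-style protocol and should be treated as an assumption if your benchmark specifies otherwise:

```python
from collections import Counter

def exact_match(pred: str, truth: str) -> float:
    # EM: 1 if the normalized strings are identical, else 0.
    return float(pred.strip().lower() == truth.strip().lower())

def token_f1(pred: str, truth: str) -> float:
    # F1: harmonic mean of token-level precision and recall.
    p, t = pred.lower().split(), truth.lower().split()
    common = sum((Counter(p) & Counter(t)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(t)
    return 2 * precision * recall / (precision + recall)

def anls(pred: str, truth: str, tau: float = 0.5) -> float:
    # ANLS: 1 - normalized Levenshtein distance, zeroed below a threshold
    # (DocVQA-style evaluation commonly uses tau = 0.5).
    a, b = pred.strip().lower(), truth.strip().lower()
    prev = list(range(len(b) + 1))  # classic dynamic-programming edit distance
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    score = 1 - prev[-1] / max(len(a), len(b), 1)
    return score if score >= tau else 0.0

print(exact_match("30 days", "30 days"), token_f1("within 30 days", "30 days"), round(anls("30 dayz", "30 days"), 3))
```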
Code Examples
From quick RAG setup to production-ready hybrid search.
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI

# 1. Load documents from a directory
#    Supports PDF, DOCX, TXT, HTML, and more
documents = SimpleDirectoryReader("./data/contracts/").load_data()

# 2. Configure chunking, then build the vector index
#    Under the hood: chunking -> embedding -> vector store
Settings.chunk_size = 512
Settings.chunk_overlap = 50
index = VectorStoreIndex.from_documents(documents)

# 3. Create a query engine with citations
query_engine = index.as_query_engine(
    llm=OpenAI(model="gpt-4"),
    response_mode="compact",  # or "tree_summarize" for longer answers
    similarity_top_k=5,       # retrieve top 5 chunks
)

# 4. Ask questions and get cited answers
response = query_engine.query(
    "What are the payment terms in the contract?"
)
print(f"Answer: {response.response}")
print("\nSources:")
for node in response.source_nodes:
    print(f"  - {node.node.metadata['file_name']}: {node.score:.3f}")
    print(f"    '{node.node.text[:100]}...'")
```

Quick Reference
Building blocks:
- Document to Structured (OCR)
- Text to Vector (embed)
- Text to Text (LLM)

Recommended stack:
- LlamaIndex or LangChain
- Hybrid search (dense + BM25)
- Pinecone or Qdrant

End-to-end models:
- LayoutLMv3 for forms
- Donut for receipts
- Pix2Struct for charts

Defaults:
- Chunk size: 512-1000 tokens
- Retrieval: hybrid for best quality
- LLM: GPT-4 or Claude for generation

Key takeaways:
1. Document QA = composition of building blocks (OCR + embed + retrieve + LLM)
2. Chunking strategy significantly impacts retrieval quality
3. Hybrid search (dense + sparse) beats either alone
4. Choose end-to-end for forms, RAG for long docs, long-context LLM for simplicity
Use Cases
- ✓ Contract analysis
- ✓ Invoice querying
- ✓ Form processing
- ✓ Legal document review
- ✓ Research paper Q&A
Architectural Patterns
Layout-Aware Transformers
Models that understand 2D document layout (LayoutLM).
- Pros: understands tables; position-aware
- Cons: needs layout annotations; fixed page size
VLM on Document Images
Treat documents as images, use vision-language models.
- Pros: no OCR needed; handles any format
- Cons: resolution limits; may miss small text
OCR + LLM
Extract text with OCR, query with LLM.
- Pros: simple pipeline; accurate text extraction
- Cons: loses layout; OCR errors propagate
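A minimal sketch of the OCR + LLM pattern, using pytesseract for extraction and a chat model for the question. The model name and file path are placeholders; any OCR engine and LLM slot in here:

```python
from PIL import Image
import pytesseract
from openai import OpenAI

# Step 1: OCR the page image into plain text (layout is lost at this point,
# and any OCR errors will propagate into the answer).
page_text = pytesseract.image_to_string(Image.open("scanned_page.png"))

# Step 2: ask an LLM to answer from the extracted text.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system", "content": "Answer only from the provided OCR text. Say 'not found' if the answer is missing."},
        {"role": "user", "content": f"OCR text:\n{page_text}\n\nQuestion: What is the due date?"},
    ],
)
print(response.choices[0].message.content)
```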
Implementations
API Services
- GPT-4V (OpenAI): direct document image understanding. Strong OCR.
- Azure Document Intelligence (Microsoft): layout extraction + custom QA training.
Quick Facts
- Input: Document
- Output: Text
- Implementations: 3 open source, 2 API
- Patterns: 3 approaches