
Document Question Answering

Answer questions about document content including text, tables, and layouts. Essential for document AI.

How Document Question Answering Works

A technical deep-dive into Document Question Answering and RAG (Retrieval-Augmented Generation). Understanding how to compose building blocks into a powerful document intelligence pipeline.

1

The Core Insight

Document QA is not a single model problem. It is a pipeline of composable building blocks.

The Problem

Documents contain vast amounts of information, but finding specific answers is like searching for a needle in a haystack. Users want to ask questions in natural language and get precise, cited answers.

The Solution

Break the problem into composable building blocks: process documents into text, chunk and embed that text, retrieve relevant passages, and generate answers with citations.

The Key Idea

Document QA is not a single model but a pipeline. Each stage can be optimized independently. The magic happens when you compose the right building blocks together.

Extractive vs Abstractive QA

Extractive QA

Find and highlight the exact span in the document that answers the question

EXAMPLE:
Q: When was the contract signed?
Context: "This agreement, signed on March 15, 2024, establishes..."
A: March 15, 2024
Pros
+ Always grounded in source
+ No hallucination risk
+ Easy to verify
Cons
- Cannot synthesize information
- Limited to verbatim text
- May miss paraphrased answers
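
For reference, extractive QA takes only a few lines with an off-the-shelf reader model. A minimal sketch using the Hugging Face question-answering pipeline (the model checkpoint is illustrative, not prescriptive):

from transformers import pipeline

# Any SQuAD-style extractive reader works here; this checkpoint is one common choice.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

result = qa(
    question="When was the contract signed?",
    context="This agreement, signed on March 15, 2024, establishes...",
)

# The prediction is an exact span with character offsets, so it can always be
# verified against the source text.
print(result["answer"])                          # "March 15, 2024"
print(result["start"], result["end"], result["score"])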
Abstractive QA (RAG)

Generate a natural language answer based on retrieved context

EXAMPLE:
Q: What are the main terms of the contract?
Context: "...payment of $50,000... delivery within 30 days... termination clause..."
A: The contract specifies a $50,000 payment with delivery within 30 days, and includes a termination clause.
Pros
+ Natural responses
+ Can synthesize multiple sources
+ Handles complex questions
Cons
- Hallucination risk
- Harder to verify
- May miss nuances
2

The RAG Pipeline

Retrieval-Augmented Generation: the standard architecture for document QA at scale.

Document QA Pipeline

Ingestion (once per document):
Document (PDF, DOCX, images, scans) -> Parse/OCR (extract text and structure) -> Chunk (split into passages) -> Embed (convert to vectors) -> Index (store in vector DB)

Query (every question):
Query (user question) -> Retrieve (find relevant chunks) -> Generate (LLM answers with context) -> Answer (with citations)

Walkthrough: From Document to Answer

SAMPLE DOCUMENT:
SERVICES AGREEMENT

This Agreement is entered into as of January 15, 2024.

1. SERVICES
The Consultant shall provide software development services including:
- Backend API development
- Database design and optimization
- Code review and documentation

2. COMPENSATION
Client agrees to pay Consultant $150 per hour. Invoices are due within 30 days of receipt. Maximum monthly hours: 160

3. TERM
This Agreement begins on February 1, 2024 and continues for 12 months. Either party may terminate with 30 days written notice.

4. CONFIDENTIALITY
All project information is confidential for 2 years after termination.
Step 1: Chunking
[0] SERVICES AGREEMENT. This Agreement is entered into as of January 15, 2024.
[1] 1. SERVICES. The Consultant shall provide software development services including: Backend API development, Database design and optimization, Code review and documentation
[2] 2. COMPENSATION. Client agrees to pay Consultant $150 per hour. Invoices are due within 30 days of receipt. Maximum monthly hours: 160
[3] 3. TERM. This Agreement begins on February 1, 2024 and continues for 12 months. Either party may terminate with 30 days written notice.
[4] 4. CONFIDENTIALITY. All project information is confidential for 2 years after termination.
Step 2: Query + Retrieve + Generate
Q: What is the hourly rate?
Retrieved chunks: [2]
A: The hourly rate is $150 per hour, as specified in section 2 (COMPENSATION) of the agreement.
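
The same walkthrough, as a minimal retrieve-and-generate sketch. It assumes sentence-transformers for embeddings; the final LLM call is left as a prompt, since any chat model can fill that role.

from sentence_transformers import SentenceTransformer, util

chunks = [
    "SERVICES AGREEMENT. This Agreement is entered into as of January 15, 2024.",
    "1. SERVICES. The Consultant shall provide software development services ...",
    "2. COMPENSATION. Client agrees to pay Consultant $150 per hour. Invoices are due within 30 days of receipt. Maximum monthly hours: 160",
    "3. TERM. This Agreement begins on February 1, 2024 and continues for 12 months.",
    "4. CONFIDENTIALITY. All project information is confidential for 2 years after termination.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative embedding model
chunk_vectors = model.encode(chunks, convert_to_tensor=True)

question = "What is the hourly rate?"
query_vector = model.encode(question, convert_to_tensor=True)

# Retrieve: cosine similarity between the query and every chunk
scores = util.cos_sim(query_vector, chunk_vectors)[0]
best = int(scores.argmax())
print(f"Retrieved chunk [{best}]: {chunks[best]}")   # chunk [2], COMPENSATION

# Generate: stuff the retrieved chunk into the LLM prompt (LLM client not shown)
prompt = f"Answer using only this context:\n{chunks[best]}\n\nQuestion: {question}"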
3

Building Blocks Composition

Document QA combines three core building blocks from CodeSOTA. Each is independently optimizable.

Document to Structured
Parse PDFs, extract tables, run OCR on images
Examples: Mistral OCR, Azure Document Intelligence, Unstructured.io

Text to Vector
Convert text chunks into dense embeddings for similarity search
Examples: OpenAI Embeddings, Cohere Embed, BGE, E5

Text to Text (LLM)
Generate answers from retrieved context and user questions
Examples: GPT-4, Claude, Llama, Mistral

How the Building Blocks Connect

Document to Structured (PDF, Image -> Text)
-> Text to Vector (Chunks -> Embeddings)
-> Vector DB (Store + Retrieve)
-> Text to Text (Context + Q -> Answer)

Each building block can be swapped independently. Upgrade your embedding model without changing your LLM.
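
A small sketch of what that swappability looks like in code: each block sits behind a narrow interface, so the retrieval-and-generation logic never depends on a specific vendor. The interface names here (Embedder, Generator) are hypothetical.

from typing import Protocol

class Embedder(Protocol):
    """Any text-to-vector block: OpenAI Embeddings, Cohere Embed, BGE, E5, ..."""
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class Generator(Protocol):
    """Any text-to-text block: GPT-4, Claude, Llama, Mistral, ..."""
    def answer(self, question: str, context: str) -> str: ...

def rag_answer(question: str, chunks: list[str],
               embedder: Embedder, generator: Generator) -> str:
    # The pipeline only sees the two interfaces, so upgrading the embedding
    # model (or the LLM) is a one-line change at the call site.
    vectors = embedder.embed(chunks)
    query = embedder.embed([question])[0]
    best = max(range(len(chunks)),
               key=lambda i: sum(q * v for q, v in zip(query, vectors[i])))
    return generator.answer(question, chunks[best])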

4

Chunking Strategies

How you split documents dramatically affects retrieval quality. There is no one-size-fits-all approach.

1
Fixed Size
Split text every N characters/tokens with optional overlap
chunk_size=512, overlap=50
Pros
+ Simple to implement
+ Predictable chunk sizes
+ Works with any text
Cons
- Breaks mid-sentence
- Ignores document structure
- May split important context
Example Split
...end of paragraph.| Start of next...
2
Sentence-Based
Split on sentence boundaries, group until size limit
max_sentences=10
Pros
+ Preserves sentence integrity
+ More semantic coherence
+ Better for extraction
Cons
- Variable chunk sizes
- May still break paragraphs
- Sentence detection can fail
Example Split
Sentence 1. Sentence 2. | Sentence 3...
3
Semantic
Use embeddings to find natural topic boundaries
similarity_threshold=0.75
Pros
+ Respects topic changes
+ Better retrieval quality
+ Content-aware
Cons
- Computationally expensive
- Requires embedding model
- Complex to tune
Example Split
[Topic A content] | [Topic B content]
4
Document Structure
Use headers, sections, and document hierarchy
use_headers=True, max_section_size=1000
Pros
+ Preserves document logic
+ Natural for reports/papers
+ Maintains context
Cons
- Requires structured documents
- Varies by format
- Headers may be missing
Example Split
## Section 1\n Content | ## Section 2\n Content
5
Hierarchical
Create chunks at multiple levels (paragraph, section, document)
levels=['paragraph', 'section', 'document']
Pros
+ Enables multi-level retrieval
+ Best of all worlds
+ Rich context
Cons
- Most complex
- Higher storage costs
- Multiple indices to maintain
Example Split
Para -> Section -> Document
Chunking Best Practices
  • Start with 512-1000 tokens per chunk. Too small loses context, too large dilutes relevance.
  • Always use overlap (10-20%) to avoid splitting important context at boundaries.
  • Include metadata (section headers, page numbers) for better retrieval and citation.
  • For structured documents, prefer document-structure chunking over fixed size.
  • Test with your actual queries. The best strategy depends on your use case.
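
A minimal fixed-size chunker with overlap, illustrating the first strategy above and the overlap best practice. Token counting here is plain whitespace splitting; a real tokenizer would be used in practice.

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    tokens = text.split()
    step = chunk_size - overlap          # slide forward, keeping `overlap` tokens of context
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks

sample = "This Agreement is entered into as of January 15, 2024. " * 100
for chunk in chunk_text(sample, chunk_size=64, overlap=8):
    print(len(chunk.split()), chunk[:60], "...")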
5

Retrieval Methods

Finding the right chunks is critical. The best systems combine multiple retrieval approaches.

Dense Retrieval

Embed query and documents, find nearest neighbors in vector space

Mechanism:
query_vector = embed(question); results = vector_db.search(query_vector, k=5)
Pros
+ Semantic understanding
+ Handles paraphrasing
+ State-of-the-art quality
Cons
- Requires GPU for large scale
- Embedding model choice matters
- May miss exact matches
Sparse (BM25)

Traditional keyword matching with TF-IDF weighting

Mechanism:
score = sum(IDF(term) * TF(term, doc) for term in query)
Pros
+ Fast and efficient
+ Great for exact matches
+ No GPU needed
Cons
- No semantic understanding
- Misses synonyms
- Vocabulary mismatch
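
A small BM25 sketch using the rank_bm25 package (pip install rank-bm25); tokenization is plain lowercased splitting, which is enough to show both the strength and the vocabulary-mismatch weakness listed above.

from rank_bm25 import BM25Okapi

docs = [
    "Client agrees to pay Consultant $150 per hour.",
    "This Agreement begins on February 1, 2024 and continues for 12 months.",
    "All project information is confidential for 2 years after termination.",
]
bm25 = BM25Okapi([d.lower().split() for d in docs])

# Exact keyword overlap works well...
print(bm25.get_scores("pay per hour".split()))   # highest score on doc 0
# ...but a paraphrased query shares no terms with doc 0 and scores near zero.
print(bm25.get_scores("hourly rate".split()))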
Hybrid

Combine dense and sparse scores for best of both worlds

Mechanism:
final_score = alpha * dense_score + (1 - alpha) * sparse_score
Pros
+ Best retrieval quality
+ Handles both semantic and keyword
+ Industry standard
Cons
- More complex
- Two indices to maintain
- Fusion tuning needed
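
A sketch of the alpha-weighted fusion above. The two retrievers score on different scales, so both are min-max normalized before mixing; alpha = 0.5 is a common starting point and is tuned on held-out queries in practice.

import numpy as np

def hybrid_scores(dense: np.ndarray, sparse: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    def norm(x):
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return alpha * norm(dense) + (1 - alpha) * norm(sparse)

dense = np.array([0.82, 0.31, 0.67])   # cosine similarities from the vector index
sparse = np.array([4.2, 0.0, 7.9])     # raw BM25 scores for the same documents
print(hybrid_scores(dense, sparse))    # fused score per document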
Reranking

Retrieve candidates with fast method, then rerank with cross-encoder

Mechanism:
candidates = bm25.search(k=100); reranked = cross_encoder.rerank(query, candidates)
Pros
+ Highest accuracy
+ Cross-attention between query and doc
+ Fine-grained ranking
Cons
- Slower (two-stage)
- Cross-encoder is expensive
- Latency trade-off
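
A two-stage sketch of reranking with a cross-encoder from sentence-transformers; the first-stage candidates would normally come from BM25 or a vector index, and the model name is illustrative.

from sentence_transformers import CrossEncoder

query = "What is the hourly rate?"
candidates = [                                      # first-stage retrieval output
    "Client agrees to pay Consultant $150 per hour.",
    "This Agreement begins on February 1, 2024.",
    "All project information is confidential for 2 years.",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, passage) for passage in candidates])

# Keep only the top-scoring passages for the LLM context window
for passage, score in sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)[:2]:
    print(f"{score:.3f}  {passage}")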

Vector Databases for Production

Database | Type | Strengths | Scale | Pricing
Pinecone | Managed | Fully managed, fast, metadata filtering | Billions of vectors | Pay per use
Weaviate | Open Source / Managed | GraphQL API, hybrid search, modules | Millions to billions | Self-host free, managed paid
Qdrant | Open Source / Managed | Rust performance, filtering, payloads | Millions to billions | Self-host free, cloud paid
Chroma | Open Source | Simple API, embedded mode, Python-native | Millions | Free
pgvector | Open Source | PostgreSQL extension, familiar SQL | Millions | Free (use existing Postgres)
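
As a concrete starting point, the smallest option in the table runs in-process. A quick sketch with Chroma in embedded mode (it applies a default embedding function unless you supply one):

import chromadb

client = chromadb.Client()                         # in-memory, embedded mode
collection = client.create_collection("contracts")

collection.add(
    ids=["chunk-2", "chunk-3"],
    documents=[
        "2. COMPENSATION. Client agrees to pay Consultant $150 per hour.",
        "3. TERM. This Agreement begins on February 1, 2024.",
    ],
    metadatas=[{"section": "compensation"}, {"section": "term"}],
)

results = collection.query(query_texts=["What is the hourly rate?"], n_results=1)
print(results["documents"][0])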
6

Approaches: End-to-End vs RAG vs Long-Context

Three fundamentally different ways to build document QA. Choose based on your documents and requirements.

End-to-End (LayoutLM, Donut)

Single model reads document image and answers directly

Pros
+ Simpler architecture
+ No chunking needed
+ Understands layout
Cons
- Limited context window
- Cannot cite sources
- Less flexible
Best for: Forms, receipts, structured documents with visual layout
RAG Pipeline

Compose OCR + chunking + retrieval + LLM

Pros
+ Handles any document length
+ Citations/sources
+ Each component upgradeable
Cons
- More moving parts
- Retrieval quality critical
- Higher latency
Best for: Long documents, multi-document QA, enterprise search
Long-Context LLM

Feed entire document directly to LLM (Claude 200K, GPT-4 128K)

Pros
+ No chunking or retrieval
+ Sees full context
+ Simple pipeline
Cons
- Expensive at scale
- Attention dilution
- Context limit still exists
Best for: Single document QA, document length < 100K tokens

Models and Frameworks

Model/Framework | Type | Architecture | Context | Strengths
LayoutLMv3 | End-to-End | Multimodal Transformer | 512 tokens | Understands document layout, tables, forms
Donut | End-to-End | Vision Encoder-Decoder | Image-based | OCR-free, reads directly from pixels
Pix2Struct | End-to-End | Vision-Language | 4096 patches | Charts, infographics, screenshots
LlamaIndex | RAG Framework | Pipeline Orchestration | Configurable | Full RAG pipeline, many integrations
LangChain | RAG Framework | Pipeline Orchestration | Configurable | Flexible chains, wide ecosystem
Haystack | RAG Framework | Pipeline Orchestration | Configurable | Production-ready, enterprise features
Use End-to-End when:
  • Documents are short (1-2 pages)
  • Layout is important (forms, tables)
  • You need fast single-doc extraction
Use RAG when:
  • Many or long documents
  • Need citations/sources
  • Questions span multiple docs
Use Long-Context LLM when:
  • Single doc under 100K tokens
  • Simplicity matters most
  • Budget allows API costs
7

Benchmarks

Standard datasets for evaluating document QA systems.

Benchmark | Type | Size | Description | Metric | SOTA
DocVQA | Document Visual QA | 50K QA pairs | Questions about scanned documents | ANLS | Donut: 84.1%
InfographicsVQA | Infographic QA | 30K QA pairs | Questions about infographics and charts | ANLS | Pix2Struct: 42.5%
DUDE | Multi-page Document QA | 5K documents | Questions requiring multi-page reasoning | ANLS | GPT-4V: 53.7%
Natural Questions | Open Domain QA | 307K QA pairs | Real Google search questions | EM / F1 | FiD + RAG: 51.4 / 57.1
SQuAD 2.0 | Reading Comprehension | 150K QA pairs | Questions from Wikipedia paragraphs | EM / F1 | Human: 86.8 / 89.5
ANLS

Average Normalized Levenshtein Similarity. Converts the edit distance between prediction and ground truth into a similarity score, normalized by the length of the longer string; higher is better.

EM (Exact Match)

Binary metric: 1 if prediction exactly matches ground truth, 0 otherwise. Strict but clear evaluation.

F1 Score

Token-level overlap between prediction and ground truth. Harmonic mean of precision and recall.
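
For reference, EM and token-level F1 are easy to compute by hand. Normalization is simplified here to lowercasing; the standard SQuAD script also strips punctuation and articles.

from collections import Counter

def exact_match(prediction: str, truth: str) -> int:
    return int(prediction.strip().lower() == truth.strip().lower())

def f1(prediction: str, truth: str) -> float:
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("March 15, 2024", "March 15, 2024"))              # 1
print(round(f1("signed on March 15, 2024", "March 15, 2024"), 2))   # 0.75, partial credit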

8

Code Examples

From quick RAG setup to production-ready hybrid search.

LlamaIndex (Recommended)
pip install llama-index llama-index-llms-openai
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI

# 1. Load documents from a directory
# Supports PDF, DOCX, TXT, HTML, and more
documents = SimpleDirectoryReader("./data/contracts/").load_data()

# 2. Create vector index (chunks, embeds, and stores)
# Under the hood: chunking -> embedding -> vector store
index = VectorStoreIndex.from_documents(
    documents,
    transformations=[SentenceSplitter(chunk_size=512, chunk_overlap=50)],
)

# 3. Create query engine with citation
query_engine = index.as_query_engine(
    llm=OpenAI(model="gpt-4"),
    response_mode="compact",  # or "tree_summarize" for longer answers
    similarity_top_k=5        # retrieve top 5 chunks
)

# 4. Ask questions and get cited answers
response = query_engine.query(
    "What are the payment terms in the contract?"
)

print(f"Answer: {response.response}")
print(f"\nSources:")
for node in response.source_nodes:
    print(f"  - {node.node.metadata['file_name']}: {node.score:.3f}")
    print(f"    '{node.node.text[:100]}...'")

Quick Reference

Building Blocks
  • Document to Structured (OCR)
  • Text to Vector (embed)
  • Text to Text (LLM)
For Production
  • LlamaIndex or LangChain
  • Hybrid search (dense + BM25)
  • Pinecone or Qdrant
For Visual Docs
  • LayoutLMv3 for forms
  • Donut for receipts
  • Pix2Struct for charts
Key Decisions
  • Chunk size: 512-1000 tokens
  • Retrieval: hybrid for best quality
  • LLM: GPT-4 or Claude for generation
Key Takeaways
  1. Document QA = composition of building blocks (OCR + embed + retrieve + LLM)
  2. Chunking strategy significantly impacts retrieval quality
  3. Hybrid search (dense + sparse) beats either alone
  4. Choose end-to-end for forms, RAG for long docs, long-context LLM for simplicity

Use Cases

  • Contract analysis
  • Invoice querying
  • Form processing
  • Legal document review
  • Research paper Q&A

Architectural Patterns

Layout-Aware Transformers

Models that understand 2D document layout (LayoutLM).

Pros:
  • + Understands tables
  • + Position-aware
Cons:
  • - Needs layout annotations
  • - Fixed page size

VLM on Document Images

Treat documents as images, use vision-language models.

Pros:
  • + No OCR needed
  • + Handles any format
Cons:
  • - Resolution limits
  • - May miss small text

OCR + LLM

Extract text with OCR, query with LLM.

Pros:
  • + Simple pipeline
  • + Accurate text extraction
Cons:
  • - Loses layout
  • - OCR errors propagate

Implementations

API Services

GPT-4V

OpenAI
API

Direct document image understanding. Strong OCR.

Azure Document Intelligence

Microsoft
API

Layout extraction + custom QA training.

Open Source

LayoutLMv3

CC-BY-NC 4.0
Open Source

Best for structured documents. Layout + text + image.

Donut

MIT
Open Source

OCR-free document understanding. End-to-end.

DocVQA-BERT

CC-BY-4.0
Open Source

Fine-tuned LayoutLM for DocVQA.


Quick Facts

Input: Document
Output: Text
Implementations: 3 open source, 2 API
Patterns: 3 approaches

Have benchmark data?

Help us track the state of the art for document question answering.

Submit Results