Document Question Answering
Answer questions about document content including text, tables, and layouts. Essential for document AI.
How Document Question Answering Works
A technical deep-dive into Document Question Answering and RAG (Retrieval-Augmented Generation), and how to compose building blocks into a powerful document intelligence pipeline.
The Core Insight
Document QA is not a single model problem. It is a pipeline of composable building blocks.
Documents contain vast amounts of information, but finding specific answers is like searching for a needle in a haystack. Users want to ask questions in natural language and get precise, cited answers.
Break the problem into composable building blocks: process documents into text, chunk and embed that text, retrieve relevant passages, and generate answers with citations.
Document QA is not a single model but a pipeline. Each stage can be optimized independently. The magic happens when you compose the right building blocks together.
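The whole pipeline can be expressed as a handful of small functions composed in sequence. Below is a minimal sketch: the function names, the bag-of-words "embedding", and the placeholder generator are illustrative assumptions, not any specific library's API, chosen so the example runs with no external services.

```python
# Minimal pipeline sketch: process -> chunk -> embed -> retrieve -> generate.
# The embedding here is a toy bag-of-words vector; in practice you would swap in
# a real OCR step, embedding model, and LLM at each stage.
import math
from collections import Counter

def process_document(raw_text: str) -> str:
    # Stage 1: in a real system this is PDF parsing / OCR; here it is a pass-through.
    return raw_text

def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    # Stage 2: fixed-size chunks (in words) with overlap.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text: str) -> Counter:
    # Stage 3 (toy): bag-of-words "embedding"; replace with a real embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    # Stage 4a: nearest-neighbour search over chunk embeddings.
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def generate(question: str, context: list[str]) -> str:
    # Stage 4b: placeholder for an LLM call that answers from the retrieved context.
    return f"[LLM answer to {question!r} grounded in {len(context)} retrieved chunk(s)]"

text = process_document("The contract is payable within 30 days of invoice. Late payments accrue 2% interest per month.")
chunks = chunk(text, size=12, overlap=4)
print(generate("What are the payment terms?", retrieve("What are the payment terms?", chunks)))
```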
Extractive vs Abstractive QA
- Extractive QA: find and highlight the exact span in the document that answers the question (see the sketch below).
- Abstractive QA: generate a natural language answer based on retrieved context.
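Extractive QA is the easiest to try locally. A minimal sketch with the Hugging Face `question-answering` pipeline; the default checkpoint it downloads is a SQuAD-tuned model, and any extractive QA checkpoint can be substituted:

```python
from transformers import pipeline

# Extractive QA: the model returns a span (start/end offsets) from the context,
# not free-form text, so the answer is always a verbatim quote from the document.
qa = pipeline("question-answering")

result = qa(
    question="What are the payment terms?",
    context="Invoices are due within 30 days. Late payments accrue 2% monthly interest.",
)
print(result["answer"], result["score"], result["start"], result["end"])
```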
The RAG Pipeline
Retrieval-Augmented Generation: the standard architecture for document QA at scale.
[Pipeline diagram: Document QA walkthrough, from document to answer]
Building Blocks Composition
Document QA combines three core building blocks from CodeSOTA. Each is independently optimizable.
- Document to Structured (OCR): parse PDFs, extract tables, run OCR on images
- Text to Vector (Embedding): convert text chunks into dense embeddings for similarity search
- Text to Text (LLM): generate answers from retrieved context and user questions
How the Building Blocks Connect
Each building block can be swapped independently. Upgrade your embedding model without changing your LLM.
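Swappability follows from giving each block a narrow interface. Below is a sketch using Python `Protocol` classes; the class and method names are illustrative, not taken from any particular framework:

```python
from typing import Protocol
import math

class Embedder(Protocol):
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class Generator(Protocol):
    def generate(self, question: str, context: list[str]) -> str: ...

def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def answer(question: str, chunks: list[str], embedder: Embedder, generator: Generator, k: int = 5) -> str:
    # The pipeline depends only on the two interfaces above, so upgrading the
    # embedding model or swapping the LLM never touches this function.
    q_vec = embedder.embed([question])[0]
    chunk_vecs = embedder.embed(chunks)
    ranked = sorted(range(len(chunks)), key=lambda i: _cosine(q_vec, chunk_vecs[i]), reverse=True)
    return generator.generate(question, [chunks[i] for i in ranked[:k]])
```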
Chunking Strategies
How you split documents dramatically affects retrieval quality. There is no one-size-fits-all approach.
- Fixed-size chunking: `chunk_size=512, overlap=50`
- Sentence-based chunking: `max_sentences=10`
- Semantic chunking: `similarity_threshold=0.75`
- Document-structure chunking: `use_headers=True, max_section_size=1000`
- Hierarchical chunking: `levels=['paragraph', 'section', 'document']`

- Start with 512-1000 tokens per chunk. Too small loses context; too large dilutes relevance.
- Always use overlap (10-20%) to avoid splitting important context at boundaries (see the sketch after this list).
- Include metadata (section headers, page numbers) for better retrieval and citation.
- For structured documents, prefer document-structure chunking over fixed size.
- Test with your actual queries. The best strategy depends on your use case.
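A minimal sketch of fixed-size chunking with overlap and per-chunk metadata. Token counting is approximated by whitespace words here; a real system would use the tokenizer of its embedding model:

```python
def chunk_with_overlap(text: str, source: str, chunk_size: int = 512, overlap: int = 50) -> list[dict]:
    # Split on whitespace as a rough token proxy; swap in a real tokenizer in production.
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if not piece:
            break
        chunks.append({
            "text": " ".join(piece),
            "source": source,      # kept as metadata for citation
            "start_word": start,   # position metadata helps point back into the document
        })
        if start + chunk_size >= len(words):
            break
    return chunks

chunks = chunk_with_overlap("lorem ipsum " * 600, source="contract.pdf")
print(len(chunks), chunks[0]["start_word"], chunks[1]["start_word"])
```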
Retrieval Methods
Finding the right chunks is critical. The best systems combine multiple retrieval approaches.
- Dense retrieval: embed the query and documents, find nearest neighbors in vector space.
  `query_vector = embed(question); results = vector_db.search(query_vector, k=5)`
- Sparse retrieval (BM25): traditional keyword matching with TF-IDF weighting.
  `score = sum(IDF(term) * TF(term, doc)) for term in query`
- Hybrid search: combine dense and sparse scores for the best of both worlds (a fusion sketch appears after the vector database table below).
  `final_score = alpha * dense_score + (1 - alpha) * sparse_score`
- Reranking: retrieve candidates with a fast method, then rerank with a cross-encoder.
  `candidates = bm25.search(k=100); reranked = cross_encoder.rerank(query, candidates)`

Vector Databases for Production
| Database | Type | Strengths | Scale | Pricing |
|---|---|---|---|---|
| Pinecone | Managed | Fully managed, fast, metadata filtering | Billions of vectors | Pay per use |
| Weaviate | Open Source / Managed | GraphQL API, hybrid search, modules | Millions to billions | Self-host free, managed paid |
| Qdrant | Open Source / Managed | Rust performance, filtering, payloads | Millions to billions | Self-host free, cloud paid |
| Chroma | Open Source | Simple API, embedded mode, Python-native | Millions | Free |
| pgvector | Open Source | PostgreSQL extension, familiar SQL | Millions | Free (use existing Postgres) |
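Returning to hybrid search from the retrieval methods above: the fusion rule is a one-liner once both score sets exist, and the only subtlety is normalizing dense and sparse scores to a common range before mixing. A minimal sketch, with per-chunk scores assumed to be precomputed:

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    # Dense (cosine) and sparse (BM25) scores live on different scales,
    # so normalize each to [0, 1] before combining.
    lo, hi = min(scores.values()), max(scores.values())
    return {k: (v - lo) / (hi - lo) if hi > lo else 0.0 for k, v in scores.items()}

def hybrid_scores(dense: dict[str, float], sparse: dict[str, float], alpha: float = 0.5) -> dict[str, float]:
    d, s = min_max(dense), min_max(sparse)
    ids = set(d) | set(s)
    # final_score = alpha * dense_score + (1 - alpha) * sparse_score
    return {i: alpha * d.get(i, 0.0) + (1 - alpha) * s.get(i, 0.0) for i in ids}

dense = {"chunk_1": 0.82, "chunk_2": 0.76, "chunk_3": 0.40}
sparse = {"chunk_1": 3.1, "chunk_3": 7.8}
ranked = sorted(hybrid_scores(dense, sparse, alpha=0.6).items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```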
Approaches: End-to-End vs RAG vs Long-Context
Three fundamentally different ways to build document QA. Choose based on your documents and requirements.
- End-to-End: a single model reads the document image and answers directly.
- RAG: compose OCR + chunking + retrieval + LLM.
- Long-Context: feed the entire document directly to an LLM (Claude 200K, GPT-4 128K); sketched below.
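The long-context route needs the least machinery: read the document text and send it with the question in a single call. A minimal sketch using the OpenAI Python client; the model name and file path are placeholders, and any long-context chat model works the same way:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Long-context approach: no chunking or retrieval, the whole document is the context.
document_text = open("contract.txt", encoding="utf-8").read()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; pick any model whose context window fits the document
    messages=[
        {"role": "system", "content": "Answer strictly from the provided document. Quote the relevant passage."},
        {"role": "user", "content": f"Document:\n{document_text}\n\nQuestion: What are the payment terms?"},
    ],
)
print(response.choices[0].message.content)
```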
Models and Frameworks
| Model/Framework | Type | Architecture | Context | Strengths |
|---|---|---|---|---|
| LayoutLMv3 | End-to-End | Multimodal Transformer | 512 tokens | Understands document layout, tables, forms |
| Donut | End-to-End | Vision Encoder-Decoder | Image-based | OCR-free, reads directly from pixels |
| Pix2Struct | End-to-End | Vision-Language | 4096 patches | Charts, infographics, screenshots |
| LlamaIndex | RAG Framework | Pipeline Orchestration | Configurable | Full RAG pipeline, many integrations |
| LangChain | RAG Framework | Pipeline Orchestration | Configurable | Flexible chains, wide ecosystem |
| Haystack | RAG Framework | Pipeline Orchestration | Configurable | Production-ready, enterprise features |
Use End-to-End when (see the sketch after these lists):
- Documents are short (1-2 pages)
- Layout is important (forms, tables)
- You need fast single-doc extraction

Use RAG when:
- You have many or long documents
- You need citations/sources
- Questions span multiple docs

Use Long-Context when:
- A single doc fits under 100K tokens
- Simplicity matters most
- Budget allows API costs
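For the end-to-end route, a LayoutLM-family checkpoint can be tried in a few lines with the Hugging Face `document-question-answering` pipeline. A minimal sketch: the checkpoint name is one commonly used example, and the pipeline assumes Tesseract/pytesseract is installed for the OCR step:

```python
from transformers import pipeline

# End-to-end document QA: the model consumes the page image (plus OCR word boxes)
# and predicts the answer span directly, layout included.
doc_qa = pipeline(
    "document-question-answering",
    model="impira/layoutlm-document-qa",  # example checkpoint; swap in your own
)

result = doc_qa(image="invoice.png", question="What is the invoice total?")
print(result)  # e.g. a list of {"answer": ..., "score": ..., "start": ..., "end": ...}
```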
Benchmarks
Standard datasets for evaluating document QA systems.
| Benchmark | Type | Size | Description | Metric | SOTA |
|---|---|---|---|---|---|
| DocVQA | Document Visual QA | 50K QA pairs | Questions about scanned documents | ANLS | Donut: 84.1% |
| InfographicsVQA | Infographic QA | 30K QA pairs | Questions about infographics and charts | ANLS | Pix2Struct: 42.5% |
| DUDE | Multi-page Document QA | 5K documents | Questions requiring multi-page reasoning | ANLS | GPT-4V: 53.7% |
| Natural Questions | Open Domain QA | 307K QA pairs | Real Google search questions | EM / F1 | FiD + RAG: 51.4 / 57.1 |
| SQuAD 2.0 | Reading Comprehension | 150K QA pairs | Questions from Wikipedia paragraphs | EM / F1 | Human: 86.8 / 89.5 |
- ANLS: Average Normalized Levenshtein Similarity. Measures edit distance between prediction and ground truth, normalized by answer length.
- Exact Match (EM): binary metric, 1 if the prediction exactly matches the ground truth, 0 otherwise. Strict but clear.
- F1: token-level overlap between prediction and ground truth; the harmonic mean of precision and recall. All three are sketched below.
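All three metrics are short functions. A minimal sketch; the 0.5 ANLS cutoff follows the usual DocVQA-style protocol and should be treated as an assumption if your benchmark specifies otherwise:

```python
from collections import Counter

def exact_match(pred: str, truth: str) -> float:
    # EM: 1 if the normalized strings are identical, else 0.
    return float(pred.strip().lower() == truth.strip().lower())

def token_f1(pred: str, truth: str) -> float:
    # F1: harmonic mean of token-level precision and recall.
    p, t = pred.lower().split(), truth.lower().split()
    common = sum((Counter(p) & Counter(t)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(t)
    return 2 * precision * recall / (precision + recall)

def anls(pred: str, truth: str, tau: float = 0.5) -> float:
    # ANLS: 1 - normalized Levenshtein distance, zeroed below a threshold
    # (DocVQA-style evaluation commonly uses tau = 0.5).
    a, b = pred.strip().lower(), truth.strip().lower()
    prev = list(range(len(b) + 1))  # classic dynamic-programming edit distance
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    score = 1 - prev[-1] / max(len(a), len(b), 1)
    return score if score >= tau else 0.0

print(exact_match("30 days", "30 days"), token_f1("within 30 days", "30 days"), round(anls("30 dayz", "30 days"), 3))
```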
Code Examples
From quick RAG setup to production-ready hybrid search.
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI

# 1. Load documents from a directory
#    Supports PDF, DOCX, TXT, HTML, and more
documents = SimpleDirectoryReader("./data/contracts/").load_data()

# 2. Configure chunking, then build the vector index
#    Under the hood: chunking -> embedding -> vector store
Settings.chunk_size = 512
Settings.chunk_overlap = 50
index = VectorStoreIndex.from_documents(documents)

# 3. Create a query engine with citations
query_engine = index.as_query_engine(
    llm=OpenAI(model="gpt-4"),
    response_mode="compact",  # or "tree_summarize" for longer answers
    similarity_top_k=5,       # retrieve top 5 chunks
)

# 4. Ask questions and get cited answers
response = query_engine.query(
    "What are the payment terms in the contract?"
)
print(f"Answer: {response.response}")
print("\nSources:")
for node in response.source_nodes:
    print(f"  - {node.node.metadata['file_name']}: {node.score:.3f}")
    print(f"    '{node.node.text[:100]}...'")
```

Quick Reference
Building blocks:
- Document to Structured (OCR)
- Text to Vector (embed)
- Text to Text (LLM)

Recommended stack:
- LlamaIndex or LangChain
- Hybrid search (dense + BM25)
- Pinecone or Qdrant

End-to-end models:
- LayoutLMv3 for forms
- Donut for receipts
- Pix2Struct for charts

Defaults:
- Chunk size: 512-1000 tokens
- Retrieval: hybrid for best quality
- LLM: GPT-4 or Claude for generation

Key takeaways:
1. Document QA = composition of building blocks (OCR + embed + retrieve + LLM)
2. Chunking strategy significantly impacts retrieval quality
3. Hybrid search (dense + sparse) beats either alone
4. Choose end-to-end for forms, RAG for long docs, long-context LLM for simplicity
Use Cases
- ✓ Contract analysis
- ✓ Invoice querying
- ✓ Form processing
- ✓ Legal document review
- ✓ Research paper Q&A
Architectural Patterns
Layout-Aware Transformers
Models that understand 2D document layout (LayoutLM).
- Pros: understands tables; position-aware
- Cons: needs layout annotations; fixed page size
VLM on Document Images
Treat documents as images, use vision-language models.
- Pros: no OCR needed; handles any format
- Cons: resolution limits; may miss small text
OCR + LLM
Extract text with OCR, query with LLM.
- Pros: simple pipeline; accurate text extraction
- Cons: loses layout; OCR errors propagate
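A minimal sketch of the OCR + LLM pattern, using pytesseract for extraction and a chat model for the question. The model name and file path are placeholders; any OCR engine and LLM slot in here:

```python
from PIL import Image
import pytesseract
from openai import OpenAI

# Step 1: OCR the page image into plain text (layout is lost at this point,
# and any OCR errors will propagate into the answer).
page_text = pytesseract.image_to_string(Image.open("scanned_page.png"))

# Step 2: ask an LLM to answer from the extracted text.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system", "content": "Answer only from the provided OCR text. Say 'not found' if the answer is missing."},
        {"role": "user", "content": f"OCR text:\n{page_text}\n\nQuestion: What is the due date?"},
    ],
)
print(response.choices[0].message.content)
```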
Implementations
API Services
- GPT-4V (OpenAI): direct document image understanding. Strong OCR.
- Azure Document Intelligence (Microsoft): layout extraction + custom QA training.
Quick Facts
- Input: Document
- Output: Text
- Implementations: 3 open source, 2 API
- Patterns: 3 approaches