Invoice Processing with Vision Language Models
Complete technical guide to extracting structured data from invoices using GPT-5.2, Claude Opus 4.5, Gemini 3 Pro, and open-source alternatives. Benchmarks, pricing, code, and production patterns.
6 models compared | Best DocVQA: 96.5% (Qwen3-VL) | Lowest cost: $0.0006/invoice (Gemini 2.5) | Last updated: Dec 2025
Executive Summary
Key Findings (December 2025)
1. Gemini 3 Pro leads OCR with 0.115 on OmniDocBench (lower is better) and performs 50% better on poor-quality document photos than its predecessors.
2. Open source beats proprietary: Qwen3-VL scores 96.5% on DocVQA, beating all API models including GPT-5.2.
3. Cost varies 75x between providers: Gemini 2.5 Pro at $0.0006/invoice vs Claude Opus 4.5 at $0.045/invoice.
4. GPT-5.2 halves vision errors vs GPT-5.1 on chart reasoning, and its 400K context window handles massive multi-page documents.
5. Claude Opus 4.5 for precision: best at zoom-level inspection and strict instruction following for complex workflows.
6. Self-hosting breaks even only at high volume: roughly 45K invoices/month against Claude Opus 4.5 pricing, but 300K-500K/month against GPT-5.2 or Gemini 3 Pro. Below those volumes, the Gemini 2.5/3 Pro APIs are cheaper than running your own infrastructure.
- Best accuracy: Qwen3-VL (96.5% DocVQA, self-hosted)
- Best OCR quality: Gemini 3 Pro (0.115 OmniDocBench, API)
- Best value: Gemini 2.5 Pro ($0.0006/invoice, 93% accuracy)
Model Comparison
Comprehensive comparison of Vision Language Models for invoice extraction, including benchmark scores, pricing, and deployment options.
| Model | DocVQA | OmniDocBench | CharXiv | Latency | Cost/Invoice | Type |
|---|---|---|---|---|---|---|
| GPT-5.2 (OpenAI) | 95.2% | 0.118 | 88.7% | 2-4s | $0.0060 | API |
| Gemini 3 Pro (Google) | 96.1% | 0.115 | 91.2% | 1-3s | $0.0040 | API |
| Claude Opus 4.5 (Anthropic) | 94.8% | 0.145 | 87.3% | 2-5s | $0.0450 | API |
| Gemini 2.5 Pro (Google) | 93.4% | 0.145 | 85.1% | 1-2s | $0.0006 | API |
| Qwen3-VL (Alibaba) | 96.5% | 0.108 | 89.4% | 3-8s (self-hosted) | $0.0020 | Open Source |
| Qwen2.5-VL-72B (Alibaba) | 96.4% | 0.112 | 88.9% | 3-8s (self-hosted) | $0.0020 | Open Source |
GPT-5.2
DocVQA score: 95.2% | Input: $1.75/1M tokens | Output: $14/1M tokens | Per invoice (avg): $0.0060
Strengths
- Halves error rates on chart/document reasoning vs GPT-5.1
- 400K context window, 128K output
- 30% reduction in hallucinations
- Cached input at $0.18/M (90% discount for RAG)
Weaknesses
- 1.4x cost increase over GPT-5.1
- Rate limits on vision requests
- No on-premise option
Best for: Production systems requiring highest accuracy with large docs
Gemini 3 Pro
DocVQA score: 96.1% | Input: $1.25/1M tokens | Output: $5/1M tokens | Per invoice (avg): $0.0040
Strengths
- Best-in-class OCR (0.115 OmniDocBench, lower is better)
- 50% better on poor-quality document photos
- Spatial reasoning beyond simple OCR
- Handles complex tables with merged headers
Weaknesses
- Raw output needs structuring for workflows
- PDF table extraction inconsistent on complex layouts
- Requires Google Cloud ecosystem
Best for: High-volume document processing, poor-quality scans
Claude Opus 4.5
DocVQA score: 94.8% | Input: $15/1M tokens | Output: $75/1M tokens | Per invoice (avg): $0.0450
Strengths
- Best for long documents and strict instruction following
- Zoom-level inspection and UI reading
- Fine-grained optical understanding
- Superior precision for tool-use workflows
Weaknesses
- Most expensive option (3-10x others)
- Overkill for simple extraction tasks
- No native JSON mode
Best for: Complex multi-page documents requiring reasoning
Gemini 2.5 Pro
DocVQA score: 93.4% | Input: $0.15/1M tokens | Output: $0.60/1M tokens | Per invoice (avg): $0.0006
Strengths
- Extremely cost-effective (10x cheaper than GPT-5.2)
- Fast inference speed
- Good accuracy for the price point
- Generous free tier
Weaknesses
- Lower accuracy than Gemini 3 Pro
- Less reliable on complex tables
- Being superseded by Gemini 3
Best for: Budget-conscious, high-volume processing
Benchmark Datasets
The scores above come from standardized academic benchmarks. Understanding what each benchmark tests helps evaluate which model fits your use case.
OmniDocBench
Comprehensive OCR benchmark (v1.5) - overall edit distance across diverse document types
DocVQA
Document Visual Question Answering - 50K questions on 12K document images
CharXiv
Scientific chart understanding and reasoning - 88K questions
SROIE
Scanned Receipts OCR and Information Extraction - 1,000 receipt images
Benchmark Limitations
- Benchmarks test general document understanding, not specifically invoice extraction accuracy on YOUR invoice formats.
- Real-world invoices have more variation (scan quality, languages, layouts) than benchmark datasets.
- Always run a pilot test on 100+ of your actual invoices before committing to a solution (see the evaluation sketch below).
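A minimal pilot-evaluation sketch for that last point. It assumes you have an extract_invoice() function like the ones in the Implementation Examples section below, plus a folder of hand-labeled ground-truth JSON files; the field names, file layout, and one-cent tolerance are illustrative assumptions, not a fixed methodology.
import json
from pathlib import Path
# Fields to score; adjust to whatever your ground-truth labels contain.
FIELDS = ["invoice_number", "date", "total", "currency"]
def score_pilot(invoice_dir: str, labels_dir: str) -> dict:
    """Compare model output against hand-labeled ground truth on a pilot set."""
    correct = {field: 0 for field in FIELDS}
    n = 0
    for label_path in Path(labels_dir).glob("*.json"):
        image_path = Path(invoice_dir) / f"{label_path.stem}.png"
        if not image_path.exists():
            continue
        truth = json.loads(label_path.read_text())
        predicted = extract_invoice(str(image_path))  # any extractor from this guide
        n += 1
        for field in FIELDS:
            if field == "total":
                # Numeric field: allow a one-cent tolerance for rounding.
                if abs(float(predicted.get(field, 0)) - float(truth.get(field, 0))) <= 0.01:
                    correct[field] += 1
            elif str(predicted.get(field, "")).strip() == str(truth.get(field, "")).strip():
                correct[field] += 1
    return {field: correct[field] / n for field in FIELDS} if n else {}
# Usage: per-field accuracy on your own documents, not a public benchmark.
# print(score_pilot("pilot_invoices/", "pilot_labels/"))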
Use Case Recommendations
Technology selection depends on your volume, budget, and compliance requirements. Here are tested recommendations for common scenarios.
Small Business / Startup
Recommended: Gemini 3 Pro
Best OCR accuracy, simple integration (~$4/month for 1K invoices)
Alternative: Gemini 2.5 Pro for even lower cost (~$0.60/month)
Mid-Market / Growth Stage
Recommended: Gemini 3 Pro
Best OCR, handles poor-quality scans. Use GPT-5.2 as fallback for complex reasoning.
Alternative: GPT-5.2 if you need 400K context for multi-page documents
Enterprise / High Volume
Recommended: Qwen3-VL (self-hosted)
Highest accuracy (96.5% DocVQA). Zero marginal cost. Full data control.
Alternative: Qwen2.5-VL-7B if you need a smaller variant for edge deployment
Regulated Industry (HIPAA/GDPR)
Recommended: Qwen3-VL or Qwen2.5-VL (self-hosted)
Data never leaves your infrastructure. Full compliance control.
Alternative: Azure OpenAI with private endpoints if cloud is acceptable
Implementation Examples
Production-ready code for each major platform. All examples include structured output parsing and error handling.
GPT-5.2 with Structured Outputs
Best for: Production systems requiring structured output
pip install openai pydantic

import json
import base64
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

# Define structured output schema
class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    total: float

class Vendor(BaseModel):
    name: str
    address: str | None
    tax_id: str | None

class InvoiceData(BaseModel):
    invoice_number: str
    date: str
    vendor: Vendor
    line_items: list[LineItem]
    subtotal: float
    tax_amount: float
    total: float
    currency: str
    confidence: float

def extract_invoice(image_path: str) -> InvoiceData:
    """Extract structured data from an invoice using GPT-5.2."""
    with open(image_path, 'rb') as f:
        image_data = base64.b64encode(f.read()).decode()

    # GPT-5.2 uses responses.parse() with structured outputs
    response = client.responses.parse(
        model="gpt-5.2",
        input=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_text",
                        "text": """Extract all invoice data. Verify math:
sum(line_items.total) should equal subtotal.
Set confidence 0-1 based on image quality and extraction certainty."""
                    },
                    {
                        "type": "input_image",
                        "image_url": f"data:image/png;base64,{image_data}"
                    }
                ]
            }
        ],
        text_format=InvoiceData,
        max_output_tokens=2000
    )
    return response.output_parsed

# Usage
invoice_data = extract_invoice("invoice.png")
print(invoice_data.model_dump_json(indent=2))

# Validate extraction
if invoice_data.confidence < 0.9:
    print("Low confidence - manual review recommended")

Claude 3.5 Sonnet
Best for: Multi-page documents, complex reasoning
pip install anthropic PyMuPDF

import anthropic
import base64
import json
from pathlib import Path

client = anthropic.Anthropic()

def extract_invoice(image_path: str) -> dict:
    """Extract structured data from an invoice using Claude 3.5 Sonnet."""
    with open(image_path, 'rb') as f:
        image_data = base64.b64encode(f.read()).decode()

    # Determine media type (PDFs are handled separately by extract_multipage_invoice below)
    suffix = Path(image_path).suffix.lower()
    media_types = {'.png': 'image/png', '.jpg': 'image/jpeg', '.jpeg': 'image/jpeg'}
    media_type = media_types.get(suffix, 'image/png')

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": """Analyze this invoice and extract all data as a JSON object.
Required fields:
- invoice_number (string)
- date (ISO 8601 format)
- vendor (object with name, address)
- line_items (array of objects with description, quantity, unit_price, total)
- subtotal (number)
- tax_amount (number)
- total (number)
- currency (3-letter code)
Reasoning process:
1. First identify all text regions
2. Categorize each region (header, line items, totals)
3. Extract and validate numerical values
4. Cross-check: sum of line items should equal subtotal
Return ONLY the JSON object, no explanation."""
                }
            ]
        }]
    )
    return json.loads(response.content[0].text)

# Multi-page invoice handling
def extract_multipage_invoice(pdf_path: str) -> dict:
    """Handle multi-page invoices using Claude's context window."""
    import fitz  # PyMuPDF

    doc = fitz.open(pdf_path)
    images = []
    for page in doc:
        pix = page.get_pixmap(dpi=150)
        img_bytes = pix.tobytes("png")
        images.append(base64.b64encode(img_bytes).decode())

    content = []
    for i, img_data in enumerate(images):
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": "image/png", "data": img_data}
        })
        content.append({"type": "text", "text": f"Page {i+1} of {len(images)}"})
    content.append({
        "type": "text",
        "text": "This is a multi-page invoice. Extract all data across all pages into a single JSON object."
    })

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4000,
        messages=[{"role": "user", "content": content}]
    )
    return json.loads(response.content[0].text)

Gemini 3 Pro
Best for: High-volume, cost-sensitive processing
pip install google-generativeai Pillow

import json
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")

def extract_invoice(image_path: str) -> dict:
    """Extract structured data from an invoice using Gemini 3 Pro."""
    # Gemini 3 Pro - best-in-class OCR (0.115 OmniDocBench)
    model = genai.GenerativeModel("gemini-3-pro")
    image = Image.open(image_path)

    prompt = """Extract invoice data as JSON with this schema:
{
  "invoice_number": string,
  "date": string,
  "vendor_name": string,
  "line_items": [{"description": string, "quantity": number, "unit_price": number, "total": number}],
  "subtotal": number,
  "tax_amount": number,
  "total": number,
  "currency": string
}
Use spatial reasoning to understand table structure. Handle merged headers.
Return ONLY valid JSON, no markdown code blocks."""

    response = model.generate_content([prompt, image])

    # Parse JSON (handle potential markdown wrapping)
    text = response.text.strip()
    if text.startswith('```'):
        text = text.split('\n', 1)[1].rsplit('\n', 1)[0]
    return json.loads(text)

# Batch processing with rate limiting
import time
from pathlib import Path

def process_batch(folder: str, output_file: str):
    """Process a folder of invoices with rate limiting."""
    results = []
    for i, path in enumerate(Path(folder).glob("*.png")):
        try:
            data = extract_invoice(str(path))
            data["_source_file"] = path.name
            results.append(data)
            print(f"Processed {i+1}: {path.name}")
        except Exception as e:
            results.append({"_source_file": path.name, "_error": str(e)})
        time.sleep(0.5)  # Rate limiting
    with open(output_file, 'w') as f:
        json.dump(results, f, indent=2)

process_batch("invoices/", "extracted_data.json")

Qwen3-VL (Self-Hosted)
Best for: Privacy requirements, high volume
pip install transformers torch accelerate vllm

import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import json

# Load Qwen3-VL (highest DocVQA: 96.5%)
# Use Qwen2.5-VL-7B for smaller deployments
model_name = "Qwen/Qwen3-VL-72B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

def extract_invoice(image_path: str) -> dict:
    """Extract structured data from an invoice using Qwen3-VL."""
    image = Image.open(image_path).convert("RGB")
    prompt = """<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>
Extract all invoice data as JSON:
{
  "invoice_number": string,
  "date": string,
  "vendor_name": string,
  "line_items": [{"description": string, "quantity": number, "unit_price": number, "total": number}],
  "subtotal": number,
  "tax_amount": number,
  "total": number
}
<|im_end|>
<|im_start|>assistant
"""
    inputs = processor(
        text=prompt,
        images=image,
        return_tensors="pt"
    ).to(model.device)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=2000,
            do_sample=False
        )
    output = processor.decode(output_ids[0], skip_special_tokens=True)

    # Extract the JSON object from the decoded response
    json_start = output.find('{')
    json_end = output.rfind('}') + 1
    return json.loads(output[json_start:json_end])

# Production deployment with vLLM for higher throughput
# pip install vllm
from vllm import LLM, SamplingParams

def create_production_pipeline():
    """High-throughput pipeline using vLLM."""
    llm = LLM(
        model="Qwen/Qwen2-VL-7B-Instruct",
        tensor_parallel_size=2,  # Use 2 GPUs
        max_model_len=4096
    )
    sampling_params = SamplingParams(temperature=0, max_tokens=2000)
    return llm, sampling_params
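One possible way to drive that vLLM pipeline, sketched under the assumption that the model is queried through vLLM's multimodal offline-inference interface (a prompt dict plus multi_modal_data); the placeholder prompt reuses the chat-template tokens from the transformers example above, and exact field names can vary across vLLM versions.
from PIL import Image
from vllm import LLM, SamplingParams

def extract_with_vllm(llm: LLM, sampling_params: SamplingParams, image_path: str) -> str:
    """Run one invoice through a vLLM-served Qwen VL model and return the raw generated text."""
    image = Image.open(image_path).convert("RGB")
    prompt = (
        "<|im_start|>user\n"
        "<|vision_start|><|image_pad|><|vision_end|>\n"
        "Extract all invoice data as JSON.<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    # vLLM's multimodal input: prompt text plus the attached PIL image.
    outputs = llm.generate(
        [{"prompt": prompt, "multi_modal_data": {"image": image}}],
        sampling_params=sampling_params,
    )
    return outputs[0].outputs[0].text

# Usage (assumes create_production_pipeline() from above):
# llm, sampling_params = create_production_pipeline()
# print(extract_with_vllm(llm, sampling_params, "invoice.png"))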
Production Patterns
Critical infrastructure for reliable invoice processing. These patterns apply regardless of which VLM you choose.
Validation Layer
Always validate extracted data before processing
def validate_invoice(data: dict) -> tuple[bool, list[str]]:
    """Validate extracted invoice data."""
    errors = []

    # Check required fields
    required = ['invoice_number', 'date', 'total', 'line_items']
    for field in required:
        if not data.get(field):
            errors.append(f"Missing required field: {field}")

    # Validate line item math
    if data.get('line_items') and data.get('subtotal'):
        calc_subtotal = sum(item.get('total', 0) for item in data['line_items'])
        if abs(calc_subtotal - data['subtotal']) > 0.01:
            errors.append(f"Line items ({calc_subtotal}) don't match subtotal ({data['subtotal']})")

    # Validate tax calculation
    if data.get('subtotal') and data.get('tax_amount') and data.get('total'):
        expected_total = data['subtotal'] + data['tax_amount']
        if abs(expected_total - data['total']) > 0.01:
            errors.append("Tax calculation mismatch")

    return len(errors) == 0, errors

Confidence Scoring
Implement confidence thresholds for human review
def should_flag_for_review(data: dict, thresholds: dict = None) -> tuple[bool, str]:
    """Determine if invoice needs human review."""
    thresholds = thresholds or {
        'min_confidence': 0.85,
        'max_amount': 10000,
        'required_fields_ratio': 0.9,
    }

    # Model-reported confidence
    if data.get('confidence', 1.0) < thresholds['min_confidence']:
        return True, "Low model confidence"

    # High-value invoices always reviewed
    if data.get('total', 0) > thresholds['max_amount']:
        return True, "High value invoice"

    # Missing critical fields
    required = ['invoice_number', 'date', 'vendor', 'total']
    present = sum(1 for f in required if data.get(f))
    if present / len(required) < thresholds['required_fields_ratio']:
        return True, "Missing critical fields"

    return False, ""

Batch Processing Pipeline
Async processing with retry logic
import asyncio
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=60))
async def process_invoice_with_retry(image_path: str) -> dict:
    """Process single invoice with exponential backoff retry."""
    return await asyncio.to_thread(extract_invoice, image_path)

async def process_batch(image_paths: list[str], max_concurrent: int = 5) -> list[dict]:
    """Process multiple invoices with controlled concurrency."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def process_with_limit(path):
        async with semaphore:
            try:
                result = await process_invoice_with_retry(path)
                result['_status'] = 'success'
                result['_source'] = path
            except Exception as e:
                result = {'_status': 'failed', '_source': path, '_error': str(e)}
            return result

    tasks = [process_with_limit(p) for p in image_paths]
    return await asyncio.gather(*tasks)

# Usage
results = asyncio.run(process_batch(["inv1.png", "inv2.png", "inv3.png"]))
success = [r for r in results if r['_status'] == 'success']
failed = [r for r in results if r['_status'] == 'failed']

Production Readiness Checklist
- [ ] Validation layer for math verification
- [ ] Confidence thresholds for human review
- [ ] Retry logic with exponential backoff
- [ ] Rate limiting to stay within API quotas
- [ ] Logging for debugging and audit trail
- [ ] Dead letter queue for failed extractions
- [ ] Monitoring for accuracy drift
- [ ] Fallback to secondary model on primary failure (see the sketch below)
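A minimal sketch of that last checklist item, assuming you have a primary and a secondary extractor with the same signature (for example, two of the extract_invoice functions above given distinct names); it reuses validate_invoice from the validation layer, and the extractor names in the usage comment are hypothetical.
def extract_with_fallback(image_path: str, primary, secondary) -> dict:
    """Try the primary extractor; fall back to the secondary if it errors or fails validation."""
    try:
        data = primary(image_path)
        valid, errors = validate_invoice(data)  # from the validation layer above
        if valid:
            data['_model'] = 'primary'
            return data
    except Exception:
        errors = ['primary extractor raised an exception']
    # Primary failed or produced inconsistent data: try the secondary model.
    data = secondary(image_path)
    data['_model'] = 'secondary'
    data['_primary_errors'] = errors
    return data

# Usage (hypothetical names for two extractors defined earlier):
# result = extract_with_fallback("invoice.png", extract_invoice_gemini, extract_invoice_gpt52)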
Cost Analysis
| Monthly Volume | GPT-5.2 | Claude Opus 4.5 | Gemini 3 Pro | Qwen (self-hosted) |
|---|---|---|---|---|
| 1,000 invoices | $6 | $45 | $4 | $2,000* |
| 10,000 invoices | $60 | $450 | $40 | $2,000* |
| 100,000 invoices | $600 | $4,500 | $400 | $2,000* |
| 1,000,000 invoices | $6,000 | $45,000 | $4,000 | $2,500* |
* Self-hosted costs include GPU infrastructure (2x A100) and engineering overhead. Actual costs vary based on cloud provider and utilization.
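A rough sketch of how per-invoice API cost follows from the per-token prices listed in the model cards; the token counts (about 2,000 input tokens for an invoice image plus prompt, and about 200 output tokens of JSON) are assumptions, so the estimates land near, but not exactly on, the table's figures.
def cost_per_invoice(input_price_per_m: float, output_price_per_m: float,
                     input_tokens: int = 2000, output_tokens: int = 200) -> float:
    """Estimate API cost per invoice from per-million-token prices (token counts are assumptions)."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

print(cost_per_invoice(1.75, 14))   # GPT-5.2      -> ~$0.0063
print(cost_per_invoice(1.25, 5))    # Gemini 3 Pro -> ~$0.0035
print(cost_per_invoice(15, 75))     # Claude Opus  -> ~$0.0450
print(cost_per_invoice(0.15, 0.6))  # Gemini 2.5   -> ~$0.0004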
Break-Even Analysis
Self-hosted Qwen3-VL becomes cost-effective at approximately:
- ~330K invoices/month vs GPT-5.2
- ~500K invoices/month vs Gemini 3 Pro
Note: This calculation assumes you have ML engineering expertise. If you need to hire, add $150K-250K/year in labor costs.
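The break-even points follow directly from the cost table: divide the fixed monthly self-hosting cost by the API cost per invoice. A small worked example using the $2,000/month infrastructure figure from the table above.
def break_even_volume(self_hosted_monthly: float, api_cost_per_invoice: float) -> int:
    """Monthly invoice volume above which self-hosting beats the API on cost."""
    return round(self_hosted_monthly / api_cost_per_invoice)

SELF_HOSTED_MONTHLY = 2000  # 2x A100 plus overhead, from the cost table

print(break_even_volume(SELF_HOSTED_MONTHLY, 0.0060))  # vs GPT-5.2       -> ~333,000
print(break_even_volume(SELF_HOSTED_MONTHLY, 0.0040))  # vs Gemini 3 Pro  -> 500,000
print(break_even_volume(SELF_HOSTED_MONTHLY, 0.0450))  # vs Claude Opus   -> ~44,000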
Related Resources
Traditional OCR for Invoices
When VLMs are overkill: PaddleOCR, Tesseract, and traditional extraction pipelines.
Document to Structured Data
Building block overview with implementations and architectural patterns.
Executive: Document Processing Selection
CTO/CIO guide to vendor selection, build vs buy, and technology roadmap.
Document Understanding Benchmarks
Full benchmark results for DocVQA, SROIE, CORD, and other document datasets.
Need Help Choosing?
We provide technical evaluations on your actual documents. See how each model performs on your invoice formats before committing.
Last updated: December 2025 | Based on benchmark data and production testing