Codesota · OCR · Ship It

Implementation guide

You know the best OCR model.
Now ship it.

You've read the benchmarks. You know PaddleOCR beats Tesseract. Now what?

Most people stop at the comparison table. Here's how to have OCR running before your coffee gets cold.

FREE · BETA

Skip the install — try Hardparse

Our own hosted parsing API. Zero install, one curl, structured JSON out. Free while in beta.


Answer 3 questions. Get one answer.

No comparison table. Just the right tool for your situation.

What are you processing?

Where must your data stay?

Budget?

Implementation

FREE (BETA) · HOSTED API · ~0 min install, ~2 sec to run

Hardparse

Zero install. Hosted parsing API that combines structure-aware extraction (tables, reading order) with vision fallback for handwriting. Free while in beta — made by the CodeSOTA team.

Install

curl -X POST https://hardparse.com/api/parse -F "file=@doc.pdf"

Working code

# No install. Just curl.
curl -X POST https://hardparse.com/api/parse \
  -F "file=@your-document.pdf"

# Returns structured JSON: text, tables, layout, reading order.
# Python one-liner:
# requests.post("https://hardparse.com/api/parse", files={"file": open("doc.pdf","rb")}).json()

Expected output

{
  "text": "Invoice INV-2025-001...",
  "tables": [
    {
      "rows": [
        ["Description", "Qty", "Price", "Total"],
        ["Web Development", "40", "$150.00", "$6,000.00"],
        ["UI/UX Design", "20", "$125.00", "$2,500.00"]
      ]
    }
  ],
  "pages": 1,
  "reading_order": "preserved"
}
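
If the response has the shape shown above, the first row of each table can be treated as a header and the remaining rows turned into dictionaries. This is a minimal sketch that assumes the `tables[].rows` layout from the sample output holds for your documents. `table_to_records` is a hypothetical helper, not part of any Hardparse client.

```python
import json

# Sample payload copied from the expected output above.
SAMPLE = json.loads("""
{
  "text": "Invoice INV-2025-001...",
  "tables": [
    {"rows": [
      ["Description", "Qty", "Price", "Total"],
      ["Web Development", "40", "$150.00", "$6,000.00"],
      ["UI/UX Design", "20", "$125.00", "$2,500.00"]
    ]}
  ],
  "pages": 1,
  "reading_order": "preserved"
}
""")


def table_to_records(table):
    """Treat the first row as the header and map each later row onto it."""
    rows = table.get("rows", [])
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


records = table_to_records(SAMPLE["tables"][0])
for record in records:
    print(record["Description"], record["Total"])
```

Values stay as strings (for example "$6,000.00"), so parse amounts separately if you need numbers.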

Common gotcha

It's in beta: there is no SLA yet, and rate limits are generous but not unlimited. Good for prototypes, pilots, and moderate production volume. For 10k+ pages/day, talk to us or fall back to PaddleOCR.

FREE · LOCAL · ~3 min install, ~30 sec to run

PaddleOCR

99.6% accuracy on invoices. $0 cost. Runs on your machine. Apache 2.0 license.

Install

pip install paddlepaddle paddleocr

Working code

# pip install paddlepaddle paddleocr
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang='en')
result = ocr.predict('your-image.png')

for item in result:
    for text in item.get('rec_texts', []):
        print(text)

Expected output

INVOICE
Invoice #: INV-2025-001
Date: December 16, 2025
Bill To:
John Smith
123 Main Street
Web Development Services
40
$150.00
$6,000.00

Common gotcha

First run downloads ~150MB of model files. It will look hung for a minute while they download; that's normal. Subsequent runs are fast.

API · ~$0.01/page · ~1 min install, ~2 sec to run

GPT-5.4

Best for handwriting and understanding context. Preserves table structure. Handles cursive reliably. Successor to GPT-4o.

Install

pip install openai

Working code

# pip install openai
import base64
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY env var

with open('your-image.png', 'rb') as f:
    img = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
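    # This guide recommends GPT-5.4; "gpt-4o" is kept as originally written.
    # Swap in the model ID your account exposes.
    model="gpt-4o",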
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Extract all text from this image."},
        {"type": "image_url", "image_url": {
            "url": f"data:image/png;base64,{img}"
        }}
    ]}]
)

print(response.choices[0].message.content)

Expected output

INVOICE

Invoice #: INV-2025-001
Date: December 16, 2025

Bill To:
John Smith
123 Main Street

Description          Qty    Price       Total
Web Development      40     $150.00     $6,000.00
UI/UX Design         20     $125.00     $2,500.00

Common gotcha

You need an OPENAI_API_KEY environment variable set. Get one at platform.openai.com. Costs ~$0.01 per image.
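
The working code above hardcodes `image/png` in the data URL, which is only right for PNG input. A small sketch of one way to derive the MIME type from the file extension using the standard library; `to_data_url` is a hypothetical helper name, and the check that the type starts with `image/` is an assumption about what you want to send.

```python
import base64
import mimetypes


def to_data_url(path):
    """Build a base64 data URL whose MIME type matches the file extension."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image type: {path}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The return value can be passed as the `url` field of the `image_url` entry in place of the hand-built f-string.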

FREE · LOCAL · ~5 min install, ~10 sec to run

Docling

IBM's document understanding library. Structure-aware — preserves tables, headers, reading order. Best for PDFs and multi-page docs.

Install

pip install docling

Working code

# pip install docling
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("your-document.pdf")

print(result.document.export_to_markdown())

Expected output

# Invoice INV-2025-001

**Date:** December 16, 2025
**Bill To:** John Smith, 123 Main Street

| Description | Qty | Price | Total |
|---|---|---|---|
| Web Development | 40 | $150.00 | $6,000.00 |
| UI/UX Design | 20 | $125.00 | $2,500.00 |

**Subtotal:** $8,980.00

Common gotcha

Docling downloads large model files on first run (~1GB). Install can take 5+ minutes due to dependencies. Works best with PDFs — for raw images, use PaddleOCR instead.

FREE · LOCAL · ~2 min install, ~1 sec to run

Tesseract

The classic. Open source since 2005. Lowest accuracy of the tools here, but the quickest local install and the fastest to run. Good enough for clean printed text.

Install

# macOS: brew install tesseract
# Ubuntu: sudo apt install tesseract-ocr
pip install pytesseract pillow

Working code

# pip install pytesseract pillow
# Also install: sudo apt install tesseract-ocr
import pytesseract
from PIL import Image

image = Image.open('your-image.png')
text = pytesseract.image_to_string(image)
print(text)

Expected output

INVOICE
Invoice #: INV-2025-001
Date: December 16, 2025
Bill To:
John Smith
123 Main Street
San Francisco, CA 94102

Description Qty Price Total
Web Development Services 40 $150.00 $6,000.00

Common gotcha

You need the system-level tesseract binary installed separately from the Python package. pip install pytesseract alone won't work.
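
One way to catch this before calling pytesseract is to check that the binary is on your PATH. A minimal sketch using only the standard library; `find_tesseract` is a hypothetical helper name.

```python
import shutil


def find_tesseract():
    """Return the path to the tesseract binary, or None if it is not on PATH."""
    return shutil.which("tesseract")


path = find_tesseract()
if path is None:
    print("tesseract binary not found; install it with brew or apt first")
else:
    print(f"tesseract found at {path}")
```

If the binary lives somewhere that is not on PATH, pytesseract lets you point at it by setting `pytesseract.pytesseract.tesseract_cmd` to the full path.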

What If It Doesn't Work?

Every model has blind spots. Here's what to watch for.

Hardparse failure modes

Fails at	What happens	Use instead
Air-gapped / offline	Hosted API requires internet	PaddleOCR
Strict data residency	Files transit our servers during beta	Docling
Extreme scale (10k+/day)	Beta rate limits apply	PaddleOCR

PaddleOCR failure modes

Fails at	What happens	Use instead
Handwriting	Garbled output, low confidence	GPT-5.4
Complex tables	Loses structure, flat text output	Docling
Multi-page PDFs	Image-only, no page awareness	Docling

GPT-5.4 failure modes

Fails at	What happens	Use instead
High volume (10k+ pages)	$100+ cost, rate limits	PaddleOCR
Data privacy requirements	Data sent to OpenAI servers	PaddleOCR
Offline / air-gapped	No internet = no OCR	Tesseract
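
The "$100+" figure in the table follows from the ~$0.01 per page estimate quoted earlier. A back-of-the-envelope sketch, assuming that flat per-page rate; real pricing depends on image size and token usage, so treat the result as a rough lower bound.

```python
def estimate_cost(pages, cost_per_page=0.01):
    """Rough API spend in dollars for a given page count at a flat rate."""
    return round(pages * cost_per_page, 2)


for pages in (100, 10_000, 1_000_000):
    print(f"{pages:>9,} pages: ${estimate_cost(pages):,.2f}")
```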

Docling failure modes

Fails at	What happens	Use instead
Raw images / photos	Designed for documents, not scene text	PaddleOCR
Handwriting	Poor recognition on non-printed text	GPT-5.4
Speed-critical paths	Slower than alternatives (~10s/page)	Tesseract

Tesseract failure modes

Fails at	What happens	Use instead
Low-quality scans	Misreads characters, swaps similar glyphs	PaddleOCR
Tables / layout	No structure awareness at all	Docling
Non-Latin scripts	Requires separate language packs, still weak	PaddleOCR
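
The "Use instead" columns above amount to a fallback order: try one engine, and if it errors or returns nothing, try the next. A minimal sketch of that pattern, assuming you wrap each engine in a function that takes a path and returns text. `run_with_fallback` and the two stand-in engines are hypothetical names for illustration only.

```python
def run_with_fallback(path, engines):
    """Try each (name, extract) pair in order; return the first non-empty text.

    Returns (engine_name, text, errors), where errors lists the engines that
    were tried and failed before the one that succeeded.
    """
    errors = []
    for name, extract in engines:
        try:
            text = extract(path)
        except Exception as exc:  # keep going, but remember why it failed
            errors.append((name, str(exc)))
            continue
        if text and text.strip():
            return name, text, errors
        errors.append((name, "empty output"))
    raise RuntimeError(f"all engines failed for {path}: {errors}")


# Stand-in engines; swap in real wrappers around pytesseract, PaddleOCR,
# Docling, or an API call.
def flaky_engine(path):
    raise OSError("simulated failure")


def working_engine(path):
    return "INVOICE"


name, text, errors = run_with_fallback(
    "your-image.png",
    [("primary", flaky_engine), ("fallback", working_engine)],
)
print(name, text, errors)
```

Ordering the list by cost (local engines first, paid API last) keeps the expensive path for the documents that actually need it.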

Deep dive: Full failure mode taxonomy | Decision guide

§ 02 · Scale

Scale it.

You just built a prototype. Here's the path to production.

Prototype

What you just built

1-10 docs, manual

— Single file in, text out
— Run from terminal
— Eyeball the results

Batch

Add this next

100-10k docs, scripted

— Loop over a directory
— Try/except per file
— Log failures to CSV
— Confidence threshold filter

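import csv, os
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang='en')
log = []  # one [filename, error] row per failed file
for f in sorted(os.listdir('docs/')):
    try:
        result = ocr.predict(f'docs/{f}')
        # save to output/
    except Exception as e:
        log.append([f, str(e)])
with open('failures.csv', 'w', newline='') as fh:
    csv.writer(fh).writerows(log)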
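
The "confidence threshold filter" step from the list above could look roughly like this. It is a sketch that assumes each PaddleOCR result item exposes a `rec_scores` list aligned with `rec_texts`; check the keys your installed version returns. The 0.8 cutoff and the `filter_by_confidence` name are illustrative, not recommendations from the library.

```python
def filter_by_confidence(item, min_score=0.8):
    """Split recognized lines into kept and rejected by per-line score."""
    texts = item.get("rec_texts", [])
    scores = item.get("rec_scores", [])  # assumed key, aligned with rec_texts
    kept, rejected = [], []
    for text, score in zip(texts, scores):
        bucket = kept if score >= min_score else rejected
        bucket.append((text, float(score)))
    return kept, rejected


# Hand-written stand-in for one result item, not real OCR output.
sample = {"rec_texts": ["INVOICE", "lnv0ice #"], "rec_scores": [0.99, 0.42]}
kept, rejected = filter_by_confidence(sample)
print("kept:", kept)
print("needs review:", rejected)
```

Routing the rejected lines to a review queue or a second engine tends to be more useful than silently dropping them.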

Production

You need this at scale

10k+ docs, monitored