POLISH SOTA | OPEN SOURCE | R&D

Rys OCR

State-of-the-art Polish text recognition. Fine-tuned for correct handling of Polish diacritics (ą, ć, ę, ł, ń, ó, ś, ź, ż). First release for ongoing R&D.

First Fine-Tune Results
CER Reduction: 71.3% (5.58% to 1.60%)
WER Reduction: 46.1% (13.37% to 7.21%)
Training Images: 10k (synthetic Polish documents)

Model Architecture

Base Model: PaddleOCR-VL
Parent Base: ERNIE-4.5-0.3B
Method: LoRA (Low-Rank Adaptation)
LoRA Rank: 16
LoRA Alpha: 32
Target Modules: q_proj, k_proj, v_proj, o_proj
VRAM Required: 4-6 GB
License: Apache 2.0
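
To make the rank and alpha settings concrete, the sketch below computes the LoRA scaling factor and the number of trainable parameters the adapter adds for the four target modules of one attention block. The hidden size of 1024 and the square projection shapes are illustrative assumptions, not values read from the PaddleOCR-VL configuration; `LoraSettings` and `lora_params_per_module` are hypothetical helper names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Adapter hyperparameters as listed above."""
    rank: int = 16
    alpha: int = 32
    target_modules: tuple = ("q_proj", "k_proj", "v_proj", "o_proj")

    @property
    def scaling(self) -> float:
        # LoRA multiplies the low-rank update B @ A by alpha / rank.
        return self.alpha / self.rank


def lora_params_per_module(d_in: int, d_out: int, rank: int) -> int:
    # A has shape (rank, d_in) and B has shape (d_out, rank).
    return rank * (d_in + d_out)


settings = LoraSettings()
HIDDEN = 1024  # ASSUMPTION: illustrative hidden size, not the real model's

per_module = lora_params_per_module(HIDDEN, HIDDEN, settings.rank)
per_block = per_module * len(settings.target_modules)

print(f"scaling = {settings.scaling}")
print(f"trainable params per module = {per_module}")
print(f"trainable params per attention block = {per_block}")
```

With alpha set to twice the rank, the adapter update is scaled by 2.0 regardless of the model's dimensions; only the parameter counts depend on the assumed hidden size.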

Training Data

10,000 synthetic Polish document images across 7 categories:

Addresses, Invoice lines, Receipt lines, Dates, Names, Prices, Phrases

Training: 1 epoch, AdamW optimizer, linear LR schedule

Framework: PEFT 0.18.0 + Transformers
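
As a rough illustration of what one epoch with a linear schedule looks like, the sketch below computes the number of optimizer steps for 10,000 images and the learning rate at a few points along a linear decay to zero. The batch size of 8 and the peak learning rate of 2e-4 are hypothetical values chosen for the example; the source does not state either, and `linear_lr` is an illustrative helper rather than a library function.

```python
import math


def linear_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Learning rate that decays linearly from base_lr at step 0 to 0 at total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


NUM_IMAGES = 10_000   # from the training data description
BATCH_SIZE = 8        # ASSUMPTION: not stated in the source
BASE_LR = 2e-4        # ASSUMPTION: not stated in the source

# One epoch means each image is seen once.
total_steps = math.ceil(NUM_IMAGES / BATCH_SIZE)

for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, total_steps, BASE_LR))
```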

Benchmark Results

Metric | Baseline | Fine-tuned | Improvement
Character Error Rate (CER) | 5.58% | 1.60% | ↓ 71.3%
Word Error Rate (WER) | 13.37% | 7.21% | ↓ 46.1%
Exact Match | 74% | 76% | ↑ 2 pp

Key improvement: Resolved Polish diacritic confusion (ł, ę, ś, etc.)
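
The percentages above can be checked with a few lines of arithmetic. The sketch below uses the common edit-distance definitions of CER and WER and recomputes the relative reductions from the reported rates. The source does not say which tool or normalization was used for evaluation, so the `cer` and `wer` functions are an assumption about the metric definitions, not the project's evaluation script.

```python
def levenshtein(ref, hyp) -> int:
    """Edit distance between two sequences (insert, delete, substitute each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    # Character-level edits divided by reference length.
    return levenshtein(ref, hyp) / max(1, len(ref))


def wer(ref: str, hyp: str) -> float:
    # Word-level edits divided by the number of reference words.
    ref_words = ref.split()
    return levenshtein(ref_words, hyp.split()) / max(1, len(ref_words))


def relative_reduction(baseline: float, tuned: float) -> float:
    return (baseline - tuned) / baseline


# Recompute the reported reductions from the reported error rates.
cer_reduction = round(100 * relative_reduction(5.58, 1.60), 1)
wer_reduction = round(100 * relative_reduction(13.37, 7.21), 1)
print(cer_reduction, wer_reduction)

# A stripped-diacritic transcription is penalized per character.
print(cer("łódź", "lodz"), wer("łódź", "lodz"))
```

Exact Match, by contrast, counts a sample as correct only when the whole transcription equals the reference, so its change from 74% to 76% is an absolute gain of 2 percentage points.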

Quick Start

python
from transformers import AutoModelForCausalLM, AutoProcessor
from peft import PeftModel
from PIL import Image

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "PaddlePaddle/PaddleOCR-VL",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "anon13370/RysOCR")

processor = AutoProcessor.from_pretrained(
    "anon13370/RysOCR",
    trust_remote_code=True
)

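# Run inference
image = Image.open("your_document.png").convert("RGB")
prompt = "OCR: "

inputs = processor(images=image, text=prompt, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

outputs = model.generate(**inputs, max_new_tokens=256)

# generate() returns the prompt tokens followed by the new tokens;
# decode only the newly generated part so the prompt is not echoed.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
text = processor.decode(new_tokens, skip_special_tokens=True)
print(text)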

Help Build Polish OCR SOTA

This is the first fine-tune in ongoing R&D. We need your help to push Polish OCR to the next level.

Contribute Datasets

Real Polish documents needed: invoices, receipts, historical documents, handwritten notes, street signs.

  • Scanned documents with ground truth
  • Photos of Polish text in the wild
  • Historical Polish manuscripts
  • Specialized domain texts (medical, legal)

Run Benchmarks

Help us evaluate Rys OCR on more Polish-specific benchmarks and compare with other models.

  • Polish document benchmarks
  • Diacritic-specific test sets
  • Cross-model comparisons
  • Domain-specific evaluations
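
One simple way to build a diacritic-specific check is to flag words where the model output differs from the ground truth only by stripped or swapped diacritics. The sketch below is an illustrative assumption, not part of the Rys OCR tooling; `fold` and `diacritic_confusions` are hypothetical helper names. It uses an explicit mapping because ł has no Unicode decomposition, so accent-stripping through normalization alone would miss it.

```python
# Explicit fold table for the nine Polish diacritic letters, both cases.
POLISH_FOLD = str.maketrans("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ", "acelnoszzACELNOSZZ")


def fold(text: str) -> str:
    """Replace Polish diacritic letters with their plain ASCII base letters."""
    return text.translate(POLISH_FOLD)


def diacritic_confusions(ref: str, hyp: str) -> list:
    """Return (reference, hypothesis) word pairs that differ only in diacritics.

    ASSUMPTION: ref and hyp contain the same number of whitespace-separated
    words, so words can be compared position by position.
    """
    pairs = zip(ref.split(), hyp.split())
    return [(r, h) for r, h in pairs if r != h and fold(r) == fold(h)]


reference = "Zażółć gęślą jaźń"
hypothesis = "Zazolc gęślą jazn"
print(diacritic_confusions(reference, hypothesis))
```

Outputs with inserted or dropped words would need a proper alignment step first; the position-by-position comparison here only works for line-level samples with matching word counts.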

Join R&D

Collaborate on next iterations: architecture experiments, training strategies, deployment optimization.

  • Model architecture research
  • Training pipeline improvements
  • Edge deployment optimization
  • Multi-language expansion

Roadmap

1. v0.1 - First Fine-Tune
   10k synthetic images, LoRA on PaddleOCR-VL. 71% CER reduction.
2. v0.2 - Real Data
   Train on real Polish documents. Expand domain coverage.
3. v0.3 - Handwriting
   Add handwritten Polish text recognition capability.
4. v1.0 - Production Ready
   Full benchmark coverage, optimized inference, API deployment.

Known Limitations

  • Optimized for printed Polish text; handwritten recognition may vary
  • Best results on clean document scans
  • Requires loading both base model and LoRA weights for inference
  • Trained on synthetic data only (v0.1)
