Arabic OCR, measured by error.
KITAB-Bench is the MBZUAI benchmark for Arabic OCR and document understanding. It scores systems on character error rate against native-script test material — the one metric where the gap between frontier VLMs and classical OCR pipelines is still visible at a glance.
Character Error Rate, ranked.
Normalised Levenshtein distance between predicted and ground-truth Arabic text. (lower is better)
| # | Model | Character Error Rate | Verified | Source |
|---|---|---|---|---|
| 01 | gemini-2.0-flash | 0.13 | — | src |
| 02 | ain-7b | 0.20 | — | src |
| 03 | gpt-4o | 0.31 | — | src |
| 04 | gpt-4o-mini | 0.43 | — | src |
| 05 | azure-ocr | 0.52 | — | src |
| 06 | tesseract | 0.54 | — | src |
| 07 | easyocr | 0.58 | — | src |
| 08 | paddleocr | 0.79 | — | src |

All rows are non-API entries taken from the source; none have been independently verified yet.
Character error rate, nothing else.
KITAB-Bench reports character error rate (CER): the Levenshtein distance between the predicted transcription and the ground-truth text, normalised by the length of the ground-truth string. Lower is better, and in Arabic the gap between 0.13 (Gemini 2.0 Flash) and 0.79 (stock PaddleOCR) is not a rounding error: it is the difference between a usable pipeline and an unusable one.
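For concreteness, here is a minimal sketch of the metric itself, assuming the textbook definition (edit distance divided by the length of the ground-truth string). The benchmark's canonical scoring script may apply its own text normalisation before computing the distance; the names and the toy example below are illustrative only.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalised by the reference length."""
    if not reference:
        raise ValueError("empty reference string")
    return levenshtein(reference, hypothesis) / len(reference)

# Toy example: one dropped character in a 13-character Arabic reference
# gives CER of roughly 0.08, which is why small decimal gaps matter.
ref = "مرحبا بالعالم"
hyp = "مرحبا بالعلم"
print(f"CER = {cer(ref, hyp):.2f}")
```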
The benchmark rewards systems that handle right-to-left script, complex ligatures and diacritics in continuous text. Vendor APIs designed with Arabic as a first-class language dominate the top of the table.
8,809 samples, 9 domains.
KITAB-Bench, released by MBZUAI, contains 8,809 samples spread across 9 domains of Arabic text — printed documents, handwriting, scene text and more. It is the reference bench for any vendor that claims Arabic OCR support.
Upstream documentation sits at alphaXiv.
Reported, then reproduced.
Closed-API models are run against the public KITAB-Bench split through the vendor endpoint; open systems (Tesseract, PaddleOCR, EasyOCR) are executed locally with a pinned version. CER is computed with the canonical normalisation published alongside the benchmark.
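As a hedged sketch of what a local run for an open system can look like: the snippet below assumes the pytesseract wrapper with the Arabic traineddata pack (`lang="ara"`) and the `cer` helper from the jiwer package; the version pins are illustrative placeholders, and the real pins and canonical normalisation are documented at /methodology.

```python
# Illustrative reproduction sketch for an open system.
#   pip install pytesseract==0.3.10 jiwer==3.0.3
#   (plus a system tesseract binary with the "ara" traineddata pack)
from pathlib import Path

import pytesseract          # thin wrapper around the tesseract CLI
from PIL import Image
from jiwer import cer       # character error rate: edit distance / reference length

def score_sample(image_path: Path, reference_text: str) -> float:
    """Run Tesseract with the Arabic model and return CER against the ground truth."""
    prediction = pytesseract.image_to_string(Image.open(image_path), lang="ara")
    return cer(reference_text, prediction.strip())

# Averaging per-sample scores over the public split yields a table-style number;
# whether the benchmark averages per sample or pools edit operations is defined
# by its canonical scoring script, not by this sketch.
```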
Full reproduction policy: /methodology.