Codesota · OCR · Benchmark · KITAB-Bench · 8 scored runs · 8 distinct models · Updated 2026-04-20
§ 00 · Opening

Arabic OCR, measured by error.

KITAB-Bench is the MBZUAI benchmark for Arabic OCR and document understanding. It scores systems on character error rate against native-script test material — the one metric where the gap between frontier VLMs and classical OCR pipelines is still visible at a glance.

§ 01 · Leaderboard · Character Error Rate

Character Error Rate, ranked.

Normalised Levenshtein distance between predicted and ground-truth Arabic text (lower is better).

#    Model             CER    Verified                  Source
01   gemini-20-flash   0.13   Non-API entry from src    src
02   ain-7b            0.20   Non-API entry from src    src
03   gpt-4o            0.31   Non-API entry from src    src
04   gpt-4o-mini       0.43   Non-API entry from src    src
05   azure-ocr         0.52   Non-API entry from src    src
06   tesseract         0.54   Non-API entry from src    src
07   easyocr           0.58   Non-API entry from src    src
08   paddleocr         0.79   Non-API entry from src    src
Fig · 8 results on Character Error Rate. Rows sourced from benchmarks.json; row 01 (gemini-20-flash) is the current SOTA.

§ What it measures

Character error rate, nothing else.

KITAB-Bench reports character error rate (CER): the Levenshtein distance between the predicted transcription and the ground-truth text, normalised by the length of the ground-truth string. Lower is better, and in Arabic the gap between 0.13 (Gemini 2.0 Flash) and 0.79 (stock PaddleOCR) is not a rounding error: it is the difference between a usable pipeline and an unusable one.
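For concreteness, here is a minimal CER computation in Python. The dynamic-programming Levenshtein routine is standard; the exact normalisation KITAB-Bench applies (Unicode form, whitespace handling) may differ in detail, so read this as a sketch of the metric, not the benchmark's scoring code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert/delete/substitute, unit cost)."""
    # One-row DP: prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not reference:
        raise ValueError("empty reference")
    return levenshtein(prediction, reference) / len(reference)

# Toy example with Arabic text (a diacritic counts as a full character):
print(cer("كتاب", "كِتاب"))  # one missing kasra -> 1/5 = 0.2
```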

The benchmark rewards systems that handle right-to-left script, complex ligatures, and diacritics in continuous text. Vendor APIs and VLMs that treat Arabic as a first-class language dominate the top of the table.
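One practical consequence: whether diacritics (tashkeel) are stripped before scoring moves CER substantially, since each short-vowel mark is a full character under Levenshtein. The snippet below illustrates that sensitivity using the `cer` helper sketched above; it is not part of the benchmark's published normalisation.

```python
import unicodedata

def strip_tashkeel(text: str) -> str:
    # Drop combining marks (Unicode category Mn), which covers Arabic
    # short vowels, shadda, sukun, and tanween.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

reference  = "كِتَابٌ"  # fully vocalised form
prediction = "كتاب"     # undiacritised output, typical of OCR engines

print(cer(prediction, reference))   # 3/7 ≈ 0.43, penalised per missing mark
print(cer(strip_tashkeel(prediction),
          strip_tashkeel(reference)))  # 0.0 once diacritics are stripped
```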

§ Dataset details

8,809 samples, 9 domains.

KITAB-Bench, released by MBZUAI, contains 8,809 samples spread across 9 domains of Arabic text — printed documents, handwriting, scene text and more. It is the reference bench for any vendor that claims Arabic OCR support.
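A per-domain breakdown is the natural way to consume a 9-domain benchmark. The loop below assumes a hypothetical local manifest (one JSON object per line with `domain`, `image`, and `text` fields); the official distribution format may differ, and `cer` is the helper sketched above.

```python
import json
from collections import defaultdict

def per_domain_cer(manifest_path: str, ocr_fn) -> dict[str, float]:
    """Mean CER per domain over a JSONL manifest (hypothetical layout)."""
    totals = defaultdict(lambda: [0.0, 0])  # domain -> [sum of CERs, count]
    with open(manifest_path, encoding="utf-8") as fh:
        for line in fh:
            sample = json.loads(line)
            score = cer(ocr_fn(sample["image"]), sample["text"])
            totals[sample["domain"]][0] += score
            totals[sample["domain"]][1] += 1
    return {domain: s / n for domain, (s, n) in totals.items()}
```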

Upstream documentation sits at alphaXiv.

§ How scores are verified

Reported, then reproduced.

Closed-API models are run against the public KITAB-Bench split through the vendor endpoint; open systems (Tesseract, PaddleOCR, EasyOCR) are executed locally with a pinned version. CER is computed with the canonical normalisation published alongside the benchmark.
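For the open engines, local reproduction reduces to three steps: record the pinned engine version, transcribe with Arabic enabled, and score with the shared CER routine. A minimal Tesseract sketch using the `pytesseract` wrapper follows; the file paths are placeholders and `cer` is the helper sketched earlier.

```python
from PIL import Image
import pytesseract

# Record the engine build in the run log; mismatched versions shift CER.
print(pytesseract.get_tesseract_version())

image = Image.open("sample.png")  # placeholder path
reference = open("sample.txt", encoding="utf-8").read().strip()  # placeholder

# "ara" selects the Arabic traineddata model.
prediction = pytesseract.image_to_string(image, lang="ara").strip()
print(cer(prediction, reference))
```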

Full reproduction policy: /methodology.

§ Final · Related OCR benchmarks

Cross-links, sibling leaderboards.