Arabic OCR, measured by error.
KITAB-Bench is the MBZUAI benchmark for Arabic OCR and document understanding. It scores systems on character error rate against native-script test material — the one metric where the gap between frontier VLMs and classical OCR pipelines is still visible at a glance.
Character Error Rate, ranked.
Normalised Levenshtein distance between predicted and ground-truth Arabic text. (lower is better)
| # | Model | Character Error Rate | Verified | Source |
|---|---|---|---|---|
| 01 | gemini-2.0-flash | 0.13 | — | src |
| 02 | ain-7b | 0.20 | — | src |
| 03 | gpt-4o | 0.31 | — | src |
| 04 | gpt-4o-mini | 0.43 | — | src |
| 05 | azure-ocr | 0.52 | — | src |
| 06 | tesseract | 0.54 | — | src |
| 07 | easyocr | 0.58 | — | src |
| 08 | paddleocr | 0.79 | — | src |

All rows are non-API entries taken from the source; none have been independently verified yet.
Character error rate, nothing else.
KITAB-Bench reports character error rate (CER): the Levenshtein distance between the predicted transcription and the ground-truth text, normalised by the length of the ground-truth string. Lower is better, and in Arabic the gap between 0.13 (Gemini 2.0 Flash) and 0.79 (stock PaddleOCR) is not a rounding error: it is the difference between a usable pipeline and an unusable one.
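For concreteness, here is a minimal sketch of the metric itself, assuming the textbook definition (edit distance divided by the length of the ground-truth string). The benchmark's canonical scoring script may apply its own text normalisation before computing the distance; the names and the toy example below are illustrative only.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalised by the reference length."""
    if not reference:
        raise ValueError("empty reference string")
    return levenshtein(reference, hypothesis) / len(reference)

# Toy example: one dropped character in a 13-character Arabic reference
# gives CER of roughly 0.08, which is why small decimal gaps matter.
ref = "مرحبا بالعالم"
hyp = "مرحبا بالعلم"
print(f"CER = {cer(ref, hyp):.2f}")
```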
The benchmark rewards systems that handle right-to-left script, complex ligatures and diacritics in continuous text. Vendor APIs designed with Arabic as a first-class language dominate the top of the table.
8,809 samples, 9 domains.
KITAB-Bench, released by MBZUAI, contains 8,809 samples spread across 9 domains of Arabic text — printed documents, handwriting, scene text and more. It is the reference bench for any vendor that claims Arabic OCR support.
Upstream documentation sits at alphaXiv.
Reported, then reproduced.
Closed-API models are run against the public KITAB-Bench split through the vendor endpoint; open systems (Tesseract, PaddleOCR, EasyOCR) are executed locally with a pinned version. CER is computed with the canonical normalisation published alongside the benchmark.
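As a hedged sketch of what a local run for an open system can look like: the snippet below assumes the pytesseract wrapper with the Arabic traineddata pack (`lang="ara"`) and the `cer` helper from the jiwer package; the version pins are illustrative placeholders, and the real pins and canonical normalisation are documented at /methodology.

```python
# Illustrative reproduction sketch for an open system.
#   pip install pytesseract==0.3.10 jiwer==3.0.3
#   (plus a system tesseract binary with the "ara" traineddata pack)
from pathlib import Path

import pytesseract          # thin wrapper around the tesseract CLI
from PIL import Image
from jiwer import cer       # character error rate: edit distance / reference length

def score_sample(image_path: Path, reference_text: str) -> float:
    """Run Tesseract with the Arabic model and return CER against the ground truth."""
    prediction = pytesseract.image_to_string(Image.open(image_path), lang="ara")
    return cer(reference_text, prediction.strip())

# Averaging per-sample scores over the public split yields a table-style number;
# whether the benchmark averages per sample or pools edit operations is defined
# by its canonical scoring script, not by this sketch.
```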
Full reproduction policy: /methodology.