OCR
OCR, or Optical Character Recognition, is the task of converting an image containing text into machine-readable, editable, and searchable digital text. It turns scanned documents, photos, or image-only PDFs from a static visual format into text that can be edited, searched, or used for data entry and other applications. Examples include digitizing receipts for a banking app, translating signs with Google Translate, and building searchable archives from old documents.
OCR is a key task in computer vision. Below you will find the standard benchmarks used to evaluate models, along with current state-of-the-art results.
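Many OCR benchmarks score predictions against ground-truth transcriptions with string-similarity metrics such as character error rate (CER): the character-level edit distance divided by the reference length. A minimal illustrative sketch in plain Python (not the scoring code of any benchmark listed below):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # Dynamic-programming edit distance over characters, row by row.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)

# cer("kitten", "sitting") -> 3 edits over 6 reference chars = 0.5
```

A CER of 0.0 means a perfect transcription; values above 1.0 are possible when the hypothesis is much longer than the reference.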
Benchmarks & SOTA
OCRBench
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models
OCRBench is a comprehensive evaluation benchmark designed to assess the OCR capabilities of Large Multimodal Models. It comprises five components: Text Recognition, SceneText-Centric VQA, Document-Oriented VQA, Key Information Extraction, and Handwritten Mathematical Expression Recognition. The benchmark includes 1000 question-answer pairs, and all the answers undergo manual verification and correction to ensure a more precise evaluation.
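With 1,000 question-answer pairs, each contributing one point, the maximum OCRBench score is 1000. A common way to judge such QA pairs is to check whether an accepted ground-truth answer appears in the model's response; the sketch below illustrates that style of matching and is an assumption, not the official evaluation script:

```python
def score_qa(predictions: list[str], answers: list[list[str]]) -> int:
    """One point per question if any accepted answer occurs (case-insensitive)
    as a substring of the model's prediction. Illustrative only."""
    score = 0
    for pred, accepted in zip(predictions, answers):
        pred_norm = pred.strip().lower()
        if any(ans.strip().lower() in pred_norm for ans in accepted):
            score += 1
    return score

# score_qa(["The sign says STOP"], [["stop"]]) -> 1
```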
State of the Art
HunyuanOCR (1B) — score: 860
OmniDocBench v1.0
OmniDocBench is a benchmark featuring high-quality annotations across nine document sources, including academic papers, textbooks, and more challenging cases such as handwritten notes and densely typeset newspapers. OmniDocBench supports flexible, multi-level evaluations, ranging from end-to-end assessment to task-specific and attribute-based analysis using 19 layout categories and 15 attribute labels.
No results tracked yet
OmniDocBench v1.5
OmniDocBench v1.5 is an expansion of version v1.0, adding 374 new documents for a total of 1,355 document pages. It features a more balanced distribution of data in both Chinese and English, as well as a richer inclusion of formulas and other elements. The evaluation method has been updated, with formulas assessed using the CDM method. The overall metric is a weighted combination of the metrics for text, formulas, and tables.
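The overall OmniDocBench v1.5 metric combines the text, formula, and table metrics. A generic weighted-combination sketch (the weights here are illustrative placeholders, not the benchmark's actual values):

```python
def overall_score(text: float, formula: float, table: float,
                  weights: tuple[float, float, float] = (0.5, 0.25, 0.25)) -> float:
    """Weighted combination of per-element metrics, each assumed in [0, 1].
    The default weights are placeholders for illustration only."""
    w_text, w_formula, w_table = weights
    total = w_text + w_formula + w_table
    return (w_text * text + w_formula * formula + w_table * table) / total
```

Normalizing by the weight sum keeps the combined score on the same [0, 1] scale as its inputs regardless of the weights chosen.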
No results tracked yet
olmOCR-Bench
olmOCR-bench is an evaluation dataset of 1,403 PDF files designed to test how well Optical Character Recognition (OCR) systems can convert PDFs into clean markdown, especially preserving complex structures like tables, equations, and natural reading order. It's used in conjunction with the olmOCR toolkit, an open-source tool for accurate PDF-to-text conversion that uses a vision language model.
No results tracked yet
Fox (English subset, 600-1300 text tokens)
Fox — English subset (pages with 600–1300 text tokens)
English subset of the Fox benchmark for fine-grained multi-page document understanding (PDF page images plus page-level annotations). The Fox benchmark was introduced in the paper "Focus Anywhere for Fine-grained Multi-page Document Understanding" (arXiv:2405.14295); this English subset (Fox-Page-En) contains PDF page images and OCR/annotation files drawn from it for page-level evaluation.

In the paper, the authors report experiments on pages with 600–1300 text tokens (documents tokenized with the DeepSeek-OCR tokenizer, vocab ~129k), selecting 100 pages in that range for a particular evaluation, and report precision values for different numbers of vision tokens (e.g., 64 and 100) across token bins. The Fox project provides code and benchmark data via its GitHub repository (https://github.com/ucaslcl/Fox). A Hugging Face-hosted subset (EduardoPacheco/Fox-Page-En) offers convenient access; the HF subset page and an associated issue indicate it contains 112 English page samples, as maintained by the HF contributor.

Use cases: page-level and region-level OCR, region-level summarization/translation, and other fine-grained document understanding evaluations for LVLMs.
No results tracked yet
Related Tasks
Few-Shot Image Classification
Image classification with limited labeled examples per class (few-shot learning). Models are evaluated on their ability to classify images into categories with only a handful of training examples (typically 1-10) per class.
Open-Vocabulary Object Detection
Object detection with an open vocabulary: detecting objects from arbitrary text descriptions rather than a fixed set of categories.
Object counting
Object counting in AI is a computer vision task that uses machine learning and image processing to identify and enumerate distinct objects within digital images and videos. It can differentiate between various object types, sizes, and shapes, even in crowded or dynamically changing scenes. The process typically involves object detection using deep learning models like convolutional neural networks (CNNs) to recognize and localize objects, followed by aggregation to provide a total count. This technology is applied in fields like manufacturing for quality control and production monitoring.
Video segmentation
Video segmentation is the task of partitioning video frames into multiple segments or objects. Unlike image segmentation which works on static images, video segmentation tracks objects across frames in a video sequence.