| 01 | Document OCR: Reading text, structure, and layout from document images. | OCRBench v2 public overall (flags: submetric, aging, ambiguous). Scope is public overall. Do not compare directly with English-private OCRBench v2 or full document parsing metrics. | Qwen2.5-VL-72B | 63.70 overall | 831 |
| 02 | Scene Text Detection: Detecting text regions in natural scene images. | COCO-Text detection, scope needs review (flags: misclassified, stale). CLIP4STR-style scene text recognition rows do not belong under detection. Detection needs region metrics such as precision, recall, F-measure, or hmean. | — | — | 581 |
| 03 | Document Parsing: Parsing document structure and content. | OmniDocBench v1.5 (flags: submetric, aging, ambiguous). Reading order is only one OmniDocBench facet. Summary SOTA needs text, layout, table TEDS, reading order, and end-to-end structure facets. | JT-OCR | 92.09 composite | 149 |
| 04 | Document Layout Analysis: Analyzing the layout structure of documents. | D4LA | DoPTA | 70.7% mAP | 133 |
| 05 | Scene Text Recognition: Recognizing text in natural scene images. | CUTE80 | CPPD | 99.7% accuracy | 127 |
| 06 | Object Detection: Object Detection is a computer vision task that involves identifying and localizing objects within an image. T… | Microsoft Common Objects in Context | ScyllaNet | 66.12 box mAP | 104 |
| 07 | Image Classification: Image Classification is a fundamental task in computer vision that aims to assign a label or class to an entir… | ImageNet Large Scale Visual Recognition Challenge 2012 | pMF-H + FD-loss | 72.0% pass@1 | 87 |
| 08 | Table Recognition: Detecting and parsing tables in documents. | ICDAR2013 table structure (legacy) (flags: legacy, ambiguous). ICDAR2013 is too narrow for 2026 table recognition. Promote PubTables-1M, PubTabNet, FinTabNet, or table-specific document parsing metrics. | Proposed System (With post-processing) | 95.46 f-measure | 71 |
| 09 | General OCR Capabilities: Comprehensive benchmarks covering multiple aspects of OCR performance. | OCRBench v2, needs coverage (flags: stale, ambiguous). Fold this into OCR unless the metric scope is explicit: public overall, English-private, recognition, understanding, or full parsing. | — | — | 70 |
| 10 | Document Image Classification: Classifying documents by type or category. | aip | ResNet-RS (ResNet-200 + RS training tricks) | 83.40 top-1-accuracy-verb | 63 |
| 11 | Handwriting Recognition: Recognizing handwritten text. | — | — | — | 40 |
| 12 | Document Understanding: Document understanding requires parsing visually rich documents — invoices, forms, scientific papers, tables —… | Form Understanding in Noisy Scanned Documents | — | — | 28 |
| 13 | Semantic Segmentation: Semantic segmentation assigns a class label to every pixel — the dense prediction problem that underpins auton… | ADE20K Scene Parsing Benchmark | InternImage-H | 62.9% mIoU | 24 |
| 14 | Video Classification: The task of classifying videos into predefined categories or classes. Video classification involves analyzing… | Kinetics-400 | DINOv3 (7B) | 88.2% accuracy | 13 |
| 15 | Image Segmentation: Image segmentation is a computer vision technique that divides a digital image into multiple parts or "segment… | — | — | — | 3 |
| 16 | Keypoint Detection: Keypoint detection localizes specific anatomical or structural landmarks — body joints, facial features, hand… | COCO Keypoints | ViTPose-G | 80.9% mAP | 1 |
| 17 | OCR: OCR, or Optical Character Recognition, is the task of converting an image containing text into machine-readabl… | — | — | — | 1 |
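
The Scene Text Detection row calls for region metrics such as precision, recall, F-measure, or hmean. As a rough sketch of how those quantities relate, assuming detections have already been matched to ground-truth regions (the matching rule, for example an IoU threshold, is not specified in the table, and the function name and counts below are hypothetical):

```python
def detection_scores(num_matched: int, num_detected: int, num_ground_truth: int) -> dict:
    """Return precision, recall, and hmean for a text detector.

    num_matched:      detections counted as correct (assumed pre-matched)
    num_detected:     all regions the detector produced
    num_ground_truth: all annotated text regions
    hmean is the harmonic mean of precision and recall (F-measure with beta = 1).
    """
    precision = num_matched / num_detected if num_detected else 0.0
    recall = num_matched / num_ground_truth if num_ground_truth else 0.0
    denom = precision + recall
    hmean = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "hmean": hmean}


# Illustrative counts only, not taken from any benchmark.
scores = detection_scores(num_matched=80, num_detected=100, num_ground_truth=160)
print(scores)
```

A recognition-style accuracy score has no counterpart for missed or spurious regions, which is why it cannot stand in for these metrics on a detection leaderboard.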