Document OCR
Reading text, structure, and layout from document images.
- Benchmark: OCRBench v2
- Leader: Qwen2.5-VL-72B · 63.70 overall
- Trust: Unknown
Find the benchmark evidence behind any AI capability.
Start with a capability — OCR, coding agents, ASR, RAG, VQA, forecasting — then inspect the benchmark, metric, leading model, date, and trust grade.
SOTA is not a model. It is a claim about a model on a benchmark. This view currently surfaces 62 task pages across nine capability areas.
What are you trying to evaluate?
- Reading text, structure, and layout from document images.
- Analyzing the layout structure of documents.
- Parsing document structure and content.
- Detecting and parsing tables in documents.
- Recognizing handwritten text.
- Document QA and extraction beyond plain OCR; inspect layout and parsing evidence separately.
Start with the capability. Then inspect the evidence.
A task describes what the system must do: OCR, code generation, ASR, retrieval, VQA, detection. Find benchmark →
A benchmark supplies evidence: metric definition, result counts, source quality, trust grade, and current top rows. Open leaderboards →
A lineage tells you whether a benchmark is canonical, saturated, stale, contested, or replaced by a harder successor. View evolution →
Choose the area closest to the decision you are making. Domains, modalities, methods, and safety properties are overlays — not separate root taxonomies.
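A minimal sketch of that navigation model, with illustrative names rather than Codesota's actual schema: a task, the benchmarks that evidence it, and a lineage status per benchmark.

```python
from dataclasses import dataclass, field
from enum import Enum

class LineageStatus(Enum):
    CANONICAL = "canonical"   # the accepted first benchmark for the task
    SATURATED = "saturated"   # top scores have plateaued near the ceiling
    STALE = "stale"           # no credible new results being registered
    CONTESTED = "contested"   # validity or scoring is disputed
    REPLACED = "replaced"     # superseded by a harder successor

@dataclass
class Benchmark:
    name: str                 # e.g. "OCRBench v2" (example from this page)
    metric: str               # e.g. "overall", "accuracy", "wer"
    higher_is_better: bool    # metric direction, declared on the dataset
    status: LineageStatus

@dataclass
class Task:
    name: str                 # what the system must do, e.g. "Document OCR"
    benchmarks: list[Benchmark] = field(default_factory=list)
```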
Retrieval, QA, reasoning, factuality, text classification, and knowledge-heavy language tasks.
Use when: evaluating answers grounded in text, documents, memory, retrieval, or general language knowledge.
Common trap: chat preference is not retrieval quality; use task-specific evidence for RAG, QA, and factuality.
OCR, layout, tables, parsing, detection, segmentation, and visual extraction.
Use when: evaluating document AI or visual extraction systems.
Common trap: OCR is not document understanding; layout, tables, handwriting, and VQA need separate evidence.
ASR, TTS, speaker intelligence, music, sound events, and audio-language tasks.
Use when: choosing speech, voice, transcription, or audio understanding systems.
Common trap: low WER does not guarantee diarization, latency, speaker identity, or noisy-call performance.
VQA, image-text retrieval, video QA, document VQA, image generation, and editing.
Use when: the input or output crosses text, image, video, audio, or document boundaries.
Common trap: one multimodal score can hide failures in OCR-heavy, spatial, chart, or video tasks.
Code generation, repair, repository work, tests, security, and UI/mobile code.
Use when: evaluating software output, coding assistants, or repository-level model behavior.
Common trap: HumanEval-style synthesis is not the same as issue resolution or production engineering.
Tool calling, web agents, OS tasks, long-horizon autonomy, and workflow execution.
Use when: the model must plan, use tools, recover from errors, or act over multiple steps.
Common trap: agent benchmark wins may depend on scaffolding, tools, browser setup, and budget, not only the base model.
Tables, tabular prediction, time series, anomalies, recommenders, graphs, and optimization.
Use when: choosing evidence for numerical, temporal, graph, or business-data decisions.
Common trap: forecasting and tabular claims are sensitive to split design, leakage, and horizon definition.
Game play, continuous control, navigation, manipulation, VLA models, drones, and driving.
Use when: evaluating embodied systems, control policies, or simulation-to-real claims.
Common trap: simulator scores rarely transfer cleanly to hardware without environment and safety evidence.
Medical, clinical, scientific, industrial, legal, finance, climate, and compliance AI.
Use when: the decision depends on a regulated domain, specialized data, or high-cost errors.
Common trap: general benchmarks can miss domain drift, licensing, annotation quality, and clinical workflow constraints.
Language understanding, retrieval, QA, RAG, factuality, and knowledge extraction. Reasoning appears here as a capability tag, not as a separate root.
Current registered leader means the best result currently registered for this benchmark and metric. It does not mean the universally best model.
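A minimal sketch of that rule, assuming only a list of registered results and the benchmark's declared metric direction; the field names are illustrative, not Codesota's API.

```python
from dataclasses import dataclass

@dataclass
class Result:
    model: str
    score: float

def registered_leader(results: list[Result], higher_is_better: bool) -> Result | None:
    """Best *registered* result for one benchmark and one metric."""
    if not results:
        return None  # rendered as "Not enough registered evidence"
    if higher_is_better:
        return max(results, key=lambda r: r.score)
    return min(results, key=lambda r: r.score)

# Accuracy (higher is better) vs. WER (lower is better):
print(registered_leader([Result("a", 92.9), Result("b", 90.7)], True).model)   # "a"
print(registered_leader([Result("x", 11.2), Result("y", 13.5)], False).model)  # "x"
```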
| Task | Best first benchmark | Current registered leader | Trust | Status | Results | Actions |
|---|---|---|---|---|---|---|
| Commonsense Reasoning Commonsense reasoning — answering questions that require everyday knowledge about how the physical and social… | Massive Multitask Language Understanding | Registered leader o3 92.9% · accuracy | B | Fragmented* | 82 | |
| Mathematical Reasoning Mathematical reasoning benchmarks — GSM8K, MATH, Minerva, and the competition-level AIME/AMC tests — have beco… | Mathematics Aptitude Test of Heuristics | Registered leader Claude Opus 4.5 90.7% · accuracy | — | Canonical* | 79 | |
| Multi-step Reasoning Multi-step reasoning — maintaining coherent inference chains across 5+ sequential steps — is the meta-capabili… | Graduate-Level Google-Proof Q&A Diamond | Registered leader Gemini 2.5 Pro 84.0% · accuracy | B | Canonical* | 53 | |
| Question Answering Question answering now spans extractive reading comprehension, open-domain retrieval QA, multi-hop reasoning,… question-answering | Natural Questions: a Benchmark for Question Answering Research | Not enough registered evidence | B | Stale | 26 | |
| Text Summarization Text summarization compresses documents while preserving key information — a task that became dramatically mor… summarization | CNN/DailyMail Summarization | Registered leader BRIO 47.8% · rouge-1 | — | Canonical* | 15 | |
| Logical Reasoning Logical reasoning — formal deduction, constraint satisfaction, and syllogistic inference — exposes a core weak… | LogiQA | Registered leader GPT-4o 56.3% · accuracy | — | Canonical* | 12 | |
| Natural Language Inference Determining entailment relationships between sentences (SNLI, MNLI). | Stanford Natural Language Inference | Registered leader GPT-4o 92.6% · accuracy | — | Canonical* | 8 | |
| Text Ranking Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by C… text-ranking | BEIR | Registered leader NV-Embed-v2 62.65 · ndcg@10 | — | Canonical* | 8 | |
| Named Entity Recognition Named entity recognition (NER) extracts structured mentions — people, organizations, locations, dates — from u… token-classification | CoNLL-2003 Named Entity Recognition | Registered leader GLiNER-multitask 93.8% · f1 | — | Canonical* | 7 | |
| Arithmetic Reasoning Arithmetic reasoning — solving computation-heavy problems stated in natural language — tests whether models ca… | Math Word Problem Repository | Registered leader GPT-4o 97.2% · accuracy | — | Canonical* | 6 | |
| Text Embeddings Generating dense vector embeddings for retrieval, ranking, clustering, and semantic search. feature-extraction Why this benchmark?: MTEB-style evidence is the practical first stop for embedding and retrieval model choices. What it measures: Retrieval, ranking, classification, clustering, and semantic similarity tasks. What it misses: Your corpus, latency budget, reranker pairing, and domain-specific relevance labels. | MTEB Leaderboard | Registered leader NV-Embed-v2 72.31 · avg-score | — | Canonical | 6 | |
| Entity Linking Linking mentions to knowledge base entities. | AIDA-CoNLL-YAGO (test-b) | Registered leader GENRE 93.30 · micro_f1 | — | Sparse* | 3 | |
| Knowledge Graph Completion Predicting missing links in knowledge graphs. | FB15k-237 Knowledge Graph Completion | Registered leader NBFNet 0.415 · mrr | — | Sparse* | 3 | |
| Relation Extraction Extracting relationships between entities from text. | TAC Relation Extraction Dataset | Registered leader LUKE 72.7% · f1 | — | Sparse* | 3 | |
| Semantic Textual Similarity Semantic similarity measures how close two pieces of text are in meaning — the foundation of duplicate detecti… sentence-similarity | STS Benchmark | Registered leader GTE-Qwen2-7B-instruct 88.40 · spearman | — | Sparse* | 3 | |
| Table Question Answering Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a s… table-question-answering | WikiTableQuestions | Registered leader GPT-4 75.3% · accuracy | — | Sparse* | 3 |
Images, video frames, OCR, layout, tables, document parsing, detection, segmentation, and visual anomaly detection.
| Task | Best first benchmark | Current registered leader | Trust | Status | Results | Actions |
|---|---|---|---|---|---|---|
| Document OCR Reading text, structure, and layout from document images. Why this benchmark?: OCRBench v2 is a broad first stop for document OCR and visual text extraction comparisons. What it measures: Text recognition, document understanding subtasks, and multi-scenario OCR capability. What it misses: Invoice extraction, handwriting-heavy archives, table fidelity, and local-language production OCR. When not to use: Do not use it as the only evidence for production document automation. | OCRBench v2 | Registered leader Qwen2.5-VL-72B 63.70 · overall | — | Canonical | 829 | |
| Scene Text Detection Detecting text regions in natural scene images | coco-text | Registered leader CLIP4STR-L 81.90 · 1-1-accuracy | A | Canonical* | 581 | |
| Document Layout Analysis Analyzing the layout structure of documents | d4la | Registered leader DoPTA 70.7% · map | — | Canonical* | 133 | |
| Scene Text Recognition Recognizing text in natural scene images | cute80 | Registered leader CPPD 99.7% · accuracy | — | Canonical* | 127 | |
| Document Parsing Parsing document structure and content. Why this benchmark?: OmniDocBench is a stronger starting point when the output must preserve reading order, layout, tables, and formulas. What it measures: End-to-end document parsing quality across text, tables, formulas, and layout-sensitive output. What it misses: Vendor-specific workflow behavior, low-resource languages, and private document templates. | OmniDocBench v1.5 | Registered leader Mistral OCR 3 91.63 · reading-order | — | Emerging | 117 | |
| Table Recognition Detecting and parsing tables in documents | icdar2013-table-structure-recognition | Registered leader Proposed System (with post-processing) 95.46 · f-measure | — | Fragmented | 71 | |
| General OCR Capabilities Comprehensive benchmarks covering multiple aspects of OCR performance. | OCRBench v2 | Registered leader mistral-ocr-2512 25.20 · overall-en-private | — | Canonical* | 66 | |
| Document Image Classification Classifying documents by type or category | aip | Registered leader ResNet-RS (ResNet-200 + RS training tricks) 83.40 · top-1-accuracy-verb | — | Canonical* | 62 | |
| Handwriting Recognition Recognizing handwritten text | No canonical benchmark registered | Not enough registered evidence | — | Missing* | 40 | |
| Document Understanding Document understanding requires parsing visually rich documents — invoices, forms, scientific papers, tables —… document-question-answering | Form Understanding in Noisy Scanned Documents | Not enough registered evidence | — | Canonical* | 7 | |
| Semantic Segmentation Semantic segmentation assigns a class label to every pixel — the dense prediction problem that underpins auton… image-segmentation | ADE20K Scene Parsing Benchmark | Registered leader InternImage-H 62.9% · mIoU | — | Canonical* | 6 |
ASR, TTS, speaker intelligence, music, sound events, audio-language understanding, and audio safety.
| Task | Best first benchmark | Current registered leader | Trust | Status | Results | Actions |
|---|---|---|---|---|---|---|
| Speech Recognition Automatic speech recognition went from a specialized pipeline (acoustic model + language model + decoder) to a… automatic-speech-recognition Why this benchmark?: ASR leaderboards remain useful when the split, language, domain, and WER normalization are explicit. What it measures: Transcription error rate on fixed audio corpora. What it misses: Diarization, streaming latency, noisy calls, code-switching, and downstream extraction quality. | Mozilla Common Voice | Registered leader Whisper Large-v2 11.20 · wer | — | Canonical | 22 | |
| Audio Captioning Generating text descriptions of audio content. | AudioCaps | Registered leader AudioCaps baseline (TopDown+Align) 36.9% · spider | — | Sparse* | 3 | |
| Music Generation Generating music from text, audio, or other inputs. | MusicCaps | Registered leader MusicLM 4.000 · fad | — | Sparse* | 3 | |
| Sound Event Detection Detecting and localizing sound events in audio. | Domestic Environment Sound Event Detection (DCASE Task 4) | Registered leader ATST-SED 58.10 · event-f1 | — | Sparse* | 3 | |
| Speaker Verification Verifying speaker identity from voice samples. | VoxCeleb1 Original Test Set (VoxCeleb1-O) | Registered leader ResNet-34 (AM-Softmax, VoxCeleb2) 1.180 · eer | — | Sparse* | 3 | |
| Speech Translation Translating spoken audio directly to another language. | MuST-C English-German tst-COMMON | Registered leader SeamlessM4T v2 Large 37.1% · bleu | — | Sparse* | 3 |
Cross-modal tasks only: VQA, image-text retrieval, video QA, document VQA, text-to-image, image editing, and any-to-any media models.
| Task | Best first benchmark | Current registered leader | Trust | Status | Results | Actions |
|---|---|---|---|---|---|---|
| Visual Question Answering Visual question answering (VQA) is the original multimodal reasoning task — given an image and a natural langu… visual-question-answering Why this benchmark?: VQA is best treated as a family: chart, document, spatial, and general image QA can disagree. What it measures: Question answering over visual inputs on fixed task distributions. What it misses: OCR-heavy documents, long-context image sets, grounding, and tool-assisted visual reasoning. | Visual Question Answering v2.0 | Registered leader Qwen2-VL 72B 87.6% · accuracy | — | Fragmented | 47 | |
| Image Captioning Image captioning — generating natural language descriptions of images — was the task that launched the modern… image-to-text | COCO Captions | Registered leader BLIP-2 145.8 · CIDEr | A | Sparse* | 2 | |
| Text-to-Image Generation Text-to-image generation went from "interesting research" to cultural phenomenon in 18 months. DALL-E 2 (2022)… text-to-image | DPG-Bench | Not enough registered evidence | — | Contested | 0 |
Code generation, completion, repair, repository understanding, tests, vulnerability work, UI code, and mobile app code generation.
| Task | Best first benchmark | Current registered leader | Trust | Status | Results | Actions |
|---|---|---|---|---|---|---|
| Code Generation Generating code from natural language descriptions (HumanEval, MBPP). Why this benchmark?: LiveCodeBench is harder to memorize than older static coding sets and tracks recent coding ability. What it measures: Competitive programming style code generation on dated, rolling problems. What it misses: Repository repair, tool use, test writing, and long-horizon software engineering work. When not to use: Use SWE-bench or repository benchmarks for agentic coding work. | LiveCodeBench | Registered leader DeepSeek-R1-0528 73.3% · pass@1 | B | Canonical | 196 | |
| React Native Code Generation Evaluating AI models on generating correct, production-quality React Native implementations. Covers animation,… | Callstack Incubator React Native Evaluation Suite | Registered leader Composer 2 98.90 · navigation-satisfaction | — | Canonical* | 40 | |
| Code Translation Converting code between programming languages. | TransCoder Evaluation on GeeksForGeeks Algorithmic Problems | Registered leader Claude Sonnet 4 89.40 · computational-accuracy | — | Canonical* | 7 | |
| Bug Detection Identifying bugs and vulnerabilities in code. | Bugs2Fix: Learning to Rewrite Buggy Code | Registered leader GPT-4o 78.6% · accuracy | — | Canonical* | 6 | |
| Code Completion Predicting the next tokens in code sequences. | Cross-File Code Completion Evaluation | Registered leader Claude Sonnet 4 44.50 · exact-match | — | Canonical* | 6 | |
| Program Repair Automatically fixing bugs in code. | Defects4J: A Database of Real Faults in Java Programs | Registered leader SRepair 101.0 · correct-patches | — | Canonical* | 5 | |
| Code Summarization Generating natural language descriptions of code. | CodeXGLUE Code-to-Text Python subset | Registered leader CodeT5-base 20.0% · bleu | — | Sparse* | 3 |
Tool calling, web and desktop agents, browser automation, long-horizon autonomy, multi-agent coordination, and agent safety.
| Task | Best first benchmark | Current registered leader | Trust | Status | Results | Actions |
|---|---|---|---|---|---|---|
| SWE-bench SWE-bench — resolving real GitHub issues from popular Python repositories — became the defining benchmark for… Why this benchmark?: SWE-bench Verified is the usual first benchmark for autonomous agents repairing real GitHub issues. What it measures: Patch generation that resolves repository issues against validation tests. What it misses: Product judgment, multi-day maintenance work, security review, and UI-heavy tasks. | SWE-bench Verified — Agentic Leaderboard | Registered leader Claude Mythos Preview 93.90 · resolve-rate | B | Canonical | 81 | |
| Task agents AI agents are autonomous software systems that use artificial intelligence to achieve goals and complete tasks… | No canonical benchmark registered | Not enough registered evidence | — | Missing* | 35 | |
| Autonomous Coding Agent benchmarks where systems complete coding, terminal, repository, or developer-workflow tasks with minimal… | SWE-bench Verified (Agentic) | Registered leader Claude Opus 4.5 80.90 · pct_resolved | B | Canonical* | 23 | |
| Web & Desktop Agents Web and desktop agents — AI systems that operate browsers and GUIs to complete real tasks — are benchmarked by… | OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments | Registered leader CoAct-1 60.76 · success-rate | — | Canonical* | 19 | |
| Tool Use Benchmarks measuring AI agents' ability to use tools and APIs to complete real-world tasks across domains like… | No canonical benchmark registered | Not enough registered evidence | — | Missing* | 8 | |
| HCAST HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI auton… | Human-Calibrated Autonomy Software Tasks | Registered leader Claude Opus 4 55.00 · success-rate | — | Canonical* | 6 | |
| RE-Bench RE-Bench (Research Engineering Benchmark) from METR evaluates AI agents on 7 open-ended ML research engineerin… | Research Engineering Benchmark | Registered leader o3 0.380 · normalized-score | — | Canonical* | 5 | |
| Time Horizon Time horizon — how long an AI agent can work autonomously before requiring human correction — is arguably the… | METR Autonomy Evaluation: Time Horizon | Registered leader Claude Opus 4 60.00 · task-horizon-minutes | — | Canonical* | 5 | |
| Bioinformatics Agents LLM-agent benchmarks for computational biology — exploring datasets, running multi-step analyses, and interpre… | No canonical benchmark registered | Not enough registered evidence | — | Missing* | 2 |
Tables, tabular classification and regression, time-series forecasting, anomaly detection, recommender systems, graph learning, and optimization.
| Task | Best first benchmark | Current registered leader | Trust | Status | Results | Actions |
|---|---|---|---|---|---|---|
| Node Classification Node classification — assigning labels to vertices in a graph using both node features and neighborhood struct… graph-ml | Cora Citation Network | Registered leader ACNet 83.5% · accuracy | — | Canonical* | 6 | |
| Tabular Classification Tabular classification — predicting discrete labels from structured rows and columns — remains the one domain… tabular-classification | OpenML-CC18 | Registered leader AutoGluon-Tabular 88.5% · accuracy | — | Canonical* | 5 | |
| Link Prediction Link prediction — inferring missing or future edges in a graph — underpins knowledge graph completion, drug-ta… | Open Graph Benchmark - ogbl-collab | Registered leader PROXI 70.98 · hits_at_50 | — | Sparse* | 3 | |
| Molecular Property Prediction Molecular property prediction — estimating toxicity, solubility, binding affinity, or other properties from mo… | Open Graph Benchmark - ogbg-molhiv | Registered leader DGN 79.70 · roc_auc | — | Sparse* | 3 | |
| Tabular Regression Tabular regression — predicting continuous values from structured data — powers everything from house-price es… tabular-regression | California Housing | Registered leader XGBoost 0.453 · rmse | — | Sparse* | 2 |
Game playing, continuous control, manipulation, navigation, embodied instruction following, VLA models, drones, and autonomous driving.
| Task | Best first benchmark | Current registered leader | Trust | Status | Results | Actions |
|---|---|---|---|---|---|---|
| Atari Games Atari games became the canonical RL benchmark when DeepMind's DQN (2013) learned to play Breakout from raw pix… reinforcement-learning | Arcade Learning Environment (Atari 2600) | Registered leader Go-Explore 40000.0 · human-normalized-score | — | Canonical* | 12 | |
| Continuous Control Continuous control — learning smooth motor commands in simulated physics — was transformed by MuJoCo and the O… | Multi-Joint dynamics with Contact | Registered leader TD-MPC2 (317M params) 960.0 · average-return | — | Canonical* | 9 |
A domain layer for medical imaging, clinical text, drug discovery, protein modeling, industrial inspection, remote sensing, climate, legal, finance, and compliance AI.
| Task | Best first benchmark | Current registered leader | Trust | Status | Results | Actions |
|---|---|---|---|---|---|---|
| Disease Classification Diagnosing diseases from medical images or data. | Autism Brain Imaging Data Exchange I | Registered leader SSAE + Softmax (Explainable ASD) 98.2% · accuracy | — | Canonical* | 57 | |
| Anomaly Detection Detecting defects and anomalies in manufacturing (MVTec AD, VisA). | MVTec Anomaly Detection Dataset | Registered leader AnomalyGPT 97.40 · auroc | — | Canonical* | 27 | |
| Medical Image Segmentation Segmenting organs and abnormalities in medical images. | Automated Cardiac Diagnosis Challenge | Registered leader MedNeXt-L 92.65 · mean-dsc | — | Canonical* | 26 |
Benchmarks are not equally believable. Some are held out behind a private evaluator; some ship their test set as part of the training corpus. We grade the canonical dataset of every task on a four-point scale and show the letter next to the score.
A dataset can be regraded in public at any time; the history is preserved on the benchmark page. We publish the regrade; we don't erase the prior grade.
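A hedged sketch of that ledger, assuming the four points are the letters A through D (the tables above show only A and B) and that a regrade appends to the history instead of overwriting it:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Regrade:
    grade: str    # assumed scale: "A", "B", "C", or "D"
    on: date
    reason: str   # e.g. "test set found in a public training corpus"

@dataclass
class TrustRecord:
    dataset: str
    history: list[Regrade] = field(default_factory=list)  # append-only

    def regrade(self, grade: str, on: date, reason: str) -> None:
        # Publish the regrade; the prior entries stay on record.
        self.history.append(Regrade(grade, on, reason))

    @property
    def current(self) -> str | None:
        # No history yet is rendered as "—" / Unknown in the tables above.
        return self.history[-1].grade if self.history else None
```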
HuggingFace pipeline-tag categories. These group concrete tasks thematically; they are not themselves measurable. Use them to navigate to the real rankings.
True omni: accepts any modality in (text + image + audio + video) AND generates multiple modalities out (including speech, not just text). The narrowest open-weights category — Qwen3-Omni · Vita · Mini-Omni. Proprietary: GPT-4o · Gemini 3 · Sesame CSM.
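Read as a predicate, one plausible interpretation of that definition; the modality sets come from the sentence above, the rest is illustrative:

```python
MODALITIES = {"text", "image", "audio", "video"}

def is_true_omni(inputs: set[str], outputs: set[str]) -> bool:
    accepts_all_in = MODALITIES <= inputs    # any of text/image/audio/video in
    multiple_out = len(outputs) >= 2         # generates multiple modalities out
    includes_speech = "speech" in outputs    # including speech, not just text
    return accepts_all_in and multiple_out and includes_speech

# A text-in, text+speech-out model fails the input test:
print(is_true_omni({"text"}, {"text", "speech"}))  # False
```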
Vision-Language Models that read images and produce text answers.
Image editing and inpainting conditioned on text prompts.
Animate a still image guided by a text prompt.
Multimodal LLMs that listen and respond in text.
Speech translation, voice conversion, audio enhancement.
Video editing, style transfer, super-resolution.
Generate a 3D mesh or NeRF from one or more images.
Generate a 3D asset from a text prompt.
Music, sound effects, environmental audio from text.
Animate a still image into a short clip.
Generative image models without text conditioning (DCGAN, StyleGAN era).
Most leaderboards are a ledger of claims. Authors submit a number, a banner appears, and the number stands until the next banner appears. Codesota is different in three ordinary ways.
First, every submission carries code. Not a repo link alone — a frozen commit, a declared environment, a recorded seed. If it does not run, the row does not publish.
Second, every benchmark has a metric direction. Higher-is-better and lower-is-better are declared on the dataset; no ambiguity reaches the reader.
Third, every score carries a date. When a model regresses — and they do — the record is preserved. The table never silently forgets.
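Taken together, the three properties fit in a single submission record. A hedged sketch with illustrative field names; only the requirements themselves come from the paragraphs above:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Submission:
    model: str
    commit_sha: str         # a frozen commit, not just a repo link
    environment: str        # declared environment, e.g. a lockfile digest
    seed: int               # recorded seed
    metric: str
    higher_is_better: bool  # metric direction, declared on the dataset
    score: float
    run_date: date          # every score carries a date

LEDGER: list[Submission] = []  # append-only: regressions stay on record

def publish(submission: Submission, ran_successfully: bool) -> None:
    if not ran_successfully:
        return  # if it does not run, the row does not publish
    LEDGER.append(submission)
```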