All Verified Results
258 benchmark results across 50 datasets. Every data point links to its source.
258
Total Results
50
Benchmarks
121
Models
JSON API: Download raw data at /data/benchmarks.json
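The raw data file can be consumed programmatically. A minimal sketch of filtering it, assuming each record mirrors the table columns below (`model`, `dataset`, `metric`, `value`, `source`) — the actual schema of `/data/benchmarks.json` may differ:

```python
import json

# Inline sample in the assumed record shape; with the real file you would
# load it via json.load(open(...)) or an HTTP request instead.
sample = json.loads("""
[
  {"model": "resnet-50", "dataset": "imagenet-1k", "metric": "top-1-accuracy", "value": 76.15, "source": "pytorch-vision"},
  {"model": "vit-b-16",  "dataset": "imagenet-1k", "metric": "top-1-accuracy", "value": 81.2,  "source": "google-research"},
  {"model": "gpt-4o",    "dataset": "mmlu",        "metric": "accuracy",       "value": 88.7,  "source": "openai-blog"}
]
""")

def best_on(records, dataset, metric):
    """Return the highest-scoring record for a dataset/metric pair, or None."""
    rows = [r for r in records if r["dataset"] == dataset and r["metric"] == metric]
    return max(rows, key=lambda r: r["value"]) if rows else None

top = best_on(sample, "imagenet-1k", "top-1-accuracy")
print(top["model"], top["value"])  # vit-b-16 81.2
```

Note that for error-style metrics (CER, WER, edit distance) lower is better, so a real query helper would need a per-metric direction flag.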
Complete Results Table
| Model | Dataset | Metric | Value | Source |
|---|---|---|---|---|
| coca-finetuned | imagenet-1k | top-1-accuracy | 91 | google-research |
| vit-g-14 | imagenet-1k | top-1-accuracy | 90.45 | google-research |
| convnext-v2-huge | imagenet-1k | top-1-accuracy | 88.9 | meta-research |
| vit-h-14 | imagenet-1k | top-1-accuracy | 88.55 | google-research |
| swin-large | imagenet-1k | top-1-accuracy | 87.3 | microsoft-research |
| efficientnet-v2-l | imagenet-1k | top-1-accuracy | 85.7 | google-research |
| deit-b-distilled | imagenet-1k | top-1-accuracy | 85.2 | meta-research |
| efficientnet-b7 | imagenet-1k | top-1-accuracy | 84.4 | google-research |
| deit-b | imagenet-1k | top-1-accuracy | 83.1 | meta-research |
| convnext-v2-tiny | imagenet-1k | top-1-accuracy | 83 | meta-research |
| vit-l-16 | imagenet-1k | top-1-accuracy | 82.7 | google-research |
| vit-b-16 | imagenet-1k | top-1-accuracy | 81.2 | google-research |
| resnet-50-a3 | imagenet-1k | top-1-accuracy | 80.4 | timm-research |
| resnet-152 | imagenet-1k | top-1-accuracy | 78.6 | microsoft-research |
| efficientnet-b0 | imagenet-1k | top-1-accuracy | 77.1 | google-research |
| resnet-50 | imagenet-1k | top-1-accuracy | 76.15 | pytorch-vision |
| swin-v2-large | imagenet-v2 | top-1-accuracy | 84 | microsoft-research |
| convnext-v2-huge | imagenet-v2 | top-1-accuracy | 80.5 | meta-research |
| vit-h-14 | cifar-100 | accuracy | 94.55 | google-research |
| vit-b-16 | cifar-100 | accuracy | 91.48 | huggingface |
| deit-b-distilled | cifar-10 | accuracy | 99.1 | meta-research |
| convnext-v2-base | cifar-10 | accuracy | 98.7 | meta-research |
| resnet-50 | cifar-10 | accuracy | 96.01 | cutout-paper |
| efficientnet-b7 | cifar-100 | accuracy | 91.7 | google-research |
| resnet-50 | cifar-100 | accuracy | 78.04 | cutout-paper |
| paddleocr-vl | omnidocbench | composite | 92.86 | alphaxiv-leaderboard |
| paddleocr-vl-0.9b | omnidocbench | composite | 92.56 | alphaxiv-leaderboard |
| mineru-2.5 | omnidocbench | composite | 90.67 | alphaxiv-leaderboard |
| qwen3-vl-235b | omnidocbench | composite | 89.15 | alphaxiv-leaderboard |
| monkeyocr-pro-3b | omnidocbench | composite | 88.85 | alphaxiv-leaderboard |
| gemini-25-pro | omnidocbench | composite | 88.03 | alphaxiv-leaderboard |
| qwen25-vl | omnidocbench | composite | 87.02 | alphaxiv-leaderboard |
| ocrverse-4b | omnidocbench | composite | 88.56 | github-leaderboard |
| dots-ocr-3b | omnidocbench | composite | 88.41 | github-leaderboard |
| mistral-ocr-3 | omnidocbench | composite | 79.75 | codesota-verified |
| mistral-ocr-3 | omnidocbench | text-edit-distance | 0.099 | codesota-verified |
| mistral-ocr-3 | omnidocbench | table-teds | 70.88 | codesota-verified |
| mistral-ocr-3 | omnidocbench | formula-edit-distance | 0.218 | codesota-verified |
| mistral-ocr-3 | omnidocbench | reading-order | 91.63 | codesota-verified |
| clearocr-teamquest | omnidocbench | composite | 31.7 | codesota-verified |
| clearocr-teamquest | omnidocbench | text-edit-distance | 0.154 | codesota-verified |
| clearocr-teamquest | omnidocbench | table-teds | 0.8 | codesota-verified |
| clearocr-teamquest | omnidocbench | formula-edit-distance | 0.902 | codesota-verified |
| clearocr-teamquest | omnidocbench | reading-order | 86.04 | codesota-verified |
| gpt-4o | omnidocbench | ocr-edit-distance | 0.02 | alphaxiv-leaderboard |
| paddleocr-vl | omnidocbench | table-teds | 93.52 | alphaxiv-leaderboard |
| mineru-2.5 | omnidocbench | layout-map | 97.5 | alphaxiv-leaderboard |
| seed-1.6-vision | ocrbench-v2 | overall-en-private | 62.2 | alphaxiv-leaderboard |
| qwen3-omni-30b | ocrbench-v2 | overall-en-private | 61.3 | alphaxiv-leaderboard |
| nemotron-nano-v2-vl | ocrbench-v2 | overall-en-private | 61.2 | alphaxiv-leaderboard |
| gemini-25-pro | ocrbench-v2 | overall-en-private | 59.3 | alphaxiv-leaderboard |
| gpt-4o | ocrbench-v2 | overall-en-private | 55.5 | alphaxiv-leaderboard |
| gemini-25-pro | ocrbench-v2 | overall-zh-private | 62.2 | alphaxiv-leaderboard |
| chandra-ocr-0.1.0 | olmocr-bench | pass-rate | 83.1 | alphaxiv-leaderboard |
| chandra-ocr-0.1.0 | olmocr-bench | tables | 88 | github-readme |
| chandra-ocr-0.1.0 | olmocr-bench | old-scans-math | 80.3 | github-readme |
| chandra-ocr-0.1.0 | olmocr-bench | long-tiny-text | 92.3 | github-readme |
| chandra-ocr-0.1.0 | olmocr-bench | base | 99.9 | github-readme |
| chandra-ocr-0.1.0 | olmocr-bench | headers-footers | 90.8 | github-readme |
| chandra-ocr-0.1.0 | olmocr-bench | multi-column | 81.2 | github-readme |
| chandra-ocr-0.1.0 | olmocr-bench | arxiv | 82.2 | github-readme |
| chandra-ocr-0.1.0 | olmocr-bench | old-scans | 50.4 | github-readme |
| deepseek-ocr | olmocr-bench | pass-rate | 75.4 | github-readme |
| dots-ocr-3b | olmocr-bench | pass-rate | 79.1 | github-readme |
| marker-1.10.0 | olmocr-bench | pass-rate | 76.5 | github-readme |
| gpt-4o-anchored | olmocr-bench | pass-rate | 69.9 | github-readme |
| gemini-flash-2 | olmocr-bench | pass-rate | 63.8 | github-readme |
| dots-ocr-3b | olmocr-bench | tables | 88.3 | github-readme |
| olmocr-v0.3.0 | olmocr-bench | old-scans-math | 79.9 | github-readme |
| olmocr-v0.3.0 | olmocr-bench | headers-footers | 95.1 | github-readme |
| marker-1.10.0 | olmocr-bench | arxiv | 83.8 | github-readme |
| gpt-4o | olmocr-bench | old-scans | 40.7 | github-readme |
| infinity-parser-7b | olmocr-bench | pass-rate | 82.5 | alphaxiv-leaderboard |
| olmocr-v0.4.0 | olmocr-bench | pass-rate | 82.4 | alphaxiv-leaderboard |
| paddleocr-vl | olmocr-bench | pass-rate | 80 | alphaxiv-leaderboard |
| marker-1.10.1 | olmocr-bench | pass-rate | 76.1 | alphaxiv-leaderboard |
| deepseek-ocr | olmocr-bench | pass-rate | 75.7 | alphaxiv-leaderboard |
| mineru-2.5 | olmocr-bench | pass-rate | 75.2 | alphaxiv-leaderboard |
| mistral-ocr-3 | olmocr-bench | pass-rate | 78 | mistral-announcement |
| mistral-ocr-3 | internal-mistral | overall-accuracy | 94.9 | mistral-announcement |
| mistral-ocr-3 | ocr-cer-benchmark | cer | 3.7 | sparkco-benchmark |
| mistral-ocr-3 | ocr-wer-benchmark | wer | 7.1 | sparkco-benchmark |
| mistral-ocr-api | olmocr-bench | pass-rate | 72 | alphaxiv-leaderboard |
| nanonets-ocr2-3b | olmocr-bench | pass-rate | 69.5 | alphaxiv-leaderboard |
| churro-3b | churro-ds | handwritten-levenshtein | 70.1 | alphaxiv-leaderboard |
| churro-3b | churro-ds | printed-levenshtein | 82.3 | alphaxiv-leaderboard |
| gemini-25-pro | churro-ds | handwritten-levenshtein | 63.6 | alphaxiv-leaderboard |
| gemini-25-pro | churro-ds | printed-levenshtein | 80.9 | alphaxiv-leaderboard |
| gemini-25-flash | churro-ds | handwritten-levenshtein | 58.7 | alphaxiv-leaderboard |
| qwen25-vl-72b | churro-ds | handwritten-levenshtein | 54.5 | alphaxiv-leaderboard |
| claude-sonnet-4 | churro-ds | handwritten-levenshtein | 37.1 | alphaxiv-leaderboard |
| gpt-4o | churro-ds | handwritten-levenshtein | 34.2 | alphaxiv-leaderboard |
| gemini-15-pro | cc-ocr | multi-scene-f1 | 83.25 | alphaxiv-leaderboard |
| qwen2-vl-72b | cc-ocr | multi-scene-f1 | 77.95 | alphaxiv-leaderboard |
| internvl2-76b | cc-ocr | multi-scene-f1 | 76.92 | alphaxiv-leaderboard |
| gpt-4o | cc-ocr | multi-scene-f1 | 76.4 | alphaxiv-leaderboard |
| claude-35-sonnet | cc-ocr | multi-scene-f1 | 72.87 | alphaxiv-leaderboard |
| qwen2-vl-72b | cc-ocr | kie-f1 | 71.76 | alphaxiv-leaderboard |
| gemini-15-pro | cc-ocr | kie-f1 | 67.28 | alphaxiv-leaderboard |
| claude-35-sonnet | cc-ocr | kie-f1 | 64.58 | alphaxiv-leaderboard |
| gpt-4o | cc-ocr | kie-f1 | 63.45 | alphaxiv-leaderboard |
| gemini-15-pro | cc-ocr | multilingual-f1 | 78.97 | alphaxiv-leaderboard |
| gpt-4o | cc-ocr | multilingual-f1 | 73.44 | alphaxiv-leaderboard |
| gemini-15-pro | cc-ocr | document-parsing | 62.37 | alphaxiv-leaderboard |
| gemini-25-pro | mme-videoocr | total-accuracy | 73.7 | alphaxiv-leaderboard |
| qwen25-vl-72b | mme-videoocr | total-accuracy | 69 | alphaxiv-leaderboard |
| internvl3-78b | mme-videoocr | total-accuracy | 67.2 | alphaxiv-leaderboard |
| gpt-4o | mme-videoocr | total-accuracy | 66.4 | alphaxiv-leaderboard |
| gemini-15-pro | mme-videoocr | total-accuracy | 64.9 | alphaxiv-leaderboard |
| qwen25-vl-32b | mme-videoocr | total-accuracy | 61 | alphaxiv-leaderboard |
| gemini-20-flash | kitab-bench | cer | 0.13 | alphaxiv-leaderboard |
| ain-7b | kitab-bench | cer | 0.2 | alphaxiv-leaderboard |
| gpt-4o | kitab-bench | cer | 0.31 | alphaxiv-leaderboard |
| gpt-4o-mini | kitab-bench | cer | 0.43 | alphaxiv-leaderboard |
| azure-ocr | kitab-bench | cer | 0.52 | alphaxiv-leaderboard |
| tesseract | kitab-bench | cer | 0.54 | alphaxiv-leaderboard |
| easyocr | kitab-bench | cer | 0.58 | alphaxiv-leaderboard |
| paddleocr | kitab-bench | cer | 0.79 | alphaxiv-leaderboard |
| claude-sonnet-4 | thaiocrbench | ted-score | 0.84 | alphaxiv-leaderboard |
| gemini-25-pro | thaiocrbench | ted-score | 0.77 | alphaxiv-leaderboard |
| qwen25-vl-32b | thaiocrbench | ted-score | 0.765 | alphaxiv-leaderboard |
| internvl3-14b | thaiocrbench | ted-score | 0.76 | alphaxiv-leaderboard |
| qwen25-vl-72b | thaiocrbench | ted-score | 0.72 | alphaxiv-leaderboard |
| o1-preview | gsm8k | accuracy | 97.8 | openai-blog |
| gpt-4o | gsm8k | accuracy | 92 | openai-blog |
| claude-35-sonnet | gsm8k | accuracy | 96.4 | anthropic-blog |
| gemini-15-pro | gsm8k | accuracy | 91.7 | google-blog |
| llama-3-70b | gsm8k | accuracy | 93 | meta-blog |
| o1-preview | math | accuracy | 94.8 | openai-blog |
| gpt-4o | math | accuracy | 76.6 | openai-blog |
| claude-35-sonnet | math | accuracy | 71.1 | anthropic-blog |
| gemini-15-pro | math | accuracy | 67.7 | google-blog |
| deepseek-v3 | math | accuracy | 90.2 | deepseek-blog |
| o1-preview | aime-2024 | accuracy | 83.3 | openai-blog |
| gpt-4o | aime-2024 | accuracy | 13.4 | openai-blog |
| claude-35-opus | aime-2024 | accuracy | 16 | anthropic-blog |
| gpt-4o | hellaswag | accuracy | 95.3 | openai-blog |
| claude-35-sonnet | hellaswag | accuracy | 89 | anthropic-blog |
| llama-3-70b | hellaswag | accuracy | 88 | meta-blog |
| gemini-15-pro | hellaswag | accuracy | 92.5 | google-blog |
| gpt-4o | winogrande | accuracy | 87.5 | openai-blog |
| claude-35-sonnet | winogrande | accuracy | 85.4 | anthropic-blog |
| llama-3-70b | winogrande | accuracy | 85.3 | meta-blog |
| gpt-4o | arc-challenge | accuracy | 96.4 | openai-blog |
| claude-35-sonnet | arc-challenge | accuracy | 96.7 | anthropic-blog |
| llama-3-70b | arc-challenge | accuracy | 93 | meta-blog |
| gemini-15-pro | arc-challenge | accuracy | 94.8 | google-blog |
| gpt-4o | mmlu | accuracy | 88.7 | openai-blog |
| o1-preview | mmlu | accuracy | 92.3 | openai-blog |
| claude-35-sonnet | mmlu | accuracy | 88.7 | anthropic-blog |
| gemini-15-pro | mmlu | accuracy | 85.9 | google-blog |
| llama-3-70b | mmlu | accuracy | 82 | meta-blog |
| deepseek-v3 | mmlu | accuracy | 88.5 | deepseek-blog |
| o1-preview | gpqa | accuracy | 78 | openai-blog |
| gpt-4o | gpqa | accuracy | 53.6 | openai-blog |
| claude-35-sonnet | gpqa | accuracy | 59.4 | anthropic-blog |
| gemini-15-pro | gpqa | accuracy | 46.2 | google-blog |
| gpt-4o | commonsenseqa | accuracy | 85.4 | openai-blog |
| claude-35-sonnet | commonsenseqa | accuracy | 83.2 | anthropic-blog |
| llama-3-70b | commonsenseqa | accuracy | 80.9 | meta-blog |
| gpt-4o | hotpotqa | f1 | 71.3 | arxiv-paper |
| claude-35-sonnet | hotpotqa | f1 | 68.5 | arxiv-paper |
| gpt-4o | strategyqa | accuracy | 82.1 | arxiv-paper |
| claude-35-sonnet | strategyqa | accuracy | 79.8 | arxiv-paper |
| gpt-4o | logiqa | accuracy | 56.3 | arxiv-paper |
| claude-35-sonnet | logiqa | accuracy | 53.8 | arxiv-paper |
| gpt-4o | reclor | accuracy | 72.4 | arxiv-paper |
| claude-35-sonnet | reclor | accuracy | 68.9 | arxiv-paper |
| gpt-4o | svamp | accuracy | 93.7 | arxiv-paper |
| claude-35-sonnet | svamp | accuracy | 91.2 | arxiv-paper |
| llama-3-70b | svamp | accuracy | 89.5 | meta-blog |
| gpt-4o | mawps | accuracy | 97.2 | arxiv-paper |
| claude-35-sonnet | mawps | accuracy | 95.8 | arxiv-paper |
| llama-3-70b | mawps | accuracy | 94.1 | meta-blog |
| plymouth-dl-model | abide-i | accuracy | 98 | research-paper |
| deepasd | abide-ii | auc | 93 | research-paper |
| mcbert | abide-i | accuracy | 93.4 | research-paper |
| ae-fcn | abide-i | accuracy | 85 | research-paper |
| braingt | abide-i | auc | 78.7 | research-paper |
| asd-swnet | abide-i | accuracy | 76.52 | research-paper |
| asd-swnet | abide-i | auc | 81 | research-paper |
| al-negat | abide-i | accuracy | 74.7 | research-paper |
| braingnn | abide-i | accuracy | 73.3 | research-paper |
| gcn | abide-i | accuracy | 72.2 | research-paper |
| gcn | abide-i | auc | 78 | research-paper |
| multi-task-transformer | abide-i | accuracy | 72 | research-paper |
| svm-connectivity | abide-i | accuracy | 70.1 | research-paper |
| svm-connectivity | abide-i | auc | 77 | research-paper |
| deep-learning-heinsfeld | abide-i | accuracy | 70 | research-paper |
| mvs-gcn | abide-i | accuracy | 69.38 | research-paper |
| mvs-gcn | abide-i | auc | 69.01 | research-paper |
| phgcl-ddgformer | abide-i | accuracy | 70.9 | research-paper |
| random-forest | abide-i | accuracy | 63 | research-paper |
| maacnn | abide-i | accuracy | 75.12 | research-paper |
| maacnn | abide-ii | accuracy | 72.88 | research-paper |
| multi-atlas-dnn | abide-i | accuracy | 78.07 | research-paper |
| abraham-connectomes | abide-i | accuracy | 67 | research-paper |
| o1-preview | humaneval | pass@1 | 92.4 | openai-blog |
| claude-35-sonnet | humaneval | pass@1 | 92 | anthropic-blog |
| gpt-4o | humaneval | pass@1 | 90.2 | openai-blog |
| deepseek-v3 | humaneval | pass@1 | 82.6 | deepseek-blog |
| llama-3-70b | humaneval | pass@1 | 81.7 | meta-blog |
| claude-35-sonnet | swe-bench-verified | resolve-rate | 49 | anthropic-blog |
| gpt-4o | swe-bench-verified | resolve-rate | 41.2 | swe-bench-leaderboard |
| deepseek-v25 | swe-bench-verified | resolve-rate | 37 | deepseek-blog |
| gpt-4o | mbpp | pass@1 | 87.8 | openai-blog |
| claude-35-sonnet | mbpp | pass@1 | 89.2 | anthropic-blog |
| internimage-h | coco | mAP | 65.4 | arxiv-paper |
| co-detr-swin-l | coco | mAP | 66 | arxiv-paper |
| dino-swin-l | coco | mAP | 63.3 | arxiv-paper |
| yolov10-x | coco | mAP | 57.4 | github-readme |
| efficientdet-d7-x | coco | mAP | 55.1 | google-research |
| internimage-h | ade20k | mIoU | 62.9 | arxiv-paper |
| mask2former-swin-l | ade20k | mIoU | 57.3 | arxiv-paper |
| agent57 | atari-2600 | human-normalized-score | 4731.3 | deepmind-research |
| go-explore | atari-2600 | human-normalized-score | 40000 | nature-paper |
| muzero | atari-2600 | human-normalized-score | 731 | nature-paper |
| dreamerv3 | atari-2600 | human-normalized-score | 840 | arxiv-paper |
| rainbow-dqn | atari-2600 | human-normalized-score | 231 | aaai-paper |
| dqn | atari-2600 | human-normalized-score | 79 | nature-paper |
| human-gamer | atari-2600 | human-normalized-score | 100 | baseline |
| bbos-1 | atari-2600 | human-normalized-score | 1100 | research |
| gdi-h3 | atari-2600 | human-normalized-score | 950 | research |
| chexpert-auc-maximizer | chexpert | auroc | 93 | stanford-leaderboard |
| chexzero | chexpert | auroc | 88.6 | research-paper |
| torchxrayvision | chexpert | auroc | 87.4 | github-readme |
| densenet-121-cxr | chexpert | auroc | 86.5 | research-paper |
| gloria | chexpert | auroc | 88.2 | research-paper |
| medclip | chexpert | auroc | 87.8 | research-paper |
| biovil | chexpert | auroc | 89.1 | microsoft-research |
| chexnet | nih-chestxray14 | auroc | 84.1 | research-paper |
| torchxrayvision | nih-chestxray14 | auroc | 85.8 | github-readme |
| densenet-121-cxr | nih-chestxray14 | auroc | 82.6 | research-paper |
| resnet-50-cxr | nih-chestxray14 | auroc | 80.4 | research-paper |
| chexzero | mimic-cxr | auroc | 89.2 | research-paper |
| torchxrayvision | mimic-cxr | auroc | 86.3 | github-readme |
| convirt | mimic-cxr | auroc | 85.7 | research-paper |
| rad-dino | vindr-cxr | auroc | 91.2 | microsoft-research |
| torchxrayvision | vindr-cxr | auroc | 87.9 | research-paper |
| densenet-121-cxr | rsna-pneumonia | auroc | 88.5 | kaggle-competition |
| chexnet | rsna-pneumonia | auroc | 87.2 | research-paper |
| torchxrayvision | padchest | auroc | 84.6 | github-readme |
| densenet-121-cxr | covid-chestxray | auroc | 94.7 | research-paper |
| torchxrayvision | covid-chestxray | auroc | 93.2 | github-readme |
| patchcore | mvtec-ad | auroc | 99.1 | research-paper |
| efficientad | mvtec-ad | auroc | 99.1 | research-paper |
| simplenet | mvtec-ad | auroc | 99.6 | research-paper |
| padim | mvtec-ad | auroc | 97.9 | research-paper |
| fastflow | mvtec-ad | auroc | 99.4 | research-paper |
| draem | mvtec-ad | auroc | 98 | research-paper |
| cflow-ad | mvtec-ad | auroc | 98.3 | research-paper |
| reverse-distillation | mvtec-ad | auroc | 98.5 | research-paper |
| patchcore | visa | auroc | 92.1 | research-paper |
| simplenet | visa | auroc | 95.5 | research-paper |
| efficientad | visa | auroc | 94.8 | research-paper |
| yolov8-weld | weld-defect-xray | map | 87.3 | research |
| defectdet-resnet | neu-det | map | 78.4 | research |
| yolov8-weld | severstal-steel | dice | 91.2 | kaggle |
Pending Verification
These results are claimed in papers but still need manual verification against the source PDF or documentation.
| Model | Dataset | Claimed Value | Status |
|---|---|---|---|
| trocr-large | sroie | 96.58 | needs-pdf-verification |
| trocr-large | iam | 2.89 | needs-pdf-verification |
| paddleocr-v4 | icdar-2015 | Unknown | needs-documentation-verification |
| polish-roberta-ocr | poleval-2021-ocr | Unknown | needs-verification |
| polish-t5-ocr | poleval-2021-ocr | Unknown | needs-verification |
| herbert | poleval-2021-ocr | Unknown | needs-verification |
| abbyy-finereader | impact-psnc | Unknown | needs-verification |
| tesseract-polish | impact-psnc | Unknown | needs-verification |
| tesseract-polish | codesota-polish | Unknown | needs-verification |
| tesseract-polish | codesota-polish-wikipedia | Unknown | needs-verification |
| tesseract-polish | codesota-polish-real | Unknown | needs-verification |
| tesseract-polish | codesota-polish-synth-random | Unknown | needs-verification |
| tesseract-polish | codesota-polish-synth-words | Unknown | needs-verification |
Data Quality
Benchmark results are aggregated from public leaderboards (AlphaXiv, GitHub, codesota), official research blogs, and published papers, as listed in the Source column. Each data point includes its source for verification.
Results marked as "pending verification" are claimed in papers but have not been independently confirmed. We do not include estimated or interpolated values.
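The headline counts (total results, distinct benchmarks, distinct models) can be recomputed directly from the data file as a sanity check. A sketch over a tiny inline sample in the same assumed schema; run against the full file, the three numbers should match the 258 / 50 / 121 shown at the top of this page:

```python
import json

# Small sample in the assumed schema; substitute the full contents of
# /data/benchmarks.json to reproduce the page's summary statistics.
records = json.loads("""
[
  {"model": "resnet-50", "dataset": "imagenet-1k", "metric": "top-1-accuracy", "value": 76.15},
  {"model": "resnet-50", "dataset": "cifar-10",    "metric": "accuracy",       "value": 96.01},
  {"model": "gpt-4o",    "dataset": "mmlu",        "metric": "accuracy",       "value": 88.7}
]
""")

total = len(records)                               # every (model, dataset, metric) row
benchmarks = len({r["dataset"] for r in records})  # distinct datasets
models = len({r["model"] for r in records})        # distinct models
print(total, benchmarks, models)  # 3 3 2
```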