Union14M is a scene text recognition benchmark assembled from 14 datasets (4M labeled and 10M unlabeled images). Reported accuracy drops by 33-48% compared with standard benchmarks, exposing real-world model limitations across 7 challenge categories: Artistic, Multi-Oriented, Salient, Multi-Words, General, Contextless, and Incomplete.
Accuracy is the reported evaluation metric for Union14M. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
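As a rough illustration of what an accuracy score like this measures, the sketch below computes word-level accuracy: the percentage of images whose predicted string matches the ground-truth label. The function name `word_accuracy` and the `case_sensitive` switch are hypothetical. This page does not state which text normalization the published scores use, so the case-insensitive default is an assumption rather than the benchmark's official protocol.

```python
from typing import Iterable


def word_accuracy(
    predictions: Iterable[str],
    labels: Iterable[str],
    case_sensitive: bool = False,  # assumption: normalization is not specified on this page
) -> float:
    """Return the percentage of predictions that exactly match their labels."""
    preds = list(predictions)
    gts = list(labels)
    if len(preds) != len(gts):
        raise ValueError("predictions and labels must have the same length")
    if not gts:
        raise ValueError("at least one sample is required")

    def normalize(text: str) -> str:
        # Only case folding is applied here; punctuation is left untouched.
        return text if case_sensitive else text.lower()

    correct = sum(normalize(p) == normalize(g) for p, g in zip(preds, gts))
    return 100.0 * correct / len(gts)


if __name__ == "__main__":
    # Hypothetical predictions and labels, for illustration only.
    preds = ["hello", "World", "cat"]
    gts = ["hello", "world", "cab"]
    print(f"{word_accuracy(preds, gts):.1f}")
```

A whole-word match is a strict criterion: a single wrong character counts the entire sample as incorrect, which is one reason scores on harder image categories can fall sharply.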
Muted rows were not state of the art when published: an earlier or same-year result had already scored better.
| Rank | Model | Trust | Accuracy (%) | Year |
|---|---|---|---|---|
| 01 | CLIP4STR-B | paper | 70.8 | 2026 |
| 02 | PARSeq | paper | 67.8 | 2026 |
| 03 | CLIP4STR | paper | 67.3 | 2026 |
| 04 | LPV-S | paper | 65.1 | 2026 |
| 05 | MAERec-S | paper | 62.4 | 2026 |
| 06 | MATRN | paper | 61.2 | 2026 |
| 07 | CDistNet | paper | 56.2 | 2026 |
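
The muting rule stated above the table can be sketched as a small filter: a result is muted when some other result from the same or an earlier year has a strictly higher score. The `Result` type, the `muted_models` function, and the sample entries are hypothetical and chosen only to show the year comparison. This is one reading of the rule, not necessarily how Codesota implements it.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Result:
    model: str
    score: float
    year: int


def muted_models(results: List[Result]) -> List[str]:
    """Return models that were not state of the art when published.

    A result is muted if any other result from the same or an earlier
    year has a strictly higher score.
    """
    return [
        r.model
        for r in results
        if any(o.year <= r.year and o.score > r.score for o in results)
    ]


if __name__ == "__main__":
    # Hypothetical entries, not taken from the leaderboard.
    sample = [
        Result("Model A", 60.0, 2022),
        Result("Model B", 58.0, 2023),  # beaten by Model A from an earlier year
        Result("Model C", 65.0, 2023),
        Result("Model D", 62.0, 2024),  # beaten by Model C from an earlier year
    ]
    print(muted_models(sample))  # ['Model B', 'Model D']
```

Under this reading, when every row carries the same year, only the top-scoring row stays unmuted.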