Scene Text Recognition
Recognizing text in natural scene images
Scene text recognition reads the text content from cropped images of text regions detected in natural scenes. It must handle the diverse fonts, perspective distortion, partial occlusion, and variable illumination that make scene text much harder to read than printed-document OCR. Modern methods (ABINet, PARSeq, MAERec) achieve 97%+ accuracy on standard benchmarks, but irregular real-world text, and multilingual text in particular, remains challenging.
History
2003: ICDAR scene text recognition competitions begin; early methods use HOG + SVM character classification
2015: CRNN (Shi et al.) combines CNN feature extraction with BiLSTM sequence modeling and CTC loss — becomes the standard architecture for years
2016–2018: Attention-based methods (RARE, ASTER) add spatial transformer networks (STN) to rectify distorted text before recognition
2019: MORAN and ESIR improve text rectification, pushing accuracy on curved-text benchmarks significantly
2021: ABINet introduces autonomous, bidirectional, and iterative language modeling into scene text recognition, using linguistic context to correct visual errors
2022: PARSeq (Bautista & Atienza) uses permutation language modeling — reading text in multiple orders during training — achieving 97%+ on standard benchmarks
2023: MAERec applies masked autoencoding to text recognition pretraining, improving performance on irregular and low-quality text
2023: CLIP4STR leverages CLIP's visual-linguistic pretraining for text recognition, bridging scene understanding and reading
2023: Union14M benchmark (Jiang et al.) provides challenging real-world evaluation; end-to-end spotters eliminate the separate detection/recognition split
How Scene Text Recognition Works
Text Rectification (Optional)
A Spatial Transformer Network (STN) or Thin Plate Spline (TPS) transformation warps curved or distorted text to a roughly horizontal, rectangular shape. This preprocessing step significantly helps recognition of curved text.
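Full TPS rectification fits learned control points, but the underlying primitive is the same differentiable grid sample an STN performs: map each output pixel back into the source image and interpolate. A minimal NumPy sketch of that sampling step, shown here with a plain affine warp (the function names are invented for illustration; a real STN learns the transform parameters from the image):

```python
import numpy as np

def bilinear_sample(img, xs, ys):
    """Sample img (H, W) at float coords (xs, ys) with bilinear interpolation."""
    h, w = img.shape
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    dx, dy = xs - x0, ys - y0
    return ((1 - dx) * (1 - dy) * img[y0, x0] +
            dx * (1 - dy) * img[y0, x0 + 1] +
            (1 - dx) * dy * img[y0 + 1, x0] +
            dx * dy * img[y0 + 1, x0 + 1])

def affine_rectify(img, theta, out_h, out_w):
    """STN-style grid sample: warp img with a 2x3 affine matrix theta.
    Grid coordinates are normalized to [-1, 1], as in a spatial transformer."""
    ys, xs = np.mgrid[0:out_h, 0:out_w]
    xn = 2 * xs / (out_w - 1) - 1            # normalized target grid
    yn = 2 * ys / (out_h - 1) - 1
    # map each target pixel back into source-image coordinates
    src_x = theta[0, 0] * xn + theta[0, 1] * yn + theta[0, 2]
    src_y = theta[1, 0] * xn + theta[1, 1] * yn + theta[1, 2]
    h, w = img.shape
    px = (src_x + 1) * (w - 1) / 2
    py = (src_y + 1) * (h - 1) / 2
    return bilinear_sample(img, px, py)

# Identity transform reproduces the input; TPS replaces the affine map
# with a thin-plate-spline warp fitted to detected control points.
img = np.arange(12.0).reshape(3, 4)
identity = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
out = affine_rectify(img, identity, 3, 4)
```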
Feature Encoding
A CNN (typically a ResNet variant such as ResNet-45) or a Vision Transformer (ViT) encodes the rectified text image into a sequence of feature vectors — one per vertical slice or patch. The encoder must capture character shapes, stroke patterns, and contextual cues.
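The key structural point is the shape transformation: a fixed-height crop becomes a left-to-right sequence of feature vectors for the decoder. A shape-only toy sketch (the pooling and random projection here are stand-ins; `encode_to_sequence` and its parameters are invented, and a real encoder learns its features):

```python
import numpy as np

def encode_to_sequence(img, channels=512, width_stride=4):
    """Toy stand-in for a CRNN-style encoder: collapse the height axis and
    downsample the width, yielding one feature vector per vertical slice."""
    h, w = img.shape
    t = w // width_stride
    # group the width into t slices of width_stride consecutive pixels
    slices = img[:, : t * width_stride].reshape(h, t, width_stride)
    pooled = slices.mean(axis=(0, 2))            # (T,) crude per-slice statistic
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((1, channels))    # fake 1 -> channels projection
    return pooled[:, None] * proj                # (T, channels)

# A 32x100 grayscale crop becomes 25 time steps of 512-d features.
feats = encode_to_sequence(np.zeros((32, 100)))
print(feats.shape)  # (25, 512)
```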
Sequence Decoding
CTC decoder: predicts a character distribution at each time step independently, then collapses repeated labels and removes blanks.
Attention decoder: generates characters autoregressively, attending to different spatial positions for each output character.
PARSeq: predicts all positions in parallel, made robust by permutation-order training.
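The CTC collapse rule is simple enough to sketch directly. A minimal greedy decoder (the charset and helper name are illustrative; beam search over the full distribution is used when a language model is attached):

```python
import numpy as np

# Hypothetical charset; index 0 plays the CTC blank.
CHARSET = "-abcdefghijklmnopqrstuvwxyz0123456789"

def ctc_greedy_decode(logits):
    """Greedy CTC decoding: argmax per time step, merge adjacent repeats,
    then drop blanks. logits: (T, num_classes) array of scores."""
    best = logits.argmax(axis=1)                   # best class per time step
    collapsed = [k for i, k in enumerate(best)
                 if i == 0 or k != best[i - 1]]    # merge repeated labels
    return "".join(CHARSET[k] for k in collapsed if k != 0)  # remove blanks

# Blanks let CTC emit genuine double letters: "l-l" survives the merge.
steps = [CHARSET.index(c) for c in "hh-ee-ll-l-oo"]
logits = np.eye(len(CHARSET))[steps]               # one-hot "scores"
print(ctc_greedy_decode(logits))  # hello
```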
Language Modeling
ABINet and successors integrate explicit language models that refine character predictions using linguistic context ('teh' → 'the'). This corrects visually ambiguous characters (l vs. I vs. 1) using word-level knowledge.
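ABINet's language model is a learned bidirectional transformer that iteratively refines the visual branch's character probabilities. As a very crude stand-in for the same idea — snapping a noisy visual reading to a linguistically plausible word — a lexicon-nearest-match sketch (the lexicon and `lm_correct` helper are invented for illustration):

```python
from difflib import get_close_matches

# Toy lexicon standing in for a learned language model's word knowledge.
LEXICON = ["the", "shop", "open", "sale", "street", "coffee"]

def lm_correct(word, cutoff=0.6):
    """Snap a visually-decoded word to the closest lexicon entry,
    mimicking (very loosely) how word-level context fixes visual errors."""
    match = get_close_matches(word.lower(), LEXICON, n=1, cutoff=cutoff)
    return match[0] if match else word

print(lm_correct("teh"))     # the
print(lm_correct("c0ffee"))  # coffee  (visually ambiguous 0 vs o)
```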
Evaluation
Word-level accuracy on standard benchmarks: IIIT5K, SVT, IC13, IC15, SVTP, CUTE80. Modern SOTA exceeds 97% on most of these. Union14M provides a harder evaluation with 3.2M real-world samples including curved, occluded, and low-resolution text.
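On these benchmarks, predictions are commonly compared case-insensitively over the 36-character lowercase alphanumeric set. A minimal sketch of that word-accuracy metric (the normalization shown reflects that common convention; exact filtering rules vary by benchmark):

```python
import re

def normalize(text):
    """Common STR evaluation filter: lowercase, keep only [a-z0-9]."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def word_accuracy(preds, gts):
    """Fraction of crops whose full predicted word matches the ground
    truth after case/punctuation normalization."""
    hits = sum(normalize(p) == normalize(g) for p, g in zip(preds, gts))
    return hits / len(gts)

preds = ["Hello", "w0rld", "Cafe!"]
gts   = ["hello", "world", "cafe"]
print(word_accuracy(preds, gts))  # 2/3: "w0rld" != "world"
```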
Current Landscape
Scene text recognition in 2025 has reached high maturity on standard benchmarks — the top methods (PARSeq, ABINet, CLIP4STR) all exceed 97% on the classic evaluation sets. Research focus has shifted to harder scenarios: the Union14M benchmark with 3.2M challenging real-world samples, multi-line recognition, and cross-lingual text. The integration of language models into recognition (ABINet) was a major advance — using linguistic context to correct visually ambiguous characters. The field is converging with general VLMs, which can read scene text as part of broader image understanding.
Key Challenges
Heavily occluded text — when 30-50% of characters are blocked by objects, shadows, or other text, recognition accuracy drops dramatically
Extreme aspect ratios — very long text strings (URLs, addresses) and very short ones (single characters) require different processing strategies
Out-of-vocabulary words — proper nouns, URLs, product codes, and foreign words that don't appear in training data or language models
Multilingual text — recognizing text in non-Latin scripts (Arabic, Thai, Devanagari) requires script-specific models and training data
Low resolution — text from distant signs or surveillance cameras may be <20px height, pushing below the recognition threshold
Quick Recommendations
Best accuracy
PARSeq or CLIP4STR
97%+ on standard benchmarks; PARSeq's permutation training provides robustness across text lengths and styles
Irregular/curved text
ABINet++ or MAERec
Strong on irregular text thanks to iterative correction and masked autoencoding pretraining
Real-time / mobile
CRNN with MobileNet backbone or PP-OCRv4 recognition
Lightweight models that run at 100+ FPS on GPU, suitable for mobile and embedded deployment
Multilingual
PaddleOCR multilingual recognition or Surya
PaddleOCR supports 80+ language recognition models; Surya optimizes for multilingual accuracy
End-to-end (detection + recognition)
DeepSolo or PaddleOCR v4 pipeline
Single model or tightly integrated pipeline that detects and reads text without separate components
What's Next
The field is moving toward: (1) unified detection + recognition in single models (end-to-end text spotting), (2) reading text in context — using surrounding visual information to disambiguate, (3) video text recognition with temporal aggregation for improved accuracy on moving cameras, and (4) zero-shot recognition of new scripts via visual analogy. VLMs will likely subsume scene text recognition for most applications, with dedicated models persisting only for real-time edge deployment.
Benchmarks & SOTA
State of the art per dataset (word accuracy %, from Papers With Code):

svt          CLIP4STR-H (DFN-5B)           99.1
iiit5k       CLIP4STR-L (DataComp-1B)      99.6
cute80       CPPD                          99.7
svtp         DTrOCR 105M                   98.6
icdar-2003   Yet Another Text Recognizer   97.1
wost         CLIP4STR-H (DFN-5B)           90.9
uber-text    CLIP4STR-L (DataComp-1B)      92.2
host         CLIP4STR-L                    82.7
msda         MetaSelf-Learning             42.0
svt-p        ABINet-LV+TPS++               89.6
ic13         ABINet-LV+TPS++               97.8