Computer Vision

Scene Text Recognition

Recognizing text in natural scene images

11 datasets · 127 results

Scene text recognition reads the text content from cropped images of text regions detected in natural scenes. Diverse fonts, geometric distortion, partial occlusion, and variable illumination make it much harder than printed-document OCR. Modern methods (ABINet, PARSeq, MAERec) achieve 97%+ accuracy on standard benchmarks, but real-world irregular text, especially multilingual text, remains challenging.

History

2003

ICDAR scene text recognition competitions begin; early methods use HOG + SVM character classification

2015

CRNN (Shi et al.) combines CNN feature extraction with BiLSTM sequence modeling and CTC loss — becomes the standard architecture for years

2016

Attention-based recognition arrives: RARE pairs a spatial transformer network (STN) with an attention decoder to rectify distorted text before recognition; ASTER refines this approach in 2018

2019

MORAN and ESIR improve text rectification, pushing accuracy on curved text benchmarks significantly

2021

ABINet introduces autonomous, bidirectional, and iterative language modeling into scene text recognition, using linguistic context to correct visual errors

2022

PARSeq (Bautista & Atienza) uses permutation language modeling — reading text in multiple orders during training — achieving 97%+ on standard benchmarks

2023

MAERec applies masked autoencoding to text recognition pretraining, improving performance on irregular and low-quality text

2024

CLIP4STR leverages CLIP's visual-linguistic pretraining for text recognition, bridging scene understanding and reading

2025

The Union14M benchmark anchors challenging real-world evaluation; end-to-end spotters eliminate the separate detection/recognition split

How Scene Text Recognition Works

Scene Text Recognition Pipeline: text rectification → feature encoding → sequence decoding → language modeling → evaluation
1. Text Rectification (Optional)

A Spatial Transformer Network (STN) or Thin Plate Spline (TPS) transformation warps curved or distorted text to a roughly horizontal, rectangular shape. This preprocessing step significantly helps recognition of curved text.
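The core of an STN is a differentiable image sampler. Below is a minimal numpy sketch of bilinear sampling under an affine transform; in a real STN the transform is predicted by a small localization network, and a TPS warp replaces the affine map with a nonrigid control-point grid. The function name and shapes are illustrative, not from any particular library.

```python
import numpy as np

def affine_grid_sample(img, theta):
    """Warp a grayscale image (H, W) with a 2x3 affine matrix theta
    using bilinear sampling over a normalized [-1, 1] coordinate grid.
    This sampler is the differentiable heart of an STN."""
    H, W = img.shape
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W),
                         indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])  # (3, H*W)
    sx, sy = theta @ coords                # source coords for each target pixel
    # Map normalized coords back to pixel indices and sample bilinearly.
    px = (sx + 1) * (W - 1) / 2
    py = (sy + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(px).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(py).astype(int), 0, H - 2)
    wx, wy = px - x0, py - y0
    out = (img[y0, x0] * (1 - wx) * (1 - wy)
           + img[y0, x0 + 1] * wx * (1 - wy)
           + img[y0 + 1, x0] * (1 - wx) * wy
           + img[y0 + 1, x0 + 1] * wx * wy)
    return out.reshape(H, W)

# The identity transform leaves the image unchanged.
identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
```

Because the sampling weights are continuous in theta, gradients flow back to the localization network, which is what lets rectification be learned end to end.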

2. Feature Encoding

A CNN (ResNet-45) or ViT encodes the rectified text image into a sequence of feature vectors — one per vertical slice or patch. The encoder must capture character shapes, stroke patterns, and contextual cues.
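The essential contract here is the shape flow: a 2D text crop becomes a left-to-right sequence of feature vectors. The toy sketch below fakes the encoder with simple pooling (a real ResNet-45 or ViT learns its features); only the image → (T, D) shape contract is the point.

```python
import numpy as np

def encode_as_sequence(img, num_steps=32, feat_dim=8):
    """Toy stand-in for a CNN/ViT encoder: slice a (32, 128) grayscale
    crop into num_steps vertical strips and pool each strip into a
    feat_dim vector. Real encoders learn these features; this sketch
    only demonstrates the image -> (T, D) sequence contract."""
    H, W = img.shape
    strip_w = W // num_steps
    feats = []
    for t in range(num_steps):
        strip = img[:, t * strip_w:(t + 1) * strip_w]          # (H, strip_w)
        # Crude "features": mean over feat_dim horizontal bands.
        bands = strip.reshape(feat_dim, H // feat_dim, -1).mean(axis=(1, 2))
        feats.append(bands)
    return np.stack(feats)                                     # (T, D)

seq = encode_as_sequence(np.random.rand(32, 128))
# seq.shape == (32, 8): one feature vector per horizontal position
```

The decoder that follows only ever sees this (T, D) sequence, which is why the same decoding machinery works across CNN and ViT backbones.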

3. Sequence Decoding

Two main families exist. A CTC decoder predicts character probabilities at each timestep independently, then collapses consecutive repeats and removes blanks. An attention decoder generates characters autoregressively, attending to different spatial positions for each output character. PARSeq instead predicts positions in parallel, trained with permuted reading orders.
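Greedy CTC decoding fits in a few lines. The charset and per-timestep scores below are made up for illustration; real models emit softmax distributions over a learned charset.

```python
def ctc_greedy_decode(logits, charset, blank=0):
    """Greedy CTC decoding: argmax character at each timestep, collapse
    consecutive repeats, then drop blanks. `logits` is a T x C list of
    per-timestep class scores; index `blank` is the CTC blank symbol."""
    best = [max(range(len(row)), key=row.__getitem__) for row in logits]
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:    # collapse repeats, skip blanks
            out.append(charset[idx])
        prev = idx
    return "".join(out)

# Illustrative charset: index 0 is the blank, then 'a', 'b', 'c'.
charset = "-abc"
# T=6 timesteps over 4 classes; a blank separates the two 'a' emissions,
# so the collapsed output is "aab", not "ab".
logits = [
    [0.1, 0.8, 0.05, 0.05],   # a
    [0.1, 0.8, 0.05, 0.05],   # a (repeat, collapsed)
    [0.9, 0.05, 0.03, 0.02],  # blank
    [0.1, 0.8, 0.05, 0.05],   # a (new emission after blank)
    [0.1, 0.1, 0.7, 0.1],     # b
    [0.9, 0.05, 0.03, 0.02],  # blank
]
print(ctc_greedy_decode(logits, charset))  # -> "aab"
```

The blank symbol is what lets CTC output genuine double letters ("aa") despite the repeat-collapsing rule, which is exactly the case the example exercises.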

4. Language Modeling

ABINet and successors integrate explicit language models that refine character predictions using linguistic context ('teh' → 'the'). This corrects visually ambiguous characters (l vs. I vs. 1) using word-level knowledge.
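A crude word-level stand-in for this idea is snapping a raw visual prediction to the nearest lexicon word by edit distance. ABINet's actual language model is learned and refines character probabilities iteratively, not whole words; this sketch only conveys the intuition of correcting visually plausible misreads (here '0' for 'o').

```python
def edit_distance(a, b):
    """Levenshtein distance via a single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # dp[j]: deletion, dp[j-1]: insertion, prev: substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def lexicon_correct(pred, lexicon, max_dist=1):
    """Snap a raw prediction to the closest lexicon word if it is within
    max_dist edits; otherwise trust the visual prediction as-is."""
    best = min(lexicon, key=lambda w: edit_distance(pred, w))
    return best if edit_distance(pred, best) <= max_dist else pred

lexicon = ["the", "shop", "open", "sale"]   # toy lexicon for the sketch
print(lexicon_correct("sh0p", lexicon))     # -> "shop"
```

The `max_dist` threshold matters: without it, out-of-vocabulary strings (URLs, product codes) would be destructively "corrected" to unrelated dictionary words, one reason learned character-level refinement is preferred in practice.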

5. Evaluation

Word-level accuracy on standard benchmarks: IIIT5K, SVT, IC13, IC15, SVTP, CUTE80. Modern SOTA exceeds 97% on most of these. Union14M provides a harder evaluation with 3.2M real-world samples including curved, occluded, and low-resolution text.
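Word accuracy under the common 36-character protocol (case-insensitive, digits and Latin letters only) can be computed as below; exact filtering rules vary slightly by benchmark, so treat the normalization here as one common convention rather than a universal standard.

```python
import re

def normalize(word):
    """36-character protocol: lowercase, keep only digits and Latin
    letters; punctuation is ignored on most classic benchmarks."""
    return re.sub(r"[^0-9a-z]", "", word.lower())

def word_accuracy(preds, gts):
    """Fraction of crops whose full predicted word matches ground truth
    after normalization. No partial credit for near-misses."""
    hits = sum(normalize(p) == normalize(g) for p, g in zip(preds, gts))
    return hits / len(gts)

preds = ["SALE!", "0pen", "the"]
gts = ["sale", "open", "the"]
print(word_accuracy(preds, gts))  # 2 of 3: "0pen" vs "open" still counts as wrong
```

The all-or-nothing scoring is why a single confused character (0 vs. o, l vs. I) costs a full word, and why language-model refinement moves benchmark numbers as much as it does.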

Current Landscape

Scene text recognition in 2025 has reached high maturity on standard benchmarks — the top methods (PARSeq, ABINet, CLIP4STR) all exceed 97% on the classic evaluation sets. Research focus has shifted to harder scenarios: the Union14M benchmark with 3.2M challenging real-world samples, multi-line recognition, and cross-lingual text. The integration of language models into recognition (ABINet) was a major advance — using linguistic context to correct visually ambiguous characters. The field is converging with general VLMs, which can read scene text as part of broader image understanding.

Key Challenges

Heavily occluded text — when 30-50% of characters are blocked by objects, shadows, or other text, recognition accuracy drops dramatically

Extreme aspect ratios — very long text strings (URLs, addresses) and very short ones (single characters) require different processing strategies

Out-of-vocabulary words — proper nouns, URLs, product codes, and foreign words that don't appear in training data or language models

Multilingual text — recognizing text in non-Latin scripts (Arabic, Thai, Devanagari) requires script-specific models and training data

Low resolution — text from distant signs or surveillance cameras may be <20px height, pushing below the recognition threshold

Quick Recommendations

Best accuracy

PARSeq or CLIP4STR

97%+ on standard benchmarks; PARSeq's permutation training provides robustness across text lengths and styles

Irregular/curved text

ABINet++ or MAERec

Strong on irregular text thanks to iterative correction and masked autoencoding pretraining

Real-time / mobile

CRNN with MobileNet backbone or PP-OCRv4 recognition

Lightweight models that run at 100+ FPS on GPU, suitable for mobile and embedded deployment

Multilingual

PaddleOCR multilingual recognition or Surya

PaddleOCR supports 80+ language recognition models; Surya optimizes for multilingual accuracy

End-to-end (detection + recognition)

DeepSolo or PaddleOCR v4 pipeline

Single model or tightly integrated pipeline that detects and reads text without separate components

What's Next

The field is moving toward: (1) unified detection + recognition in single models (end-to-end text spotting), (2) reading text in context — using surrounding visual information to disambiguate, (3) video text recognition with temporal aggregation for improved accuracy on moving cameras, and (4) zero-shot recognition of new scripts via visual analogy. VLMs will likely subsume scene text recognition for most applications, with dedicated models persisting only for real-time edge deployment.

