Scene Text Detection
Detecting text regions in natural scene images
Scene text detection finds text in natural images — signs, labels, license plates, graffiti, product packaging, and screen text in the wild. Unlike document OCR (clean pages), scene text is distorted, curved, partially occluded, and embedded in complex backgrounds. DBNet++ and FAST achieve 90%+ F-measure on ICDAR benchmarks, but arbitrary-shaped text in challenging environments remains an active research area.
History
ICDAR Robust Reading Competition launches, establishing text detection and recognition in natural images as a benchmark task
Stroke Width Transform (Epshtein et al.) and MSER-based methods represent the pre-deep-learning state of the art
CTPN (Connectionist Text Proposal Network) adapts the anchor mechanism of Faster R-CNN to predict sequences of narrow, fixed-width text proposals, which a recurrent layer links into horizontal text lines
EAST (Efficient and Accurate Scene Text) detector achieves real-time text detection with direct geometry prediction, processing 720p at 13 FPS
CRAFT (Character Region Awareness for Text) detects individual characters and links them into words, handling curved and irregular text
DBNet (Differentiable Binarization) makes the binarization step differentiable, so a threshold map is learned jointly with the segmentation network; it reaches about 87% precision (roughly 85% F-measure) on Total-Text (curved text)
ABCNet detects and recognizes curved text end-to-end with Bezier curve representations
DBNet++ improves with adaptive scale fusion; FAST achieves real-time detection with lightweight architecture
Unified models (DeepSolo, ESTextSpotter) jointly handle detection and recognition; VLMs detect text implicitly via visual understanding
How Scene Text Detection Works
Feature Extraction
A backbone (ResNet-50, MobileNet) with a feature pyramid network (FPN) produces multi-scale feature maps. Multiple scales are critical because scene text varies enormously in size, from distant signs (10px) to close-up labels (500px).
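To make the scale problem concrete, the sketch below computes how large each pyramid level is and how many feature cells a line of text spans at each level. The strides 4, 8, 16, and 32 are an assumption (a common ResNet + FPN configuration), and the function names are illustrative only.

```python
import math

# Assumed pyramid strides; typical for ResNet + FPN, not taken from the text above.
ASSUMED_STRIDES = (4, 8, 16, 32)


def feature_map_sizes(height, width, strides=ASSUMED_STRIDES):
    """Return {stride: (rows, cols)} for each pyramid level."""
    return {s: (math.ceil(height / s), math.ceil(width / s)) for s in strides}


def text_height_in_cells(text_height_px, strides=ASSUMED_STRIDES):
    """Return {stride: number of feature cells the text height covers}."""
    return {s: text_height_px / s for s in strides}


if __name__ == "__main__":
    print(feature_map_sizes(640, 640))
    print(text_height_in_cells(10))   # distant sign
    print(text_height_in_cells(500))  # close-up label
```

Under these assumptions a 10px sign covers less than one cell at stride 32 but a few cells at stride 4, which is why the fine levels matter for small text and the coarse levels for large text.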
Text Region Prediction
The model predicts a probability map indicating which pixels belong to text. DBNet adds a learnable threshold map for binarization. Some methods predict character-level heatmaps (CRAFT) or contour points (TextSnake).
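The interaction between the probability map and the threshold map can be sketched with the differentiable binarization function from the DBNet paper, B = 1 / (1 + exp(-k (P - T))). This is a minimal per-pixel illustration on nested lists, not the authors' implementation; k = 50 follows the value reported in the paper, and the function name is hypothetical.

```python
import math


def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Approximate binary map: 1 / (1 + exp(-k * (P - T))) for each pixel.

    prob_map and thresh_map are same-shaped nested lists with values in [0, 1].
    k controls how steep the soft step is (assumed 50, as in the DBNet paper).
    """
    return [
        [1.0 / (1.0 + math.exp(-k * (p - t))) for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(prob_map, thresh_map)
    ]


if __name__ == "__main__":
    prob = [[0.9, 0.3, 0.1]]
    thresh = [[0.3, 0.3, 0.3]]
    print(differentiable_binarization(prob, thresh))
```

Pixels well above their local threshold map to values near 1, pixels well below map to values near 0, and the function stays differentiable in between, which is what lets the threshold map be trained.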
Geometry Estimation
For each text region, the model predicts a geometric representation: an axis-aligned bounding box, a rotated rectangle or quadrilateral (EAST), a polygon (for curved text), or a Bezier curve (ABCNet). Curved text detection requires flexible representations beyond axis-aligned boxes.
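As an illustration of the Bezier representation, the sketch below evaluates a cubic Bezier curve from four control points and samples points along it. ABCNet describes a text instance with two such curves (top and bottom boundary); the control points and helper names here are made up for the example.

```python
def cubic_bezier(control_points, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = control_points
    u = 1.0 - t
    # Bernstein basis polynomials of degree 3.
    b0, b1, b2, b3 = u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3
    x = b0 * x0 + b1 * x1 + b2 * x2 + b3 * x3
    y = b0 * y0 + b1 * y1 + b2 * y2 + b3 * y3
    return (x, y)


def sample_curve(control_points, n=5):
    """Return n evenly spaced (in t) points along the curve, n >= 2."""
    return [cubic_bezier(control_points, i / (n - 1)) for i in range(n)]


if __name__ == "__main__":
    # Hypothetical top boundary of an arched word.
    top = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
    print(sample_curve(top))
```

Sampling the top and bottom curves and joining the points yields a polygon, so a curved region is described by eight control points rather than a long vertex list.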
Post-Processing
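The probability map is binarized using a threshold (fixed or learned), connected components are extracted, and polygonal contours are fitted to each component. Regression-based detectors such as EAST additionally apply non-maximum suppression to remove duplicate detections. DBNet's differentiable binarization approximates the thresholding step during training, so the threshold map can be learned jointly with the probability map.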
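The binarize-then-group step can be sketched as follows. This is a simplified stand-in: it uses a fixed threshold of 0.3 (an assumed value), 4-connectivity, and axis-aligned boxes instead of fitted polygons, so it shows the control flow rather than any particular detector's post-processing.

```python
from collections import deque


def binarize(prob_map, threshold=0.3):
    """Mark pixels whose text probability exceeds the (assumed) threshold."""
    return [[1 if p > threshold else 0 for p in row] for row in prob_map]


def connected_component_boxes(binary):
    """Return one (x_min, y_min, x_max, y_max) box per 4-connected region."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited text pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            x_min = x_max = x
            y_min = y_max = y
            while queue:
                cy, cx = queue.popleft()
                x_min, x_max = min(x_min, cx), max(x_max, cx)
                y_min, y_max = min(y_min, cy), max(y_max, cy)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < height and 0 <= nx < width:
                        if binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes


if __name__ == "__main__":
    prob = [
        [0.9, 0.8, 0.1, 0.0],
        [0.7, 0.1, 0.0, 0.6],
        [0.0, 0.0, 0.0, 0.9],
    ]
    print(connected_component_boxes(binarize(prob)))
```

A production pipeline would typically replace the box step with contour extraction and polygon fitting from an image-processing library.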
Evaluation
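Detectors are scored by precision, recall, and F-measure, where a detection counts as correct if its IoU with a ground-truth region reaches 0.5. ICDAR 2015 (incidental scene text), Total-Text (curved text), CTW1500 (arbitrary shapes), and MSRA-TD500 (multi-oriented text lines in English and Chinese) are standard benchmarks.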
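The metric can be sketched as below. This is a simplified version under stated assumptions: boxes are axis-aligned (x1, y1, x2, y2) tuples, matching is greedy and one-to-one, and "don't care" regions are ignored. Official benchmark scripts use polygon IoU and their own matching rules, so numbers from this sketch are not comparable to leaderboard values.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_scores(predictions, ground_truths, iou_threshold=0.5):
    """Return (precision, recall, f_measure) with greedy one-to-one matching."""
    matched = set()
    true_positives = 0
    for pred in predictions:
        best_index, best_iou = None, iou_threshold
        for index, gt in enumerate(ground_truths):
            if index in matched:
                continue
            value = iou(pred, gt)
            if value >= best_iou:
                best_index, best_iou = index, value
        if best_index is not None:
            matched.add(best_index)
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truths) if ground_truths else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
    preds = [(0, 0, 10, 8), (50, 50, 60, 60), (21, 21, 30, 30)]
    print(detection_scores(preds, gts))
```

In the usage example two of three predictions match, so precision is 2/3 while recall is 1.0; the F-measure is the harmonic mean of the two.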
Current Landscape
Scene text detection in 2025 is mature for standard benchmarks — ICDAR 2015 and Total-Text F-measures have plateaued above 90% with multiple methods. DBNet/DBNet++ is the workhorse for production use due to its speed-accuracy balance. The field is consolidating: end-to-end methods that jointly detect and recognize (DeepSolo, ESTextSpotter) are replacing the traditional detect-then-recognize pipeline. Meanwhile, VLMs can localize text in images as an emergent capability, potentially making dedicated text detectors unnecessary for non-real-time applications.
Key Challenges
Arbitrary text shapes — curved signs, circular logos, text on bottles/cans, and vertically oriented text require flexible geometric representations beyond rectangles
Scale variation — a single image might contain text from 5px to 500px height; multi-scale detection is essential but computationally expensive
Dense text — product labels and signs often contain tightly packed text at multiple orientations, causing detection overlap and merging errors
Low contrast and camouflage — text that blends with the background (white text on light backgrounds, transparent overlays) is easily missed
Speed requirements — real-time applications (autonomous driving, AR) need text detection in under 30ms, limiting model complexity
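The tension between scale variation and speed can be made concrete with rough arithmetic. The sketch below assumes that the cost of a fully convolutional detector grows in proportion to the number of input pixels, which is only an approximation; the scale set and function names are illustrative.

```python
def relative_pyramid_cost(scales):
    """Cost of running at several input scales, relative to one pass at 1.0.

    Assumes compute is proportional to pixel count, so each scale s costs s * s.
    """
    return sum(s * s for s in scales)


def frames_per_second(latency_ms):
    """Convert a per-image latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    print(relative_pyramid_cost((0.5, 1.0, 2.0)))
    print(frames_per_second(30.0))
```

Under this assumption, testing at half, full, and double resolution costs about 5.25 times a single pass, which is hard to fit into a 30ms budget (roughly 33 frames per second).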
Quick Recommendations
Best accuracy
DBNet++ (ResNet-50) or TextDiffusion
91%+ F-measure on ICDAR 2015; DBNet++ handles most scene text scenarios robustly
Curved/arbitrary-shape text
ABCNet v2 or TextSnake
Bezier curve (ABCNet) and snake (TextSnake) representations handle curved and irregular text shapes
Real-time detection
FAST or EAST
FAST achieves 80%+ F-measure at 50+ FPS on GPU; suitable for mobile and embedded applications
End-to-end detection + recognition
DeepSolo or ESTextSpotter
Single model detects and recognizes in one pass; eliminates the need for a separate recognition model
Multilingual scene text
CRAFT + PaddleOCR recognition
CRAFT's character-level detection handles multiple scripts; PaddleOCR provides multilingual recognition
What's Next
Active research: video scene text detection (tracking text across frames in dashcam/security footage), text detection in 3D (from depth sensors and point clouds), and robustness to extreme conditions (night, rain, motion blur). The long-term trend is absorption into general vision models — text detection will become an implicit capability of VLMs rather than a standalone task. For the near term, efficiency improvements for edge deployment remain the most commercially relevant direction.
Benchmarks & SOTA
ICDAR 2015
ICDAR 2015 Incidental Scene Text
1000 training + 500 test images captured with wearable cameras. Industry standard for scene text detection.
State of the Art: TextFuseNet (ResNeXt-101), 93.96 precision
Total-Text
Curved text benchmark. 1555 images with polygon annotations.
State of the Art: FAST-T-448, 152.8 fps
msra-td500
Dataset from Papers With Code
State of the Art: FAST-T-512, 137.2 fps
icdar-2013
Dataset from Papers With Code
State of the Art: TrOCR-base 334M, 98.4 accuracy
icdar-2017-mlt
Dataset from Papers With Code
State of the Art: PMTD*, 84.42 precision
coco-text
Dataset from Papers With Code
State of the Art: CLIP4STR-L, 81.9 1-1-accuracy
CTW1500
Curved Text in the Wild 1500
1500 images with curved text annotations. Focus on arbitrary-shaped text.
State of the Art: DBNet++ (ResNet-50) (1024) (Liao et al.), 88.5 precision
ic19-art
Dataset from Papers With Code
State of the Art: CLIP4STR-L (DataComp-1B), 86.4 accuracy
Union14M
Union14M: A Unified Benchmark for Scene Text Recognition
Next-generation STR benchmark with 4M labeled + 10M unlabeled images. Accuracy drops 33-48% vs standard benchmarks (IIIT5K etc.), exposing real-world challenges like artistic text, multi-oriented, and occluded text.
State of the Art: CLIP4STR-B (Research), 70.8 accuracy
ICDAR 2019 ArT
ICDAR 2019 Arbitrary-Shaped Text
Text in arbitrary shapes including curved and rotated text. 10,166 images total.
State of the Art: pil_maskrcnn (ICT, Chinese Academy of Sciences), 82.65 f-measure
ic19-rects
Dataset from Papers With Code
State of the Art: BDN, 93.36 f-measure