Scene Text Detection

Detecting text regions in natural scene images

11 datasets · 581 results

Scene text detection finds text in natural images: signs, labels, license plates, graffiti, product packaging, and screen text in the wild. Unlike document OCR, which works on clean pages, scene text is often distorted, curved, partially occluded, and embedded in complex backgrounds. Detectors such as DBNet++ and FAST report F-measures approaching 90% on ICDAR benchmarks, but arbitrary-shaped text in challenging environments remains an active research area.

History

2003

ICDAR Robust Reading Competition launches, establishing text detection and recognition in natural images as a benchmark task

2012

Stroke Width Transform (Epshtein et al., 2010) and MSER-based methods represent the pre-deep-learning state of the art

2016

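CTPN (Connectionist Text Proposal Network) adapts the region proposal idea from Faster R-CNN to text, predicting a sequence of fixed-width vertical proposals that a recurrent layer links into near-horizontal text lines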

2017

EAST (Efficient and Accurate Scene Text) detector achieves real-time text detection with direct geometry prediction, processing 720p at 13 FPS

2019

CRAFT (Character Region Awareness for Text Detection) predicts character-level region and affinity heatmaps and links characters into words, handling curved and irregular text

DBNet (Differentiable Binarization) makes the binarization step differentiable so that a threshold map is learned jointly with the probability map, reaching about 85% F-measure (87% precision) on Total-Text (curved text)

2020

ABCNet detects and recognizes curved text end-to-end with Bezier curve representations

2022

DBNet++ improves with adaptive scale fusion; FAST achieves real-time detection with lightweight architecture

2024

Unified models (DeepSolo, ESTextSpotter) jointly handle detection and recognition; vision-language models (VLMs) localize text implicitly as part of general visual understanding

How Scene Text Detection Works

Feature Extraction

A backbone (ResNet-50, MobileNet) with FPN produces multi-scale feature maps. Multi-scale is critical because scene text varies enormously in size — from distant signs (10px) to close-up labels (500px).
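To make the scale problem concrete, here is a small arithmetic sketch. It assumes a four-level feature pyramid with output strides 4, 8, 16, and 32, which is a common FPN configuration but not something the text above specifies. The function names and the two-cell minimum are hypothetical choices made for illustration.

```python
# Assumed output strides of a four-level feature pyramid (P2 to P5).
FPN_STRIDES = (4, 8, 16, 32)


def cells_spanned(text_height_px, strides=FPN_STRIDES):
    """Map each stride to the text height measured in feature-map cells."""
    return {stride: text_height_px / stride for stride in strides}


def coarsest_usable_stride(text_height_px, min_cells=2.0, strides=FPN_STRIDES):
    """Return the largest stride at which the text still covers `min_cells` cells.

    Returns None when even the finest level cannot resolve the text.
    """
    usable = [s for s in strides if text_height_px / s >= min_cells]
    return max(usable) if usable else None


# Compare a distant sign (10px) with a close-up label (500px).
for height in (10, 500):
    print(height, cells_spanned(height), coarsest_usable_stride(height))
```

Under these assumptions a 10px sign covers less than one cell at stride 16 or 32, so only the finest level can represent it, while a 500px label is still well resolved at the coarsest level.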

Text Region Prediction

The model predicts a probability map indicating which pixels belong to text. DBNet adds a learnable threshold map for binarization. Other methods predict character-level region and affinity heatmaps (CRAFT) or a text center line with a local radius and orientation at each point (TextSnake).
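The idea behind a learnable threshold can be sketched in a few lines. The step function used for hard binarization is replaced by a steep sigmoid of the difference between the probability and the threshold, which is what the DBNet paper describes with an amplifying factor k = 50. The function names below are hypothetical and the sketch works on plain Python numbers rather than tensors.

```python
import math


def differentiable_binarization(p, t, k=50.0):
    """Soft binary value for probability p and threshold t: 1 / (1 + exp(-k (p - t)))."""
    return 1.0 / (1.0 + math.exp(-k * (p - t)))


def hard_binarization(p, t):
    """Standard step function, which has no useful gradient."""
    return 1.0 if p >= t else 0.0


# Toy 2x3 probability map and threshold map (values invented for illustration).
prob_map = [[0.9, 0.4, 0.1],
            [0.6, 0.3, 0.2]]
thresh_map = [[0.3, 0.3, 0.3],
              [0.5, 0.5, 0.5]]

soft = [[differentiable_binarization(p, t) for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(prob_map, thresh_map)]
hard = [[hard_binarization(p, t) for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(prob_map, thresh_map)]

print(soft)
print(hard)
```

Away from the threshold the soft values are nearly identical to the hard ones, but the sigmoid is smooth, so gradients can flow into both the probability map and the threshold map during training.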

Geometry Estimation

For each text region, the model predicts its geometric representation: an axis-aligned bounding box, a rotated rectangle or quadrilateral (EAST), a polygon (for curved text), or Bezier curves (ABCNet). Curved text detection requires flexible representations beyond axis-aligned boxes.
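As an illustration of the Bezier representation, the sketch below samples a cubic Bezier curve from four control points and joins a top and a bottom curve into a closed polygon. ABCNet describes a text instance with two such curves, but the function names, the control point values, and the left-to-right ordering of both curves are assumptions made for this example.

```python
def cubic_bezier(control_points, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = control_points
    # Bernstein basis polynomials of degree 3.
    b0 = (1 - t) ** 3
    b1 = 3 * t * (1 - t) ** 2
    b2 = 3 * t ** 2 * (1 - t)
    b3 = t ** 3
    return (b0 * x0 + b1 * x1 + b2 * x2 + b3 * x3,
            b0 * y0 + b1 * y1 + b2 * y2 + b3 * y3)


def sample_curve(control_points, n=5):
    """Return n evenly spaced samples along the curve, endpoints included."""
    return [cubic_bezier(control_points, i / (n - 1)) for i in range(n)]


def text_polygon(top, bottom, n=5):
    """Closed polygon: top curve left to right, then bottom curve right to left."""
    return sample_curve(top, n) + sample_curve(bottom, n)[::-1]


# Hypothetical control points for an arched word; both curves run left to right.
top = [(0, 10), (10, 0), (20, 0), (30, 10)]
bottom = [(0, 20), (10, 10), (20, 10), (30, 20)]

polygon = text_polygon(top, bottom, n=5)
print(polygon)
```

Eight control points are enough to describe a smoothly curved region, whereas a polygon of comparable quality needs many more vertices.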

Post-Processing

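In segmentation-based detectors, the probability map is binarized using a threshold (fixed or learned), connected components are extracted, and polygonal contours are fitted. Regression-based detectors such as EAST instead rely on non-maximum suppression to remove duplicate detections. DBNet's differentiable binarization lets the threshold map be trained end to end with the rest of the network; at inference a fixed threshold on the probability map is sufficient, and the resulting shrunk regions are dilated back to full text size.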
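The thresholding and connected-component steps can be sketched in pure Python. This is a simplified illustration, not any detector's actual post-processing: it uses 4-connectivity, fits axis-aligned boxes instead of polygonal contours, and the threshold, minimum region size, and function names are all hypothetical.

```python
from collections import deque


def binarize(prob_map, threshold):
    """Turn a probability map into a 0/1 mask with a fixed threshold."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]


def connected_components(mask):
    """Return 4-connected components as lists of (row, col) pixels."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited text pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Axis-aligned box (x_min, y_min, x_max, y_max) around a component."""
    rows = [p[0] for p in pixels]
    cols = [p[1] for p in pixels]
    return (min(cols), min(rows), max(cols), max(rows))


def detect_boxes(prob_map, threshold=0.5, min_pixels=2):
    """Binarize, group pixels, and drop regions smaller than min_pixels."""
    mask = binarize(prob_map, threshold)
    return [bounding_box(c) for c in connected_components(mask)
            if len(c) >= min_pixels]


prob_map = [
    [0.9, 0.8, 0.1, 0.0, 0.0],
    [0.7, 0.9, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.0, 0.0, 0.7],
    [0.0, 0.4, 0.0, 0.0, 0.8],
]
print(detect_boxes(prob_map, threshold=0.5))
```

In practice this step is usually done with an image library's contour or connected-component routines, and the region is fitted with a rotated rectangle or polygon rather than an axis-aligned box.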

Evaluation

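Detections are scored by precision, recall, and F-measure, with a detection counted as correct when it matches a ground-truth region at an IoU threshold of 0.5. ICDAR 2015 (incidental scene text), Total-Text (curved text), CTW1500 (arbitrary shapes), and MSRA-TD500 (multi-oriented Chinese and English text lines) are standard benchmarks.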
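The metric can be sketched as follows. This is a simplified version that assumes axis-aligned boxes and greedy one-to-one matching; the official benchmark protocols compute IoU on quadrilaterals or polygons and also handle "don't care" regions, which are omitted here. The function names and sample boxes are hypothetical.

```python
def area(box):
    """Area of an axis-aligned box (x_min, y_min, x_max, y_max)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def evaluate(detections, ground_truths, iou_threshold=0.5):
    """Return (precision, recall, f_measure) with one-to-one greedy matching."""
    matched = set()
    true_positives = 0
    for det in detections:
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx in matched:
                continue  # each ground-truth box can be matched only once
            score = iou(det, gt)
            if score > best_iou:
                best_iou, best_idx = score, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            true_positives += 1
    precision = true_positives / len(detections) if detections else 0.0
    recall = true_positives / len(ground_truths) if ground_truths else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_measure


ground_truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(1, 1, 10, 10), (20, 20, 26, 30), (50, 50, 60, 60)]
print(evaluate(detections, ground_truths))
```

In this toy case two of the three detections match a ground-truth box, so precision is 2/3, recall is 1.0, and the F-measure is 0.8.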

Current Landscape

Scene text detection in 2025 is mature for standard benchmarks — ICDAR 2015 and Total-Text F-measures have plateaued above 90% with multiple methods. DBNet/DBNet++ is the workhorse for production use due to its speed-accuracy balance. The field is consolidating: end-to-end methods that jointly detect and recognize (DeepSolo, ESTextSpotter) are replacing the traditional detect-then-recognize pipeline. Meanwhile, VLMs can localize text in images as an emergent capability, potentially making dedicated text detectors unnecessary for non-real-time applications.

Key Challenges

Arbitrary text shapes — curved signs, circular logos, text on bottles/cans, and vertically oriented text require flexible geometric representations beyond rectangles

Scale variation — a single image might contain text from 5px to 500px height; multi-scale detection is essential but computationally expensive

Dense text — product labels and signs often contain tightly packed text at multiple orientations, causing detection overlap and merging errors

Low contrast and camouflage — text that blends with the background (white text on light backgrounds, transparent overlays) is easily missed

Speed requirements — real-time applications (autonomous driving, AR) need text detection in under 30ms, limiting model complexity

Quick Recommendations

Best accuracy

DBNet++ (ResNet-50) or TextFuseNet

TextFuseNet reports 91%+ F-measure on ICDAR 2015; DBNet++ handles most scene text scenarios robustly

Curved/arbitrary-shape text

ABCNet v2 or TextSnake

Bezier curves (ABCNet) and chains of overlapping disks along the text center line (TextSnake) handle curved and irregular text shapes

Real-time detection

FAST or EAST

FAST achieves 80%+ F-measure at 50+ FPS on GPU; suitable for mobile and embedded applications

End-to-end detection + recognition

DeepSolo or ESTextSpotter

Single model detects and recognizes in one pass; eliminates the need for a separate recognition model

Multilingual scene text

CRAFT + PaddleOCR recognition

CRAFT's character-level detection handles multiple scripts; PaddleOCR provides multilingual recognition

What's Next

Active research: video scene text detection (tracking text across frames in dashcam/security footage), text detection in 3D (from depth sensors and point clouds), and robustness to extreme conditions (night, rain, motion blur). The long-term trend is absorption into general vision models — text detection will become an implicit capability of VLMs rather than a standalone task. For the near term, efficiency improvements for edge deployment remain the most commercially relevant direction.

Benchmarks & SOTA

ICDAR 2015

ICDAR 2015 Incidental Scene Text

2015 · 188 results

1000 training + 500 test images captured with wearable cameras. Industry standard for scene text detection.

State of the Art: TextFuseNet (ResNeXt-101), 93.96 precision

Total-Text

2017 · 126 results

Curved text benchmark. 1555 images with polygon annotations.

State of the Art: FAST-T-448, 152.8 fps

MSRA-TD500

2020 · 79 results

Dataset from Papers With Code

State of the Art: FAST-T-512, 137.2 fps

ICDAR 2013

2020 · 59 results

Dataset from Papers With Code

State of the Art: JSTR (Fujitake), 99.2 accuracy

ICDAR 2017 MLT

2020 · 54 results

Dataset from Papers With Code

State of the Art: PMTD*, 84.42 precision

COCO-Text

2020 · 33 results

Dataset from Papers With Code

State of the Art: CLIP4STR-L, 81.9 1-1-accuracy

CTW1500

Curved Text in the Wild 1500

2019 · 18 results

1500 images with curved text annotations. Focus on arbitrary-shaped text.

State of the Art: DBNet++ (ResNet-50) (1024) (Liao et al.), 88.5 precision

IC19-ArT

2020 · 11 results

Dataset from Papers With Code

State of the Art: CLIP4STR-L (DataComp-1B), 86.4 accuracy

Union14M

Union14M: A Unified Benchmark for Scene Text Recognition

2023 · 8 results

Next-generation scene text recognition (STR) benchmark with 4M labeled + 10M unlabeled images. Accuracy drops 33-48% vs standard benchmarks (IIIT5K etc.), exposing real-world challenges such as artistic, multi-oriented, and occluded text.

State of the Art: CLIP4STR-B (Research), 70.8 accuracy

ICDAR 2019 ArT

ICDAR 2019 Arbitrary-Shaped Text

2019 · 4 results

Text in arbitrary shapes including curved and rotated text. 10,166 images total.

State of the Art: pil_maskrcnn (ICT, Chinese Academy of Sciences), 82.65 f-measure

IC19-ReCTS

2020 · 1 result

Dataset from Papers With Code

State of the Art: BDN, 93.36 f-measure

Related Tasks

Open-Vocabulary Object Detection

Object detection with open vocabulary - detecting objects from arbitrary text descriptions without being limited to a fixed set of categories.

Video segmentation

Video segmentation is the task of partitioning video frames into multiple segments or objects. Unlike image segmentation which works on static images, video segmentation tracks objects across frames in a video sequence.

Object counting

Object counting in AI is a computer vision task that uses machine learning and image processing to identify and enumerate distinct objects within digital images and videos. It can differentiate between various object types, sizes, and shapes, even in crowded or dynamically changing scenes. The process typically involves object detection using deep learning models like convolutional neural networks (CNNs) to recognize and localize objects, followed by aggregation to provide a total count. This technology is applied in fields like manufacturing for quality control and production monitoring.

Image editing

Image editing is the process of altering and improving images, whether digital or traditional, using specialized tools and software to enhance their quality, appearance, and functionality. This can involve simple tasks like cropping and color correction or complex techniques such as layering, retouching to remove blemishes, and creating new composite images. The goal of image editing is to make images more aesthetically pleasing, correct flaws, or achieve a desired artistic effect.
