Saturated & Legacy Benchmarks

Benchmarks that have reached saturation (no significant improvements in years) or have been superseded by newer evaluation methods.

15 Saturated
3 Legacy

Why track saturated benchmarks?

Some benchmarks become "solved" over time: models reach near-human or ceiling performance, making further improvements marginal. Others are superseded by more comprehensive evaluation methods. We flag these so researchers can focus on benchmarks where progress is still meaningful, while preserving historical context for reference.

Saturated Benchmarks

No significant SOTA improvements in 2+ years. Consider using recommended alternatives.
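The "no significant SOTA improvements in 2+ years" criterion can be approximated mechanically. A minimal sketch in Python; the two-year window and the 0.005 gain threshold are illustrative assumptions, not the actual flagging rules used for this list:

```python
from datetime import date

def is_saturated(sota_history, as_of, window_years=2, min_gain=0.005):
    """Flag a benchmark as saturated if its best score has not improved
    by more than `min_gain` (absolute) within the last `window_years`.

    sota_history: iterable of (date, score) pairs, in any order.
    """
    cutoff = date(as_of.year - window_years, as_of.month, as_of.day)
    older = [score for d, score in sota_history if d < cutoff]
    recent = [score for d, score in sota_history if d >= cutoff]
    if not older:
        return False  # benchmark too new to judge
    best_old = max(older)
    best_now = max(recent, default=best_old)
    return best_now - best_old <= min_gain
```

A benchmark stuck at ~0.95 since 2019 would be flagged; one with a large jump in the last two years would not.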

rvl-cdip (37 results)
Computer Vision/Document Image Classification

Benchmark near ceiling or stagnant — no meaningful SOTA movement in 2+ years

Computer Vision/Document Layout Analysis

Benchmark abandoned or no longer evaluated by the community

Computer Vision/Document Layout Analysis | Last update: Sep 2019

Competition benchmark from 2019. Top score (Overall F1 ~0.92) held by fglihai and USYD NLP_CS29-2. Research community moved to DocLayNet, D4LA, and DocStructBench for newer benchmarking. No new papers report on this specific test set in 2024-2025.

Computer Vision/Image Classification

Benchmark near ceiling or stagnant — no meaningful SOTA movement in 2+ years

Computer Vision/Image Classification

Benchmark near ceiling or stagnant — no meaningful SOTA movement in 2+ years

Computer Vision/Image Classification

Benchmark near ceiling or stagnant — no meaningful SOTA movement in 2+ years

Computer Vision/Image Classification

Benchmark near ceiling or stagnant — no meaningful SOTA movement in 2+ years

cnn/daily-mail (101 results)
Natural Language Processing/Text Summarization | Last update: Jun 2022

ROUGE-based evaluation is saturated. No significant improvements since 2022. Modern summarization uses LLM-as-judge (G-Eval), human preference evaluations, or factual consistency metrics.
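For context, ROUGE is a lexical n-gram overlap metric, which is why it plateaus once models produce fluent extractive-style summaries. A simplified from-scratch sketch of ROUGE-1 F1 (no stemming or the official Perl tooling, purely illustrative):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Near the ceiling, a paraphrase that human judges prefer can score lower than a near-verbatim extract, which is one reason evaluation moved to LLM-as-judge and factual consistency metrics.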

e2e (45 results)
Natural Language Processing/Data-to-Text Generation

Benchmark abandoned or no longer evaluated by the community

howsumm-method (12 results)
Natural Language Processing/Text Summarization

Benchmark abandoned or no longer evaluated by the community

howsumm-step (16 results)
Natural Language Processing/Text Summarization

Benchmark abandoned or no longer evaluated by the community

pendigits (20 results)
Computer Vision/Optical Character Recognition

Benchmark near ceiling or stagnant — no meaningful SOTA movement in 2+ years

wikibio (18 results)
Natural Language Processing/Data-to-Text Generation

Benchmark abandoned or no longer evaluated by the community

icdar-2003 (12 results)
Computer Vision/Scene Text Recognition

Benchmark abandoned or no longer evaluated by the community

Natural Language Processing/Text Summarization | Last update: Jun 2022

ROUGE-based evaluation is saturated. No significant improvements since 2022. Modern summarization uses LLM-as-judge (G-Eval), human preference evaluations, or factual consistency metrics.

Legacy Benchmarks

Older benchmarks that have been largely superseded by newer datasets or evaluation methods.

icdar2013 (39 results)
Computer Vision/Optical Character Recognition | Last update: Jan 2019

Legacy benchmark from 2013. For current OCR evaluation, use OCRBench v2, ICDAR 2015, or newer benchmarks.

icdar-2013 (59 results)
Computer Vision/Scene Text Detection | Last update: Jan 2019

Legacy benchmark from 2013. For current OCR evaluation, use OCRBench, ICDAR 2019/2021, or DocVQA.

Computer Vision/Table Recognition | Last update: Jan 2019

Legacy table structure benchmark from 2013. Consider using PubTabNet or newer table recognition datasets.
