Saturated & Legacy Benchmarks

Benchmarks that have reached saturation (no significant improvements in years) or have been superseded by newer evaluation methods.

Why track saturated benchmarks?

Some benchmarks become "solved" over time: models reach near-human or ceiling performance, making further improvements marginal. Others are superseded by more comprehensive evaluation methods. We flag these so researchers can focus on benchmarks where progress is still meaningful, while preserving historical context for reference.

Saturated Benchmarks

No significant SOTA improvements in 2+ years. Consider using recommended alternatives.
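
The 2+ year rule is a heuristic, but it is easy to approximate mechanically. Below is a minimal sketch that flags a benchmark when its best score has barely moved inside a trailing window; the result history, the two-year window, and the 0.5-point gain threshold are all made-up assumptions for illustration, not this tracker's actual criterion.

```python
from datetime import date, timedelta

# Illustrative (made-up) SOTA history for one benchmark: (date, best score).
results = [
    (date(2018, 11, 5), 41.2),
    (date(2019, 8, 20), 43.8),
    (date(2020, 5, 1), 44.3),
    (date(2021, 3, 9), 44.4),
    (date(2022, 6, 2), 44.5),
]

def is_saturated(history, window_days=730, min_gain=0.5):
    """Flag a benchmark as saturated when the best score improved by no
    more than `min_gain` inside the trailing `window_days` window.
    Both thresholds are illustrative assumptions, not this page's rule."""
    history = sorted(history)
    cutoff = history[-1][0] - timedelta(days=window_days)
    before = [score for d, score in history if d < cutoff]
    recent = [score for d, score in history if d >= cutoff]
    if not before:
        return False  # not enough history to call saturation
    return max(recent) - max(before) <= min_gain

print(is_saturated(results))  # True: only +0.2 over the trailing two years
```

Comparing window maxima, rather than fitting a trend line, keeps the rule robust to individual noisy submissions.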

cnn-/-daily-mail (80 results)
Natural Language Processing/Text Summarization | Last update: Jun 2022

ROUGE-based evaluation is saturated, with no significant improvements since 2022. Modern summarization evaluation uses LLM-as-judge methods such as G-Eval, human preference studies, or factual consistency metrics.
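
For reference, the ROUGE numbers behind this leaderboard can still be reproduced locally. A minimal sketch using Google's rouge-score package (pip install rouge-score); the reference and prediction strings here are made-up examples, not CNN/DailyMail data:

```python
from rouge_score import rouge_scorer

# Made-up example texts; real evaluation pairs CNN/DailyMail articles
# with their reference highlights.
reference = "The committee approved the budget after a lengthy debate."
prediction = "After a long debate, the committee passed the budget."

# ROUGE-1/2/L F1 are the metrics traditionally reported on CNN/DailyMail.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, prediction)  # signature: score(target, prediction)

for name, s in scores.items():
    print(f"{name}: P={s.precision:.2f} R={s.recall:.2f} F1={s.fmeasure:.2f}")
```

N-gram overlap is precisely what saturated: an abstractive model can rephrase a summary correctly yet score poorly, which is part of why LLM-as-judge and factual consistency metrics displaced ROUGE.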

Legacy Benchmarks

Older benchmarks that have been largely superseded by newer datasets or evaluation methods.

icdar2013 (39 results)
Computer Vision/Optical Character Recognition | Last update: Jan 2019

Legacy benchmark from 2013. For current OCR evaluation, use OCRBench v2, ICDAR 2015, or newer benchmarks.

icdar-2013 (44 results)
Computer Vision/Scene Text Detection | Last update: Jan 2019

Legacy benchmark from 2013. For current OCR evaluation, use OCRBench, ICDAR 2019/2021, or DocVQA.

Computer Vision/Table Recognition | Last update: Jan 2019

Legacy table structure benchmark from 2013. Consider using PubTabNet or newer table recognition datasets.