Saturated & Legacy Benchmarks
Benchmarks that have reached saturation (no significant improvements in years) or have been superseded by newer evaluation methods.
Why track saturated benchmarks?
Some benchmarks become "solved" over time: models reach near-human or ceiling performance, so further improvements are marginal. Others are superseded by more comprehensive evaluation methods. We flag these so researchers can focus on benchmarks where progress is still meaningful, while preserving historical context for reference.
Saturated Benchmarks
No significant SOTA improvements in 2+ years. Consider using recommended alternatives.
ROUGE-based evaluation is saturated, with no significant improvements since 2022. Modern summarization evaluation relies on LLM-as-judge methods (e.g., G-Eval), human preference evaluations, or factual consistency metrics.
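To illustrate the contrast, below is a minimal Python sketch comparing reference-based ROUGE scoring (via the open-source rouge-score package, whose API is shown as documented) with a simplified G-Eval-style LLM-as-judge prompt. The call_judge_model function is a hypothetical placeholder for whatever LLM client you use, and the rubric is an illustrative simplification of G-Eval, not its exact prompt.

    # Contrast: reference-based ROUGE vs. a G-Eval-style LLM-as-judge rubric.
    # Requires: pip install rouge-score
    from rouge_score import rouge_scorer

    source_document = "The city council met for two hours and voted 7-2 to approve next year's budget."
    reference = "The council approved the budget after a two-hour debate."
    candidate = "After a lengthy debate, the council passed the budget."

    # Saturated approach: n-gram overlap against a single reference summary.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference, candidate)
    print({name: round(s.fmeasure, 3) for name, s in scores.items()})

    # Current approach: LLM-as-judge with an explicit rubric (simplified G-Eval style).
    JUDGE_PROMPT = """You will rate a summary of a source document.
    Criteria: coherence, consistency with the source, fluency, relevance.
    Score each criterion from 1 (poor) to 5 (excellent) and return JSON.

    Source:
    {source}

    Summary:
    {summary}
    """

    def call_judge_model(prompt: str) -> str:
        # Hypothetical placeholder: swap in your own LLM client
        # (API or local model) and return its text response.
        raise NotImplementedError

    # judgement = call_judge_model(JUDGE_PROMPT.format(source=source_document, summary=candidate))

The design point: ROUGE rewards lexical overlap with a fixed reference, while the judge prompt scores qualities like faithfulness to the source directly, which is why LLM-as-judge and factual consistency metrics have displaced ROUGE for modern summarization evaluation.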
Legacy Benchmarks
Older benchmarks that have been largely superseded by newer datasets or evaluation methods.
Legacy benchmark from 2013. For current OCR evaluation, use OCRBench v2, ICDAR 2015, or newer benchmarks.
Legacy benchmark from 2013. For current OCR evaluation, use OCRBench, ICDAR 2019/2021, or DocVQA.
Legacy table structure benchmark from 2013. Consider using PubTabNet or newer table recognition datasets.