Saturated & Legacy Benchmarks
Benchmarks that have reached saturation (no significant improvements in years) or have been superseded by newer evaluation methods.
Why track saturated benchmarks?
Some benchmarks become "solved" over time: models reach near-human or ceiling performance, making further improvements marginal. Others are superseded by more comprehensive evaluation methods. We flag these so researchers can focus on benchmarks where progress is still meaningful, while preserving historical context for reference.
Saturated Benchmarks
No significant SOTA improvements in 2+ years. Consider using recommended alternatives.
Benchmark near ceiling or stagnant — no meaningful SOTA movement in 2+ years
Benchmark abandoned or no longer evaluated by the community
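The "near ceiling or stagnant" flag above can be expressed as a simple check over a benchmark's SOTA history. The sketch below is a hypothetical illustration, not the project's actual flagging code; the function name, the 0.005 improvement threshold, and the 2-year window are all assumptions.

```python
# Hypothetical sketch of the saturation rule described above:
# flag a benchmark "saturated" when no score set within the last
# `window_years` improves on the best earlier score by at least
# `min_gain`. Names and thresholds are illustrative assumptions.

from datetime import date, timedelta

def is_saturated(sota_history, today, min_gain=0.005, window_years=2):
    """sota_history: list of (date, score) pairs, in any order."""
    cutoff = today - timedelta(days=365 * window_years)
    before = [score for d, score in sota_history if d < cutoff]
    recent = [score for d, score in sota_history if d >= cutoff]
    if not before:
        return False  # benchmark too young to judge
    best_before = max(before)
    best_recent = max(recent, default=best_before)
    return best_recent - best_before < min_gain
```

A benchmark whose SOTA jumped recently would not be flagged; one whose best score has barely moved since before the window would be.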
Competition benchmark from 2019. The top score (Overall F1 ~0.92) is held by fglihai and USYD NLP_CS29-2. The research community has since moved to DocLayNet, D4LA, and DocStructBench for newer benchmarking; no new papers report results on this test set in 2024-2025.
ROUGE-based evaluation has saturated, with no significant improvements since 2022. Modern summarization evaluation instead uses LLM-as-judge methods (e.g., G-Eval), human preference evaluations, or factual consistency metrics.
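For context on what ROUGE-based evaluation measures, here is a minimal pure-Python sketch of ROUGE-L, which scores a candidate summary by the longest common subsequence (LCS) of tokens it shares with a reference. Function names and whitespace tokenization are simplifying assumptions; production work typically uses a packaged implementation such as `rouge-score`.

```python
# Minimal ROUGE-L F1 sketch (illustrative; names and tokenization
# are assumptions, not a reference implementation).

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because the metric only checks token overlap, a fluent but factually wrong summary can score highly, which is part of why the field moved toward LLM-as-judge and factual-consistency evaluation.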
Legacy Benchmarks
Older benchmarks that have been largely superseded by newer datasets or evaluation methods.
Legacy benchmark from 2013. For current OCR evaluation, use OCRBench v2, ICDAR 2015, or newer benchmarks.
Legacy benchmark from 2013. For current OCR evaluation, use OCRBench, ICDAR 2019/2021, or DocVQA.
Legacy table structure benchmark from 2013. Consider using PubTabNet or newer table recognition datasets.