Dataset from Papers With Code
ROUGE-based evaluation is saturated: no significant improvements have been reported since 2022. Modern summarization work instead relies on LLM-as-judge protocols (e.g. G-Eval), human preference evaluations, or factual-consistency metrics.
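For context, ROUGE measures n-gram overlap between a candidate and a reference summary. Below is a minimal pure-Python sketch of ROUGE-1 F1 (unigram overlap); it is illustrative only and not the official `rouge-score` implementation, which adds stemming and tokenization details.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Clipped overlap: each unigram counts at most as often as in the reference.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the cat sat on the mat", "the cat sat"))
```

For the pair above, precision is 1.0 and recall 0.5, giving F1 of about 0.667; saturation means SOTA systems all cluster near the same overlap scores.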
101 results indexed across 4 metrics. The shaded row marks the current SOTA; ties are broken by submission date.
| # | Model | Org | Submitted | Paper / code | Perplexity (ppl) |
|---|---|---|---|---|---|
| 01 | Bottom-Up Sum | — | Aug 2018 | Bottom-Up Abstractive Summarization · code | 32.75 |
| 02 | C2F + ALTERNATE | — | Sep 2017 | papers-with-code | 23.60 |
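The ppl column reports perplexity: the exponential of the average negative log-likelihood the model assigns to each reference token (lower is better). A small sketch of the computation; the function name and the list-of-log-probabilities input format are assumptions for illustration.

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token).

    token_logprobs: natural-log probability the model assigned
    to each token of the reference sequence.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model that assigns probability 0.5 to every token has perplexity 2.
print(perplexity([math.log(0.5)] * 4))
```

A ppl of 23.60 thus means the model is, on average, as uncertain per token as a uniform choice among about 24 options.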
Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.
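A reproduction script for such a submission might look like the hypothetical `reproduce.py` skeleton below. The JSONL file format, the field names, and the use of a self-contained ROUGE-1 stand-in metric are all assumptions for illustration; the actual harness and scoring metric are dataset-specific.

```python
import json
import sys
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1, used here as a stand-in scoring metric."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def score_records(records: list[dict]) -> float:
    """Average score over {"reference": ..., "prediction": ...} records."""
    scores = [rouge1_f1(r["reference"], r["prediction"]) for r in records]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    # Usage: python reproduce.py predictions.jsonl
    with open(sys.argv[1]) as f:
        records = [json.loads(line) for line in f]
    print(f"corpus score: {score_records(records):.4f}")
```

The point of requiring such a script is that the leaderboard score can be recomputed from the submitted checkpoint's outputs alone, without trusting the reported number.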