300K news articles with multi-sentence summaries. Standard benchmark for abstractive summarization.
ROUGE-based evaluation is saturated, with no significant improvements since 2022. Modern summarization evaluation instead relies on LLM-as-judge protocols (G-Eval), human preference studies, or factual consistency metrics.
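For reference, here is a minimal sketch of how per-example ROUGE is computed, using Google's `rouge-score` package. The leaderboard reports F1 scaled by 100; the stemming setting and test-set aggregation shown here are assumptions and may differ from the official harness.

```python
# Minimal per-example ROUGE scoring sketch with the `rouge-score` package.
# Stemming and aggregation choices are assumptions, not the official setup.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "The cat sat on the mat and then went to sleep."
candidate = "A cat sat on the mat before falling asleep."
scores = scorer.score(reference, candidate)  # signature: score(target, prediction)
for metric, result in scores.items():
    print(f"{metric}: F1 = {100 * result.fmeasure:.2f}")
```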
15 results indexed across 3 metrics. The shaded row in each table marks the current SOTA; ties are broken by submission date.
| # | Model | Org | Submitted | Paper / code | ROUGE-1 |
|---|---|---|---|---|---|
| 01 | BRIO (OSS) | Yale NLP | Mar 2022 | BRIO: Bringing Order to Abstractive Summarization | 47.78 |
| 02 | GPT-4o (API) | OpenAI | Mar 2023 | GPT-4 Technical Report | 46.30 |
| 03 | Gemini 1.5 Pro (API) | Google | Feb 2024 | Gemini 1.5: Unlocking multimodal understanding across mi… | 45.80 |
| 04 | Llama 3.1 405B (OSS) | Meta | Jul 2024 | The Llama 3 Herd of Models | 45.10 |
| 05 | Qwen2 72B | Alibaba | Jul 2024 | Qwen2 Technical Report | 44.70 |
| 06 | PEGASUS-Large (OSS) | Google | Dec 2019 | PEGASUS: Pre-training with Extracted Gap-sentences for A… | 44.17 |
| # | Model | Org | Submitted | Paper / code | ROUGE-2 |
|---|---|---|---|---|---|
| 01 | BRIO (OSS) | Yale NLP | Mar 2022 | BRIO: Bringing Order to Abstractive Summarization | 23.55 |
| 02 | GPT-4o (API) | OpenAI | Mar 2023 | GPT-4 Technical Report | 22.10 |
| 03 | PEGASUS-Large (OSS) | Google | Dec 2019 | PEGASUS: Pre-training with Extracted Gap-sentences for A… | 21.47 |
| # | Model | Org | Submitted | Paper / code | ROUGE-L |
|---|---|---|---|---|---|
| 01 | BRIO (OSS) | Yale NLP | Mar 2022 | BRIO: Bringing Order to Abstractive Summarization | 44.57 |
| 02 | GPT-4o (API) | OpenAI | Mar 2023 | GPT-4 Technical Report | 43.40 |
| 03 | Gemini 1.5 Pro (API) | Google | Feb 2024 | Gemini 1.5: Unlocking multimodal understanding across mi… | 43.00 |
| 04 | Llama 3.1 405B (OSS) | Meta | Jul 2024 | The Llama 3 Herd of Models | 42.30 |
| 05 | Qwen2 72B | Alibaba | Jul 2024 | Qwen2 Technical Report | 41.80 |
| 06 | PEGASUS-Large (OSS) | Google | Dec 2019 | PEGASUS: Pre-training with Extracted Gap-sentences for A… | 41.11 |
Each row below marks a model that broke the previous record on ROUGE-1. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here. Higher scores win, so each successive record-holder improved on the previous best (the sketch below illustrates the filtering rule).
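As a concrete illustration of that rule, the sketch below replays the ROUGE-1 leaderboard above in submission order and keeps only the entries that beat the running best. Model names and scores are taken from the table; the actual progress chart is rendered by the site, so this is only illustrative.

```python
# Record-filtering sketch: keep only entries that beat the running best
# ROUGE-1. Entries are from the leaderboard above, in submission order.
entries = [
    ("PEGASUS-Large", "Dec 2019", 44.17),
    ("BRIO", "Mar 2022", 47.78),
    ("GPT-4o", "Mar 2023", 46.30),
    ("Gemini 1.5 Pro", "Feb 2024", 45.80),
    ("Llama 3.1 405B", "Jul 2024", 45.10),
    ("Qwen2 72B", "Jul 2024", 44.70),
]
best = float("-inf")
for model, date, rouge1 in entries:
    if rouge1 > best:  # higher scores win
        best = rouge1
        print(f"{date}: {model} sets SOTA at {rouge1}")
# Prints PEGASUS-Large (44.17) and then BRIO (47.78); no later entry
# beats 47.78, so BRIO remains the record-holder.
```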
Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.
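For orientation, here is a minimal sketch of what a reproduction script might look like. Everything in it is an assumption: the checkpoint name is a placeholder, the Hugging Face dataset identifier and field names reflect a CNN/DailyMail-style corpus (which the blurb above suggests but does not name), and the generation settings should match whatever your paper reports.

```python
# Hypothetical reproduction script: the checkpoint name, dataset choice, and
# generation settings are all placeholders -- substitute your own.
from datasets import load_dataset
from rouge_score import rouge_scorer
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

CHECKPOINT = "your-org/your-model"  # placeholder submission checkpoint

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)

# Assumed dataset: the HF CNN/DailyMail release with article/highlights fields.
test = load_dataset("cnn_dailymail", "3.0.0", split="test")

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}

for example in test:
    inputs = tokenizer(example["article"], truncation=True, max_length=1024,
                       return_tensors="pt")
    output_ids = model.generate(**inputs, num_beams=4, max_new_tokens=128)
    prediction = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    scores = scorer.score(example["highlights"], prediction)
    for metric in totals:
        totals[metric] += scores[metric].fmeasure

# Report mean F1 scaled by 100, matching the leaderboard's scale.
print({metric: round(100 * total / len(test), 2) for metric, total in totals.items()})
```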