300K news articles with multi-sentence summaries. Standard benchmark for abstractive summarization.
ROUGE-based evaluation is saturated, with no significant improvements since 2022. Modern summarization evaluation instead relies on LLM-as-judge protocols (G-Eval), human preference studies, or factual consistency metrics.
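For reference, here is a minimal sketch of how per-example ROUGE is computed, using Google's `rouge-score` package. The leaderboard reports F1 scaled by 100; the stemming setting and test-set aggregation shown here are assumptions and may differ from the official harness.

```python
# Minimal per-example ROUGE scoring sketch with the `rouge-score` package.
# Stemming and aggregation choices are assumptions, not the official setup.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "The cat sat on the mat and then went to sleep."
candidate = "A cat sat on the mat before falling asleep."
scores = scorer.score(reference, candidate)  # signature: score(target, prediction)
for metric, result in scores.items():
    print(f"{metric}: F1 = {100 * result.fmeasure:.2f}")
```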
15 results indexed across 3 metrics. The shaded row in each table marks the current SOTA; ties are broken by submission date.
| # | Model | Org | Submitted | Paper / code | ROUGE-1 |
|---|---|---|---|---|---|
| 01 | BRIO (OSS) | Yale NLP | Mar 2022 | BRIO: Bringing Order to Abstractive Summarization | 47.78 |
| 02 | GPT-4o (API) | OpenAI | Mar 2023 | GPT-4 Technical Report | 46.30 |
| 03 | Gemini 1.5 Pro (API) | Google | Feb 2024 | Gemini 1.5: Unlocking multimodal understanding across mi… | 45.80 |
| 04 | Llama 3.1 405B (OSS) | Meta | Jul 2024 | The Llama 3 Herd of Models | 45.10 |
| 05 | Qwen2 72B | Alibaba | Jul 2024 | Qwen2 Technical Report | 44.70 |
| 06 | PEGASUS-Large (OSS) | Google | Dec 2019 | PEGASUS: Pre-training with Extracted Gap-sentences for A… | 44.17 |
| # | Model | Org | Submitted | Paper / code | ROUGE-2 |
|---|---|---|---|---|---|
| 01 | BRIO (OSS) | Yale NLP | Mar 2022 | BRIO: Bringing Order to Abstractive Summarization | 23.55 |
| 02 | GPT-4o (API) | OpenAI | Mar 2023 | GPT-4 Technical Report | 22.10 |
| 03 | PEGASUS-Large (OSS) | Google | Dec 2019 | PEGASUS: Pre-training with Extracted Gap-sentences for A… | 21.47 |
| # | Model | Org | Submitted | Paper / code | ROUGE-L |
|---|---|---|---|---|---|
| 01 | BRIO (OSS) | Yale NLP | Mar 2022 | BRIO: Bringing Order to Abstractive Summarization | 44.57 |
| 02 | GPT-4o (API) | OpenAI | Mar 2023 | GPT-4 Technical Report | 43.40 |
| 03 | Gemini 1.5 Pro (API) | Google | Feb 2024 | Gemini 1.5: Unlocking multimodal understanding across mi… | 43.00 |
| 04 | Llama 3.1 405B (OSS) | Meta | Jul 2024 | The Llama 3 Herd of Models | 42.30 |
| 05 | Qwen2 72B | Alibaba | Jul 2024 | Qwen2 Technical Report | 41.80 |
| 06 | PEGASUS-Large (OSS) | Google | Dec 2019 | PEGASUS: Pre-training with Extracted Gap-sentences for A… | 41.11 |
Each row below marks a model that broke the previous record on ROUGE-1. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here. Higher scores win, so each successive record-holder improved on the previous best (the sketch below illustrates the filtering rule).
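As a concrete illustration of that rule, the sketch below replays the ROUGE-1 leaderboard above in submission order and keeps only the entries that beat the running best. Model names and scores are taken from the table; the actual progress chart is rendered by the site, so this is only illustrative.

```python
# Record-filtering sketch: keep only entries that beat the running best
# ROUGE-1. Entries are from the leaderboard above, in submission order.
entries = [
    ("PEGASUS-Large", "Dec 2019", 44.17),
    ("BRIO", "Mar 2022", 47.78),
    ("GPT-4o", "Mar 2023", 46.30),
    ("Gemini 1.5 Pro", "Feb 2024", 45.80),
    ("Llama 3.1 405B", "Jul 2024", 45.10),
    ("Qwen2 72B", "Jul 2024", 44.70),
]
best = float("-inf")
for model, date, rouge1 in entries:
    if rouge1 > best:  # higher scores win
        best = rouge1
        print(f"{date}: {model} sets SOTA at {rouge1}")
# Prints PEGASUS-Large (44.17) and then BRIO (47.78); no later entry
# beats 47.78, so BRIO remains the record-holder.
```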
Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.
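For orientation, here is a minimal sketch of what a reproduction script might look like. Everything in it is an assumption: the checkpoint name is a placeholder, the Hugging Face dataset identifier and field names reflect a CNN/DailyMail-style corpus (which the blurb above suggests but does not name), and the generation settings should match whatever your paper reports.

```python
# Hypothetical reproduction script: the checkpoint name, dataset choice, and
# generation settings are all placeholders -- substitute your own.
from datasets import load_dataset
from rouge_score import rouge_scorer
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

CHECKPOINT = "your-org/your-model"  # placeholder submission checkpoint

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)

# Assumed dataset: the HF CNN/DailyMail release with article/highlights fields.
test = load_dataset("cnn_dailymail", "3.0.0", split="test")

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}

for example in test:
    inputs = tokenizer(example["article"], truncation=True, max_length=1024,
                       return_tensors="pt")
    output_ids = model.generate(**inputs, num_beams=4, max_new_tokens=128)
    prediction = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    scores = scorer.score(example["highlights"], prediction)
    for metric in totals:
        totals[metric] += scores[metric].fmeasure

# Report mean F1 scaled by 100, matching the leaderboard's scale.
print({metric: round(100 * total / len(test), 2) for metric, total in totals.items()})
```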