Codesota · Benchmark · CNN/DailyMail

CNN/DailyMail.

300K news articles with multi-sentence summaries. Standard benchmark for abstractive summarization.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.


rouge-1

ROUGE-1 measures unigram (single-word) overlap between a generated summary and its reference summary, and is one of the metrics reported for CNN/DailyMail. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
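To make the metric concrete, here is a minimal sketch of a unigram-overlap F1 score. It assumes the F1 variant scaled to 0-100 (which matches the range of the scores listed on this page) and uses plain lowercase whitespace tokenization; published results typically come from official ROUGE tooling with its own tokenization and stemming, so numbers from this sketch are not comparable to the leaderboard. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 on a 0-100 scale (simplified tokenization)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a word counts at most as often as it occurs in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 100 * 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
score = rouge1_f1(reference, candidate)
print(f"ROUGE-1 F1: {score:.2f}")
```

Five of the six candidate words match the reference, so precision and recall are both 5/6 in this toy example.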

Trust tiers for rouge-1: verified, paper, vendor, community, unverified

| Rank | Model | Notes | Trust | Score | Year |
|---|---|---|---|---|---|
| 01 | BRIO | BRIO (BART-large backbone). Source: Table 1, arxiv:2203.16804. | verified | 47.78 | 2022 |
| 02 | GPT-4o | GPT-4o zero-shot summarization. | verified | 46.3 | 2023 |
| 03 | Gemini 1.5 Pro | Gemini 1.5 Pro zero-shot. Source: Gemini 1.5 technical report. | verified | 45.8 | 2024 |
| 04 | Llama 3.1 405B | Llama 3.1 405B Instruct zero-shot summarization. Source: Llama 3 paper. | verified | 45.1 | 2024 |
| 05 | Qwen2 72B | Qwen2 72B Instruct zero-shot. Source: Qwen2 technical report. | verified | 44.7 | 2024 |
| 06 | PEGASUS-Large | PEGASUS-Large. Source: Table 1, arxiv:1912.08777. | verified | 44.17 | 2019 |

rouge-l

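ROUGE-L scores the longest common subsequence of words shared by a generated summary and its reference summary, and is one of the metrics reported for CNN/DailyMail. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.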

Higher is better
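As an illustration of how a longest-common-subsequence score behaves, the sketch below computes a sentence-level LCS-based F1 on a 0-100 scale. It is a simplification under stated assumptions: lowercase whitespace tokens, no stemming, and a single LCS over the whole text, whereas summarization papers often report a summary-level variant computed sentence by sentence. The helper names `lcs_length` and `rouge_l_f1` are hypothetical.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """LCS-based F1 on a 0-100 scale (simplified tokenization)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 100 * 2 * precision * recall / (precision + recall)


reference = "police killed the gunman"
in_order = rouge_l_f1(reference, "police kill the gunman")
shuffled = rouge_l_f1(reference, "the gunman kill police")
print(f"in order: {in_order:.1f}, shuffled: {shuffled:.1f}")
```

Both candidates share three words with the reference, but only the first keeps them in the reference order, so it scores higher under an LCS-based metric.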

Trust tiers for rouge-l: verified, paper, vendor, community, unverified

| Rank | Model | Notes | Trust | Score | Year |
|---|---|---|---|---|---|
| 01 | BRIO | BRIO. Source: Table 1, arxiv:2203.16804. | verified | 44.57 | 2022 |
| 02 | GPT-4o | GPT-4o zero-shot summarization. | verified | 43.4 | 2023 |
| 03 | Gemini 1.5 Pro | Gemini 1.5 Pro zero-shot. Source: Gemini 1.5 technical report. | verified | 43 | 2024 |
| 04 | Llama 3.1 405B | Llama 3.1 405B Instruct zero-shot summarization. Source: Llama 3 paper. | verified | 42.3 | 2024 |
| 05 | Qwen2 72B | Qwen2 72B Instruct. Source: Qwen2 technical report. | verified | 41.8 | 2024 |
| 06 | PEGASUS-Large | PEGASUS-Large. Source: Table 1, arxiv:1912.08777. | verified | 41.11 | 2019 |
| 07 | BART | | unverified | 40.9 | 2019 |

rouge-2

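ROUGE-2 measures bigram (two-word sequence) overlap between a generated summary and its reference summary, and is one of the metrics reported for CNN/DailyMail. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.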

Higher is better
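Bigram scores on this page are much lower than the unigram scores for the same models, in part because bigram overlap is sensitive to word order. The sketch below shows this with a generic n-gram F1, assuming the F1 variant on a 0-100 scale and plain lowercase whitespace tokenization (official ROUGE tooling differs, so the output is not comparable to published scores). The names `ngram_counts` and `rouge_n_f1` are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference: str, candidate: str, n: int) -> float:
    """N-gram overlap F1 on a 0-100 scale (simplified tokenization)."""
    ref = ngram_counts(reference.lower().split(), n)
    cand = ngram_counts(candidate.lower().split(), n)
    # Clipped overlap: an n-gram counts at most as often as it occurs in both texts.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 100 * 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
candidate = "on the mat the cat sat"  # same words, different order
unigram_score = rouge_n_f1(reference, candidate, 1)
bigram_score = rouge_n_f1(reference, candidate, 2)
print(f"n=1: {unigram_score:.1f}, n=2: {bigram_score:.1f}")
```

The reordered candidate keeps a perfect unigram score but loses one of its five bigrams, so the bigram score drops.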

Trust tiers for rouge-2: verified, paper, vendor, community, unverified

| Rank | Model | Notes | Trust | Score | Year |
|---|---|---|---|---|---|
| 01 | BRIO | BRIO. Source: Table 1, arxiv:2203.16804. | verified | 23.55 | 2022 |
| 02 | GPT-4o | GPT-4o zero-shot summarization. | verified | 22.1 | 2023 |
| 03 | PEGASUS-Large | PEGASUS-Large. Source: Table 1, arxiv:1912.08777. | verified | 21.47 | 2019 |

§ 04 · Submit a result

Add to the leaderboard.
