Benchmark for code summarization (docstring generation) across 6 programming languages: Python, Java, JavaScript, PHP, Ruby, Go. Over 2M (code, docstring) pairs. Primary metric is BLEU-4.
BLEU-4 is the reported evaluation metric for CodeSearchNet. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
| Rank | Model | Trust | Score | Year | Links |
|---|---|---|---|---|---|
| 01 | GPT-4o | verified | 25.3 | 2026 | Source ↗ |
| 02 | Qwen2.5-Coder 32B | verified | 23.4 | 2024 | Paper ↗, Code ↗ |
| 03 | DeepSeek-Coder-V2-Instruct | verified | 22.8 | 2024 | Paper ↗, Code ↗ |
| 04 | CodeT5+ 2B | verified | 21.36 | 2023 | Paper ↗, Code ↗ |
| 05 | CodeT5+ | verified | 20.01 | 2023 | Paper ↗, Code ↗ |
| 06 | UniXcoder | verified | 19.06 | 2022 | Paper ↗, Code ↗ |
| 07 | CodeBERT | verified | 17.65 | 2020 | Paper ↗, Code ↗ |
Smoothed BLEU-4 is a second reported evaluation metric for CodeSearchNet.
Higher is better
| Rank | Model | Trust | Score | Year | Links |
|---|---|---|---|---|---|
| 01 | CodeBERT (MLM+RTD) | verified | 15.99 | 2020 | Paper ↗, Code ↗ |
| 02 | CodeBERT (MLM) | verified | 15.55 | 2020 | Paper ↗, Code ↗ |
| 03 | pre-train w/ code only | verified | 15.15 | 2020 | Paper ↗, Code ↗ |
| 04 | CodeBERT (RTD) | verified | 15.03 | 2020 | Paper ↗, Code ↗ |
| 05 | RoBERTa | verified | 14.52 | 2020 | Paper ↗, Code ↗ |
| 06 | Transformer | verified | 14.31 | 2020 | Paper ↗, Code ↗ |
| 07 | seq2seq | verified | 13.36 | 2020 | Paper ↗, Code ↗ |
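The two tables use related but different metrics, so their scores are not directly comparable. As a rough illustration of the difference, the sketch below computes sentence-level BLEU-4 against a single reference docstring, with and without smoothing. It is a minimal sketch, not the scoring script behind any listed result: the function names `ngrams` and `bleu4`, whitespace tokenization, and the add-one smoothing scheme are all assumptions made for the example, and the papers above may tokenize, smooth, and aggregate differently.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference, smooth=False):
    """Sentence-level BLEU-4 against one reference, on a 0-100 scale.

    With smooth=True, 1 is added to the numerator and denominator of each
    n-gram precision (an assumed add-one scheme, for illustration only).
    """
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        # Clipped matches: a candidate n-gram counts at most as often
        # as it appears in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if smooth:
            matches, total = matches + 1, total + 1
        if matches == 0 or total == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precision_sum += math.log(matches / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    # Geometric mean of the four precisions, scaled to 0-100.
    return 100.0 * brevity_penalty * math.exp(log_precision_sum / 4)


reference = "return the sum of two integers".split()
close = "return the sum of two numbers".split()
short = "adds numbers".split()

print(bleu4(close, reference), bleu4(close, reference, smooth=True))
print(bleu4(short, reference), bleu4(short, reference, smooth=True))
```

Docstrings are short, so a generated summary often shares no 4-gram with the reference. Unsmoothed sentence-level BLEU-4 is then exactly zero, while a smoothed variant still gives partial credit, which is one reason summarization papers often report the smoothed form.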