e2e is a machine learning benchmark indexed on Codesota. This page tracks published model results, top scores per metric, and the SOTA timeline for e2e.
ROUGE-L is one of the evaluation metrics reported for e2e. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
| Rank | Model | Trust | Score | Year | Links |
|---|---|---|---|---|---|
| 01 | GPT-2-Large (prefix-tuning) | verified | 71.7 | 2021 | Paper ↗ |
| 02 | GPT-2-Medium (prefix-tuning) | verified | 71.4 | 2021 | Paper ↗ |
| 03 | HTLM (prefix-tuning) | verified | 71.2 | 2021 | Paper ↗ |
| 04 | GPT-2-Medium (fine-tuning) | verified | 71.0 | 2021 | Paper ↗ |
| 05 | HTLM (fine-tuning) | verified | 70.8 | 2021 | Paper ↗ |
| 06 | GPT-2-Large (fine-tuning) | verified | 69.9 | 2021 | Paper ↗ |
| 07 | T5-base (STSM) | verified | 68.97 | 2024 | Paper ↗ |
| 08 | BART-base (STSM) | verified | 68.76 | 2024 | Paper ↗ |
| 09 | FLAN-T5-base (STSM) | verified | 67.85 | 2024 | Paper ↗ |
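For readers unfamiliar with the metric, here is a minimal sketch of how a sentence-level ROUGE-L score can be computed from the longest common subsequence (LCS) of a candidate and a single reference. It assumes lowercased whitespace tokenization and a balanced F-measure (`beta=1.0`). The function names `lcs_length` and `rouge_l` are illustrative, and the evaluation scripts behind the scores above may tokenize, weight, and aggregate over multiple references differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level ROUGE-L F-measure in [0, 1] (assumed: one reference)."""
    cand = candidate.lower().split()  # assumption: whitespace tokenization
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```

For example, `rouge_l("a b c d", "a c e")` gives about 0.571 (LCS length 2, precision 0.5, recall 0.667). The leaderboard scores appear to be the same kind of quantity scaled to 0-100.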
BLEU is one of the evaluation metrics reported for e2e.
Higher is better
| Rank | Model | Trust | Score | Year | Links |
|---|---|---|---|---|---|
| 01 | GPT-2-Large (prefix-tuning) | verified | 70.3 | 2021 | Paper ↗ |
| 02 | HTLM (fine-tuning) | verified | 70.3 | 2021 | Paper ↗ |
| 03 | HTLM (prefix-tuning) | verified | 70.1 | 2021 | Paper ↗ |
| 04 | GPT-2-Medium (prefix-tuning) | verified | 69.7 | 2021 | Paper ↗ |
| 05 | GPT-2-Large (fine-tuning) | verified | 68.5 | 2021 | Paper ↗ |
| 06 | GPT-2-Medium (fine-tuning) | verified | 68.2 | 2021 | Paper ↗ |
| 07 | T5-base (STSM) | verified | 66.95 | 2024 | Paper ↗ |
| 08 | BART-base (STSM) | verified | 65.74 | 2024 | Paper ↗ |
| 09 | FLAN-T5-base (STSM) | verified | 65.65 | 2024 | Paper ↗ |
METEOR is one of the evaluation metrics reported for e2e.
Higher is better
| Rank | Model | Trust | Score | Year | Links |
|---|---|---|---|---|---|
| 01 | HTLM (fine-tuning) | verified | 46.3 | 2021 | Paper ↗ |
| 02 | GPT-2-Large (prefix-tuning) | verified | 46.2 | 2021 | Paper ↗ |
| 03 | GPT-2-Medium (fine-tuning) | verified | 46.2 | 2021 | Paper ↗ |
| 04 | HTLM (prefix-tuning) | verified | 46.1 | 2021 | Paper ↗ |
| 05 | GPT-2-Medium (prefix-tuning) | verified | 46.1 | 2021 | Paper ↗ |
| 06 | GPT-2-Large (fine-tuning) | verified | 46.0 | 2021 | Paper ↗ |
| 07 | T5-base (STSM) | verified | 45.7 | 2024 | Paper ↗ |
| 08 | BART-base (STSM) | verified | 45.6 | 2024 | Paper ↗ |
| 09 | FLAN-T5-base (STSM) | verified | 45.54 | 2024 | Paper ↗ |
NIST is one of the evaluation metrics reported for e2e.
Higher is better
| Rank | Model | Trust | Score | Year | Links |
|---|---|---|---|---|---|
| 01 | HTLM (fine-tuning) | verified | 8.90 | 2021 | Paper ↗ |
| 02 | HTLM (prefix-tuning) | verified | 8.85 | 2021 | Paper ↗ |
| 03 | GPT-2-Large (prefix-tuning) | verified | 8.85 | 2021 | Paper ↗ |
| 04 | GPT-2-Medium (prefix-tuning) | verified | 8.81 | 2021 | Paper ↗ |
| 05 | GPT-2-Large (fine-tuning) | verified | 8.78 | 2021 | Paper ↗ |
| 06 | GPT-2-Medium (fine-tuning) | verified | 8.62 | 2021 | Paper ↗ |
| 07 | T5-base (STSM) | verified | 8.59 | 2024 | Paper ↗ |
| 08 | FLAN-T5-base (STSM) | verified | 8.49 | 2024 | Paper ↗ |
| 09 | BART-base (STSM) | verified | 8.46 | 2024 | Paper ↗ |
CIDEr is one of the evaluation metrics reported for e2e.
Higher is better
| Rank | Model | Trust | Score | Year | Links |
|---|---|---|---|---|---|
| 01 | GPT-2-Medium (prefix-tuning) | verified | 2.49 | 2021 | Paper ↗ |
| 02 | HTLM (fine-tuning) | verified | 2.47 | 2021 | Paper ↗ |
| 03 | GPT-2-Medium (fine-tuning) | verified | 2.47 | 2021 | Paper ↗ |
| 04 | GPT-2-Large (prefix-tuning) | verified | 2.47 | 2021 | Paper ↗ |
| 05 | GPT-2-Large (fine-tuning) | verified | 2.45 | 2021 | Paper ↗ |
| 06 | HTLM (prefix-tuning) | verified | 2.45 | 2021 | Paper ↗ |
| 07 | T5-base (STSM) | verified | 2.27 | 2024 | Paper ↗ |
| 08 | BART-base (STSM) | verified | 2.20 | 2024 | Paper ↗ |
| 09 | FLAN-T5-base (STSM) | verified | 2.12 | 2024 | Paper ↗ |
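Because the five leaderboards on this page do not agree on a single winner, a small script can make the per-metric leaders explicit. The sketch below copies the top three rows of each table by hand (the `scores` dictionary and the `leaders` helper are illustrative, not part of Codesota) and reports every model tied for the best score.

```python
# Top three rows per metric, copied by hand from the tables on this page.
scores = {
    "ROUGE-L": {
        "GPT-2-Large (prefix-tuning)": 71.7,
        "GPT-2-Medium (prefix-tuning)": 71.4,
        "HTLM (prefix-tuning)": 71.2,
    },
    "BLEU": {
        "GPT-2-Large (prefix-tuning)": 70.3,
        "HTLM (fine-tuning)": 70.3,
        "HTLM (prefix-tuning)": 70.1,
    },
    "METEOR": {
        "HTLM (fine-tuning)": 46.3,
        "GPT-2-Large (prefix-tuning)": 46.2,
        "GPT-2-Medium (fine-tuning)": 46.2,
    },
    "NIST": {
        "HTLM (fine-tuning)": 8.90,
        "HTLM (prefix-tuning)": 8.85,
        "GPT-2-Large (prefix-tuning)": 8.85,
    },
    "CIDEr": {
        "GPT-2-Medium (prefix-tuning)": 2.49,
        "HTLM (fine-tuning)": 2.47,
        "GPT-2-Medium (fine-tuning)": 2.47,
    },
}


def leaders(metric_scores):
    """Return all models sharing the highest score (higher is better)."""
    best = max(metric_scores.values())
    return sorted(m for m, s in metric_scores.items() if s == best)


for metric, table in scores.items():
    print(metric, leaders(table))
```

Returning a list rather than a single name keeps ties visible, such as the two models that share the top BLEU score of 70.3.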