Reuters news stories annotated with 4 entity types: PER, ORG, LOC, MISC. The standard NER benchmark.
F1 is the reported evaluation metric for CoNLL-2003. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher scores are better.
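As a rough illustration of what an F1 score on CoNLL-2003 measures, the sketch below computes exact-match, entity-level precision, recall, and F1 from BIO tag sequences, in the style of the conventional CoNLL evaluation. It is a minimal sketch, not the scoring code behind any entry on this page: the helper names `extract_entities` and `entity_f1` and the toy sentences are hypothetical, and individual papers may evaluate differently (for example, generative models often need their output parsed into spans first).

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in one BIO tag sequence."""
    entities = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        ends_span = (
            tag == "O"
            or tag.startswith("B-")
            or (tag.startswith("I-") and tag[2:] != etype)
        )
        if ends_span:
            if etype is not None:
                entities.add((etype, start, i))
            start, etype = None, None
        # Open a new span on any non-"O" tag when none is open
        # (a stray "I-" tag is treated leniently as a span start).
        if tag != "O" and etype is None:
            start, etype = i, tag[2:]
    return entities


def entity_f1(gold_sents, pred_sents):
    """Micro-averaged precision, recall, and F1 over exact-match entity spans."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = extract_entities(gold), extract_entities(pred)
        tp += len(g & p)      # a hit needs the same type AND the same boundaries
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (hypothetical): 4 gold entities, 3 predicted, 2 exact matches.
gold = [["B-PER", "I-PER", "O", "B-LOC"], ["B-ORG", "O", "B-MISC"]]
pred = [["B-PER", "I-PER", "O", "B-ORG"], ["B-ORG", "O", "O"]]

precision, recall, f1 = entity_f1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={100 * f1:.1f}")
```

In this toy case the predicted `ORG` span at the position of the gold `LOC` earns no credit, because the type must match as well as the boundaries. The final line scales F1 by 100 to match the percentage-style scores in the table.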
| Rank | Model | Trust | F1 (%) | Year | Links |
|---|---|---|---|---|---|
| 01 | GLiNER-multitask | verified | 93.8 | 2024 | Paper ↗ |
| 02 | DeBERTa-v3-large | verified | 93.4 | 2021 | Paper ↗, Source ↗ |
| 03 | GPT-4o | verified | 91.7 | 2023 | Paper ↗, Source ↗ |
| 04 | Llama 3.1 405B | verified | 90.6 | 2024 | Paper ↗ |
| 05 | Qwen2 72B | verified | 90.2 | 2024 | Paper ↗ |
| 06 | Llama 3 70B | verified | 89.3 | 2024 | Paper ↗ |
| 07 | Mistral 7B | verified | 83.5 | 2023 | Paper ↗ |