WildASR is a multilingual (English, Chinese, Japanese, Korean) diagnostic benchmark that evaluates ASR robustness across three out-of-distribution dimensions: environmental degradation (reverberation, noise, clipping), demographic shift (accents, children, older speakers), and linguistic diversity (code-switching, short utterances, incomplete speech). It reports word error rate (WER) for English and character error rate (CER) for Chinese, Japanese, and Korean.
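As an illustration of the two metrics, here is a minimal sketch of the standard edit-distance definitions of WER and CER. It is not WildASR's own scoring code: the function names are hypothetical, and the text normalization used here (whitespace tokenization for WER, dropping whitespace for CER) is an assumption, since the benchmark's normalization rules are not described on this page.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (minimum number of substitutions, insertions, and deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count.
    Assumes whitespace tokenization (hypothetical normalization)."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length.
    Assumes whitespace is ignored (hypothetical normalization)."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(wer("the cat sat on the mat", "the cat sat on mat"))
    # One substitution plus one deletion out of seven reference characters.
    print(cer("今日は晴れです", "今日は雨です"))
```

Character-level scoring is the usual choice for Chinese, Japanese, and Korean because those scripts do not rely on whitespace to delimit words, so a word-level rate would depend on the choice of segmenter. Both functions return a fraction; multiply by 100 to express the result as a percentage.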
CER (character error rate) is the metric WildASR reports for Chinese, Japanese, and Korean. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Lower is better
| Rank | Model | Trust | CER | Year |
|---|---|---|---|---|
| 01 | Gemini 3 Pro | verified | 6.10 | 2025 |
| 02 | GPT-4o Transcribe | verified | 6.40 | 2025 |
| 03 | Gemini 2.5 Pro | verified | 6.70 | 2025 |
| 04 | Whisper Large V3 | verified | 7.50 | 2025 |
| 05 | Scribe V1 | verified | 8.70 | 2025 |
| 06 | Qwen2-Audio | verified | 9.10 | 2025 |
| 07 | Nova 2 | verified | 10.1 | 2025 |
WER (word error rate) is the metric WildASR reports for English. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Lower is better
| Rank | Model | Trust | WER | Year |
|---|---|---|---|---|
| 01 | Gemini 3 Pro | verified | 2.80 | 2025 |
| 02 | GPT-4o Transcribe | verified | 2.80 | 2025 |
| 03 | Gemini 2.5 Pro | verified | 3.60 | 2025 |
| 04 | Scribe V1 | verified | 3.60 | 2025 |
| 05 | Whisper Large V3 | verified | 4.20 | 2025 |
| 06 | Qwen2-Audio | verified | 5.80 | 2025 |
| 07 | Nova 2 | verified | 6.00 | 2025 |