Audio-Text-to-Text
Audio-text-to-text is the backbone of voice assistants that actually understand context: models that jointly process speech and text to generate grounded responses. Whisper (2022) cracked robust transcription, but the real leap came when Gemini 1.5 and GPT-4o (2024) began reasoning natively over audio tokens alongside text, eliminating the lossy ASR-then-LLM cascade. The key challenges are handling overlapping speakers, coping with noisy environments, and preserving prosodic cues like sarcasm or hesitation that pure transcription destroys. Benchmarks like SUPERB and Dynamic-SUPERB keep expanding, but real-world spoken dialogue understanding remains far harder than anything the leaderboards capture.
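To make the cascade being replaced concrete, here is a minimal sketch of the ASR-then-LLM pipeline using Hugging Face transformers; the model IDs and the meeting.wav input are illustrative assumptions, not a recommended stack.

```python
# Hypothetical ASR-then-LLM cascade: transcribe first, then reason over text.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
llm = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

# Prosody, speaker overlap, and hesitation are all discarded at this boundary,
# which is exactly the information loss native audio-token models avoid.
transcript = asr("meeting.wav")["text"]  # meeting.wav is a placeholder file
prompt = f"User (spoken): {transcript}\nAssistant:"
reply = llm(prompt, max_new_tokens=128)[0]["generated_text"]
print(reply)
```

A native audio-text model would consume the waveform directly, so those paralinguistic cues stay available to the reasoning step instead of dying in the transcript.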
VoiceBench
A comprehensive evaluation benchmark for voice agents (LLM-based speech assistants), measuring instruction following, robustness to accent, noise, and content variations, and task performance across diverse scenarios.
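As a concrete illustration of the robustness axis, the sketch below perturbs a waveform with additive white noise at a target SNR, the kind of stress test such a benchmark applies; the add_noise helper and the stand-in waveform are assumptions for illustration, not VoiceBench's actual tooling.

```python
import numpy as np

def add_noise(speech: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix white Gaussian noise into a waveform at a target SNR in dB."""
    signal_power = np.mean(speech ** 2)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR/10)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=speech.shape)
    return speech + noise

# Replay the same spoken instruction at progressively harsher SNRs and
# check whether the assistant's answer survives.
clean = np.random.randn(16_000)  # stand-in for one second of 16 kHz speech
noisy_variants = {snr: add_noise(clean, snr) for snr in (20, 10, 0)}
```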
Top 10
Leading models on VoiceBench.
| Rank | Model | Overall Score | Year | Source |
|---|---|---|---|---|
| 1 | Ultravox-GLM-4P7 | 88.9 | 2026 | paper |
| 2 | Whisper-v3-large + GPT-4o (cascade) | 87.8 | 2026 | paper |
| 3 | GPT-4o-Audio | 86.8 | 2026 | paper |
| 4 | Whisper-v3-large + LLaMA-3.1-8B (cascade) | 77.5 | 2026 | paper |
| 5 | Kimi-Audio | 76.9 | 2026 | paper |
| 6 | MiniCPM-o | 71.2 | 2026 | paper |
| 7 | VITA-1.5 | 64.5 | 2026 | paper |
| 8 | Qwen2-Audio | 55.8 | 2026 | paper |
| 9 | LLaMA-Omni | 41.1 | 2026 | paper |
| 10 | VITA-1.0 | 36.4 | 2026 | paper |
All datasets
2 datasets tracked for this task.
Related tasks
Other tasks in Multimodal.