Audio-text-to-text is the backbone of voice assistants that actually understand context — models that jointly process speech and text to generate grounded responses. Whisper (2022) cracked robust transcription, but the real leap came when Gemini 1.5 and GPT-4o (2024) began reasoning natively over audio tokens alongside text, eliminating the lossy ASR-then-LLM pipeline. The key challenges are handling overlapping speakers, noisy environments, and preserving prosodic cues like sarcasm or hesitation that pure transcription destroys. Benchmarks like SUPERB and Dynamic-SUPERB are expanding, but real-world spoken dialogue understanding remains far ahead of what leaderboards capture.
Comprehensive evaluation benchmark for voice agents (LLM-based speech assistants) measuring instruction following, robustness to accents/noise/content variations, and task performance across diverse scenarios.
Leading models on VoiceBench.
No results yet. Be the first to contribute.
Didn't find the model, metric, or dataset you needed? Tell us in one line. We read every message and reply within 48 hours.
3 datasets tracked for this task.
Still looking for something on Audio-Text-to-Text? A missing model, a stale score, a benchmark we should cover — drop it here and we'll handle it.
Real humans read every message. We track what people are asking for and prioritize accordingly.