Audio-Text-to-Text
Audio-text-to-text is the backbone of voice assistants that actually understand context: models that jointly process speech and text to generate grounded responses. Whisper (2022) cracked robust transcription, but the real leap came when Gemini 1.5 and GPT-4o (2024) began reasoning natively over audio tokens alongside text, eliminating the lossy ASR-then-LLM pipeline. The key challenges are handling overlapping speakers, noisy environments, and preserving prosodic cues such as sarcasm or hesitation that pure transcription destroys. Benchmarks like SUPERB and Dynamic-SUPERB are expanding, but real-world spoken dialogue is still messier than anything current leaderboards capture.
Audio-text-to-text models jointly process speech or audio alongside text prompts to produce text outputs — enabling spoken-language understanding, audio captioning, and speech-grounded Q&A. This task bridges the gap between ASR pipelines and true audio comprehension, where tone, music, and environmental sounds all carry meaning.
History
wav2vec 2.0 (2020) demonstrates self-supervised speech representations that rival supervised ASR with minimal labels
HuBERT (2021) introduces masked-prediction pretraining for speech, achieving SOTA on LibriSpeech
Whisper (OpenAI, 2022) ships a 1.5B-param multilingual ASR model trained on 680k hours of web audio
AudioPaLM (Google, 2023) fuses PaLM-2 with audio tokens for speech-to-speech translation and understanding
Qwen-Audio (2023) processes arbitrary audio inputs (speech, music, environmental sounds) with a unified encoder
Gemini 1.5 Pro (2024) natively ingests long-form audio (up to hours) alongside text and images
GPT-4o (2024) introduces real-time audio reasoning with sub-300ms latency in voice mode
Qwen2-Audio and Gemini 2.0 (2024) push multilingual audio understanding with native multi-turn dialogue over audio
How Audio-Text-to-Text Works
Audio Encoding
Raw waveform is converted to a spectrogram or learned audio tokens via an encoder (Whisper encoder, audio-specific ViT, or HuBERT). The encoder captures phonetic, prosodic, and acoustic features.
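As a simplified sketch of this front end, the NumPy snippet below computes a log-mel spectrogram from a raw waveform: framing, a windowed FFT, and a triangular mel filterbank. Whisper's encoder uses essentially this preprocessing (80 mel bands at 16 kHz) before its learned layers; all sizes here are illustrative.

```python
import numpy as np

def log_mel_spectrogram(wave, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Frame the waveform, take an FFT per frame, pool power into mel bands."""
    # Overlapping 25 ms frames with a 10 ms hop (at 16 kHz).
    frames = np.lib.stride_tricks.sliding_window_view(wave, n_fft)[::hop]
    power = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=-1)) ** 2

    # Triangular mel filterbank mapping linear FFT bins to perceptual bands.
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700)
    mel_pts = 700 * (10 ** (np.linspace(0, mel_max, n_mels + 2) / 2595) - 1)
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    return np.log10(power @ fbank.T + 1e-10)  # shape: (frames, n_mels)

# One second of a 440 Hz tone sampled at 16 kHz.
wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
mel = log_mel_spectrogram(wave)
print(mel.shape)
```

Learned tokenizers (HuBERT units, neural codecs) replace the fixed filterbank with trained features, but the time-frequency framing step is the same.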
Feature Projection
Audio embeddings are projected into the same embedding space as the LLM's text tokens, often via a lightweight adapter (linear projection, Q-Former, or cross-attention bridge).
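A minimal sketch of the projection step, with illustrative dimensions (1280-dim encoder frames match Whisper-large; the LLM width and the 2x frame stacking are assumptions): many adapters stack adjacent audio frames to shorten the sequence, then apply a learned linear map into the LLM's embedding space. Weights are random here, standing in for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_audio, d_llm, k = 1280, 4096, 2   # illustrative sizes; k = temporal downsampling

# 150 encoder output frames (~3 s of audio at 50 frames/s).
audio_feats = rng.normal(size=(150, d_audio))

# Stack k adjacent frames, then project with a learned linear map (random here).
stacked = audio_feats.reshape(-1, k * d_audio)   # (75, 2560)
W = rng.normal(0, 0.02, (k * d_audio, d_llm))
audio_tokens = stacked @ W                       # (75, 4096): LLM-space tokens
print(audio_tokens.shape)
```

Q-Former and cross-attention bridges do the same job with attention instead of a fixed linear map, trading parameters for a learnable, length-independent compression.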
Interleaved Decoding
The LLM processes the interleaved sequence of audio and text tokens autoregressively, attending over both modalities to generate text outputs.
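The interleaving itself is just concatenation in the shared embedding space. A hedged sketch with a stand-in embedding table and made-up sizes: a special marker token, the projected audio tokens, and the text prompt are joined into one sequence that the decoder attends over.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096  # shared embedding width (illustrative)

def embed_text(tokens):
    # Stand-in for the LLM's token embedding table (random vectors here).
    return rng.normal(size=(len(tokens), d))

prefix = embed_text(["<audio>"])              # marker delimiting the audio span
audio = rng.normal(size=(75, d))              # projected audio tokens from the adapter
suffix = embed_text("What language is spoken ?".split())

# One interleaved sequence; the LLM decodes text autoregressively over all of it.
sequence = np.concatenate([prefix, audio, suffix], axis=0)
print(sequence.shape)
```

Because audio tokens sit in the same sequence as text, the model can condition its answer on prosody and acoustics directly rather than on a lossy transcript.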
Post-processing
Output text may include transcriptions, summaries, answers to questions about audio content, or structured metadata like speaker diarization and sentiment labels.
Current Landscape
The audio-text-to-text landscape in 2025 is rapidly consolidating around natively multimodal LLMs that can ingest raw audio without a separate ASR stage. Gemini 2.0 and GPT-4o lead the proprietary space with native audio understanding, while Qwen2-Audio dominates open-source. The old pipeline of ASR → NLP is giving way to end-to-end models that preserve prosody, tone, and non-speech audio cues. However, two-stage approaches (Whisper + LLM) remain competitive for pure transcription tasks and offer better controllability. Key differentiators now are long-context audio handling, real-time streaming support, and non-speech audio comprehension.
Key Challenges
Long-form audio context — encoding a 1-hour podcast into a manageable token budget without losing temporal detail
Speaker attribution and diarization — models struggle to track who said what in multi-speaker settings
Non-speech audio understanding — environmental sounds, music analysis, and acoustic scene classification lag behind speech tasks
Multilingual and code-switched audio — performance drops sharply on low-resource languages and mixed-language speech
Hallucination under noise — models confabulate transcriptions when audio is noisy or low-quality
Quick Recommendations
Best overall
Gemini 2.0 Flash
Natively processes long-form audio with text, strong multilingual support, fast inference, and competitive pricing
Best accuracy on speech tasks
GPT-4o
Highest accuracy on speech understanding benchmarks with real-time audio reasoning capabilities
Open source
Qwen2-Audio-7B-Instruct
Best open-weight audio-language model; handles speech, music, and sound effects in a unified architecture
ASR + understanding pipeline
Whisper Large-v3 + GPT-4o
Whisper handles robust transcription, GPT-4o reasons over the transcript — still the most reliable two-stage approach
On-device / edge
Whisper Small + Phi-3-mini
Compact quantized footprint that runs on mobile devices with acceptable latency for transcription + summarization
What's Next
The next frontier is real-time audio agents that can interrupt, ask clarifying questions, and maintain multi-turn conversations over streaming audio. Expect models to handle 10+ hour audio contexts natively, improve speaker diarization without external tools, and develop richer understanding of music, emotion, and acoustic environments. Audio-visual grounding — where models jointly process video and audio tracks — will become standard in multimodal systems.
Benchmarks & SOTA
AudioBench
Comprehensive evaluation of audio understanding language models
No results tracked yet
VoiceBench
VoiceBench: Benchmarking LLM-Based Voice Assistants
Comprehensive evaluation benchmark for voice agents (LLM-based speech assistants) measuring instruction following, robustness to accents/noise/content variations, and task performance across diverse scenarios.
No results tracked yet
Related Tasks
Image-Text-to-Image
Image-text-to-image covers instruction-guided image editing — taking a source image plus a text command and producing a modified result. InstructPix2Pix (2023) demonstrated this could work zero-shot, and subsequent models like DALL-E 3's inpainting, Ideogram, and Stable Diffusion's img2img pipelines made it practical. The core difficulty is surgical precision: users want "change the dress to red" without altering the face, background, or lighting, which requires disentangling content from style at a level current architectures still fumble. Benchmarks are fragmented across editing fidelity, instruction-following, and identity preservation, making unified comparison difficult.
Image-Text-to-Text
Image-text-to-text exploded from a research curiosity to the dominant AI interface in under two years. GPT-4V (2023) proved multimodal LLMs could reason over images, Gemini 1.5 scaled to million-token contexts mixing text and vision, and Claude 3 showed that careful RLHF produces models far less prone to hallucinating about image content. MMMU and MMBench have become the standard evaluation gauntlet, but the real challenge is grounding: models still confabulate spatial relationships and struggle with fine-grained visual reasoning. This is the task that turned chatbots into visual assistants.
Image-Text-to-Video
Image-text-to-video is generative AI's hardest unsolved frontier — animating a still image according to a text prompt while maintaining temporal coherence and physical plausibility. Stable Video Diffusion (2023) and Runway Gen-2 showed early promise, Sora (2024) raised the bar dramatically with minute-long physically consistent clips, and Kling and Veo 2 pushed quality further. The fundamental challenge is that video generation requires implicit world models: objects must persist, lighting must evolve consistently, and motion must obey approximate physics across dozens of frames. Evaluation is still largely human-judged, with FVD and CLIP-temporal scores poorly correlating with perceived quality.
Visual Question Answering
Visual question answering (VQA) is the original multimodal reasoning task — given an image and a natural language question, produce the correct answer. VQAv2 (2017) defined the field, but modern benchmarks like GQA, OK-VQA, and TextVQA have pushed toward compositional reasoning, external knowledge, and OCR-dependent understanding. The task was largely "solved" in its classic form once multimodal LLMs arrived, with GPT-4V and Gemini saturating standard benchmarks, but adversarial and compositional variants still expose systematic failures in spatial reasoning and counting. VQA's legacy is establishing that vision-language models need more than pattern matching — they need genuine visual understanding.
Something wrong or missing?
Help keep Audio-Text-to-Text benchmarks accurate. Report outdated results, missing benchmarks, or errors.