
Audio-Text-to-Text

Audio-text-to-text is the backbone of voice assistants that actually understand context — models that jointly process speech and text to generate grounded responses. Whisper (2022) cracked robust transcription, but the real leap came when Gemini 1.5 and GPT-4o (2024) began reasoning natively over audio tokens alongside text, eliminating the lossy ASR-then-LLM pipeline. The key challenges are handling overlapping speakers, noisy environments, and preserving prosodic cues like sarcasm or hesitation that pure transcription destroys. Benchmarks like SUPERB and Dynamic-SUPERB are expanding, but real-world spoken dialogue understanding remains far ahead of what leaderboards capture.


Audio-text-to-text models jointly process speech or audio alongside text prompts to produce text outputs — enabling spoken-language understanding, audio captioning, and speech-grounded Q&A. This task bridges the gap between ASR pipelines and true audio comprehension, where tone, music, and environmental sounds all carry meaning.

History

2020

wav2vec 2.0 demonstrates self-supervised speech representations that rival supervised ASR with minimal labels

2021

HuBERT introduces masked prediction pretraining for speech, achieving SOTA on LibriSpeech

2022

Whisper (OpenAI) ships a 1.5B-param multilingual ASR model trained on 680k hours of web audio

2023

AudioPaLM (Google) fuses PaLM-2 with audio tokens for speech-to-speech translation and understanding

2023

Qwen-Audio processes arbitrary audio inputs (speech, music, environmental sounds) with a unified encoder

2024

Gemini 1.5 Pro natively ingests long-form audio (up to hours) alongside text and images

2024

GPT-4o introduces real-time audio reasoning in voice mode, with response latency averaging roughly 300 ms

2025

Qwen2-Audio and Gemini 2.0 push multilingual audio understanding with native multi-turn dialogue over audio

How Audio-Text-to-Text Works

Audio-Text-to-Text Pipeline
1

Audio Encoding

Raw waveform is converted to a spectrogram or learned audio tokens via an encoder (Whisper encoder, audio-specific ViT, or HuBERT). The encoder captures phonetic, prosodic, and acoustic features.

2

Feature Projection

Audio embeddings are projected into the same embedding space as the LLM's text tokens, often via a lightweight adapter (linear projection, Q-Former, or cross-attention bridge).

3

Interleaved Decoding

The LLM processes the interleaved sequence of audio and text tokens autoregressively, attending over both modalities to generate text outputs.

4

Post-processing

Output text may include transcriptions, summaries, answers to questions about audio content, or structured metadata like speaker diarization and sentiment labels.
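The four stages above can be sketched end to end with NumPy. This is a minimal illustration, not a real model: the "encoder" reduces each audio frame to hand-picked statistics where a Whisper-style encoder would emit learned embeddings, and the adapter is a single random linear projection standing in for a trained Q-Former or cross-attention bridge.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- 1. Audio encoding (stand-in for a Whisper-style encoder) ---
# Frame the raw waveform into 25 ms windows with a 10 ms hop and reduce
# each frame to a 4-dim feature vector (illustrative, not learned).
def encode_audio(waveform, sr=16_000, win=0.025, hop=0.010):
    w, h = int(sr * win), int(sr * hop)
    frames = [waveform[i:i + w] for i in range(0, len(waveform) - w, h)]
    return np.stack([[f.mean(), f.std(), np.abs(f).max(), (f ** 2).mean()]
                     for f in frames])            # (n_frames, 4)

# --- 2. Feature projection into the LLM embedding space ---
# A single linear adapter; real systems may use a Q-Former or
# cross-attention bridge instead.
D_MODEL = 16
W_proj = rng.normal(scale=0.1, size=(4, D_MODEL))

def project(audio_feats):
    return audio_feats @ W_proj                   # (n_frames, d_model)

# --- 3. Interleaved decoding input ---
# Concatenate projected audio tokens with text-prompt embeddings so the
# LLM can attend over both modalities in one autoregressive sequence.
def build_sequence(audio_emb, text_emb):
    return np.concatenate([audio_emb, text_emb], axis=0)

one_second = rng.normal(size=16_000)              # 1 s of synthetic audio
audio_emb = project(encode_audio(one_second))
text_emb = rng.normal(size=(5, D_MODEL))          # 5 fake prompt tokens
seq = build_sequence(audio_emb, text_emb)
print(seq.shape)                                  # (n_audio_frames + 5, 16)
```

Step 4 (post-processing) is whatever the decoder emits: here the sequence would be fed to an LLM, whose text output could then be parsed into transcripts, answers, or diarization labels.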

Current Landscape

The audio-text-to-text landscape in 2025 is rapidly consolidating around natively multimodal LLMs that can ingest raw audio without a separate ASR stage. Gemini 2.0 and GPT-4o lead the proprietary space with native audio understanding, while Qwen2-Audio dominates open-source. The old pipeline of ASR → NLP is giving way to end-to-end models that preserve prosody, tone, and non-speech audio cues. However, two-stage approaches (Whisper + LLM) remain competitive for pure transcription tasks and offer better controllability. Key differentiators now are long-context audio handling, real-time streaming support, and non-speech audio comprehension.
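The two-stage approach can be sketched as plain control flow. Both stages are mocked here so the example runs without model weights: in practice stage 1 would be a Whisper call and stage 2 an LLM request. The stub makes the trade-off visible in code: once audio becomes plain text at the stage boundary, only whatever the transcript preserves (here, filler words and pauses) survives for the LLM to reason over.

```python
# Two-stage ASR -> LLM pipeline, sketched with stub components.
# transcribe() stands in for a Whisper model; reason_over_transcript()
# stands in for a text-only LLM call. Both are hypothetical mocks.

def transcribe(audio_path: str) -> str:
    """Stage 1: robust ASR. Returns plain text, so prosody, tone, and
    non-speech sounds are lost at this boundary unless they leak into
    the transcript as fillers or punctuation."""
    return "um, so the quarterly numbers are... not great"

def reason_over_transcript(transcript: str, question: str) -> str:
    """Stage 2: text-only reasoning over the transcript. A real LLM
    would answer the question; this stub keys off hesitation markers."""
    hesitant = "um" in transcript or "..." in transcript
    return ("The speaker sounds hesitant about the results."
            if hesitant else "The speaker sounds confident.")

answer = reason_over_transcript(
    transcribe("earnings_call.wav"),
    "How does the speaker feel about the results?",
)
print(answer)
```

An end-to-end model replaces both functions with one forward pass over audio tokens, which is why it can pick up prosodic cues the transcript never records.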

Key Challenges

Long-form audio context — encoding a 1-hour podcast into a manageable token budget without losing temporal detail

Speaker attribution and diarization — models struggle to track who said what in multi-speaker settings

Non-speech audio understanding — environmental sounds, music analysis, and acoustic scene classification lag behind speech tasks

Multilingual and code-switched audio — performance drops sharply on low-resource languages and mixed-language speech

Hallucination under noise — models confabulate transcriptions when audio is noisy or low-quality

Quick Recommendations

Best overall

Gemini 2.0 Flash

Natively processes long-form audio with text, strong multilingual support, fast inference, and competitive pricing

Best accuracy on speech tasks

GPT-4o

Highest accuracy on speech understanding benchmarks with real-time audio reasoning capabilities

Open source

Qwen2-Audio-7B-Instruct

Best open-weight audio-language model; handles speech, music, and sound effects in a unified architecture

ASR + understanding pipeline

Whisper Large-v3 + GPT-4o

Whisper handles robust transcription, GPT-4o reasons over the transcript — still the most reliable two-stage approach

On-device / edge

Whisper Small + Phi-3-mini

Sub-1GB total footprint, runs on mobile devices with acceptable latency for transcription + summarization

What's Next

The next frontier is real-time audio agents that can interrupt, ask clarifying questions, and maintain multi-turn conversations over streaming audio. Expect models to handle 10+ hour audio contexts natively, improve speaker diarization without external tools, and develop richer understanding of music, emotion, and acoustic environments. Audio-visual grounding — where models jointly process video and audio tracks — will become standard in multimodal systems.

Benchmarks & SOTA

Related Tasks

Image-Text-to-Image

Image-text-to-image covers instruction-guided image editing — taking a source image plus a text command and producing a modified result. InstructPix2Pix (2023) demonstrated this could work zero-shot, and subsequent models like DALL-E 3's inpainting, Ideogram, and Stable Diffusion's img2img pipelines made it practical. The core difficulty is surgical precision: users want "change the dress to red" without altering the face, background, or lighting, which requires disentangling content from style at a level current architectures still fumble. Benchmarks are fragmented across editing fidelity, instruction-following, and identity preservation, making unified comparison difficult.

Image-Text-to-Text

Image-text-to-text exploded from a research curiosity to the dominant AI interface in under two years. GPT-4V (2023) proved multimodal LLMs could reason over images, Gemini 1.5 scaled to million-token contexts mixing text and vision, and Claude 3 showed that careful RLHF produces models that refuse to hallucinate about image content. MMMU and MMBench have become the standard evaluation gauntlet, but the real challenge is grounding — models still confabulate spatial relationships and struggle with fine-grained visual reasoning. This is the task that turned chatbots into visual assistants.

Image-Text-to-Video

Image-text-to-video is generative AI's hardest unsolved frontier — animating a still image according to a text prompt while maintaining temporal coherence and physical plausibility. Stable Video Diffusion (2023) and Runway Gen-2 showed early promise, Sora (2024) raised the bar dramatically with minute-long physically consistent clips, and Kling and Veo 2 pushed quality further. The fundamental challenge is that video generation requires implicit world models: objects must persist, lighting must evolve consistently, and motion must obey approximate physics across dozens of frames. Evaluation is still largely human-judged, with FVD and CLIP-temporal scores poorly correlating with perceived quality.

Visual Question Answering

Visual question answering (VQA) is the original multimodal reasoning task — given an image and a natural language question, produce the correct answer. VQAv2 (2017) defined the field, but modern benchmarks like GQA, OK-VQA, and TextVQA have pushed toward compositional reasoning, external knowledge, and OCR-dependent understanding. The task was largely "solved" in its classic form once multimodal LLMs arrived, with GPT-4V and Gemini saturating standard benchmarks, but adversarial and compositional variants still expose systematic failures in spatial reasoning and counting. VQA's legacy is establishing that vision-language models need more than pattern matching — they need genuine visual understanding.

Something wrong or missing?

Help keep Audio-Text-to-Text benchmarks accurate. Report outdated results, missing benchmarks, or errors.
