Audio-Language Models
Audio-Language Models (ALMs) are a form of artificial intelligence that extend natural language processing (NLP) to the domain of audio, enabling computers to understand, generate, and reason about sounds and speech by integrating audio data with language understanding. Trained on audio-text data, ALMs bridge the gap between acoustic signals and linguistic meaning, allowing for tasks like zero-shot audio recognition, audio captioning, and the creation of generative audio, such as text-to-audio synthesis.
Audio-language modeling is a key task in audio. Below you will find the standard benchmarks used to evaluate these models, along with current state-of-the-art results.
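As a minimal, hands-on illustration of the zero-shot audio recognition capability described above, the sketch below uses the Hugging Face transformers zero-shot audio classification pipeline with a CLAP-style checkpoint. The checkpoint name, audio path, and candidate labels are illustrative assumptions, not part of any benchmark listed below.

```python
# Minimal sketch: zero-shot audio classification with a CLAP-style audio-language model.
# The checkpoint, file path, and labels are illustrative assumptions.
from transformers import pipeline

classifier = pipeline(
    task="zero-shot-audio-classification",
    model="laion/clap-htsat-unfused",  # assumed publicly available CLAP checkpoint
)

result = classifier(
    "clip.wav",  # placeholder path to any local audio file
    candidate_labels=["dog barking", "acoustic guitar", "human speech", "rainfall"],
)
print(result)  # labels ranked by similarity score
```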
Benchmarks & SOTA
MMSU
MMSU is a comprehensive multi-task spoken language understanding and reasoning benchmark comprising 5,000 audio-question-answer triplets across 47 distinct tasks. It covers a wide range of linguistic phenomena, including phonetics, prosody, syntax, semantics, and paralinguistics.
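Benchmarks built from audio-question-answer triplets are typically scored by answer accuracy. The sketch below shows one plausible way to represent and score such triplets; the field names and the predict() stub are assumptions, not the official MMSU schema or evaluation code.

```python
# Hypothetical triplet layout and accuracy computation for an audio QA benchmark.
# Field names and predict() are assumptions, not the official MMSU format.
from dataclasses import dataclass

@dataclass
class AudioQATriplet:
    audio_path: str  # path to the audio clip
    question: str    # natural-language question about the clip
    answer: str      # gold answer string

def predict(example: AudioQATriplet) -> str:
    """Stand-in for an audio-language model call; replace with a real model."""
    raise NotImplementedError

def accuracy(triplets: list[AudioQATriplet]) -> float:
    correct = sum(
        predict(t).strip().lower() == t.answer.strip().lower() for t in triplets
    )
    return correct / len(triplets)
```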
No results tracked yet
MMAU
MMAU (Massive Multi-Task Audio Understanding and Reasoning Benchmark) is a benchmark for audio-language models. It features 10,000 human-annotated audio-question-response pairs and covers 27 distinct skills across unique tasks.
No results tracked yet
AudioCaps
AudioCaps is an audio captioning dataset used with audio-language models. It contains paired audio-text data, with a training set of 49,838 captions for audio clips ranging from 0.5 to 10 seconds in duration.
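AudioCaps is commonly distributed as CSV caption files whose rows reference AudioSet clips. The snippet below sketches reading such a file; the file name and column names (youtube_id, caption) are assumptions to verify against the actual release.

```python
# Sketch: read an AudioCaps-style caption CSV.
# File name and column names are assumptions about the public release.
import csv

with open("train.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(len(rows))  # expected to be on the order of 49,838 for the training split
print(rows[0]["youtube_id"], rows[0]["caption"])  # source clip id and its caption
```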
No results tracked yet
Clotho-AQA
Clotho-AQA is a crowdsourced dataset for audio question answering, used to train and evaluate audio-language models. The dataset is freely available online.
No results tracked yet
MMAR
MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
A benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. It comprises 1,000 meticulously curated audio-question-answer triplets covering speech, audio, music, and their mix, collected from real-world internet videos.
No results tracked yet
CMM hallucination
CMM is a curated benchmark designed to evaluate hallucination vulnerabilities in Large Multi-Modal Models (LMMs). It is constructed to rigorously test LMMs’ capabilities across visual, audio, and language modalities, focusing on hallucinations arising from inter-modality spurious correlations and uni-modal over-reliance.
No results tracked yet
CompA-R-test
CompA-R-test is a human-labeled evaluation set for assessing the capabilities of large audio-language models on open-ended audio question answering that requires complex reasoning.
No results tracked yet
MusicAVQA
MusicAVQA is a benchmark for Audio-Visual Question Answering, which aims to answer questions regarding different visual objects, sounds, and their associations in videos. The problem requires comprehensive multimodal understanding and spatio-temporal reasoning over audio-visual scenes.
No results tracked yet
NSynth
The NSynth dataset contains four-second 16 kHz audio snippets for each instrument, ranging over every pitch of a standard MIDI piano and five different velocities. It's designed as a benchmark for audio machine learning and a foundation for future datasets. It includes information on the source and family of sound production for each instrument, as well as features like pitch, velocity, sample rate, and sonic qualities. The dataset has train, valid, and test splits, with instruments not overlapping between these splits.
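The public NSynth release ships per-split JSON metadata alongside the WAV files. The sketch below reads that metadata and filters notes by pitch and velocity; the file path and field names follow the public release but should be treated as assumptions.

```python
# Sketch: filter NSynth notes by pitch and velocity using the split's JSON metadata.
# The path and field names are assumptions based on the public release.
import json

with open("nsynth-train/examples.json", encoding="utf-8") as f:
    examples = json.load(f)  # dict mapping note id -> metadata record

middle_c_loud = [
    note_id
    for note_id, meta in examples.items()
    if meta["pitch"] == 60 and meta["velocity"] >= 100
]
print(len(middle_c_loud), "notes at MIDI pitch 60 with velocity >= 100")
```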
No results tracked yet
Music Instruct
The Music Instruct (MI) dataset contains Q&A pairs related to individual musical compositions, specifically tailored for open-ended music queries. It originates from the music-caption pairs in the MusicCaps dataset and was created through prompt engineering and few-shot prompting of GPT-4.
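The construction recipe above (few-shot prompting of GPT-4 over music captions) can be sketched as follows. The prompt wording and the example caption are assumptions for illustration only, not the Music Instruct authors' actual prompts.

```python
# Sketch: generate Q&A pairs from a music caption via few-shot prompting.
# The prompt text is an assumption; it is not the dataset authors' exact prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

caption = "A mellow acoustic guitar piece with soft percussion and a slow tempo."
few_shot = (
    "Caption: An upbeat electronic track with a driving bassline.\n"
    "Q: What is the tempo like? A: Fast and energetic.\n\n"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Write question-answer pairs about the described music."},
        {"role": "user", "content": few_shot + f"Caption: {caption}\nQ:"},
    ],
)
print(response.choices[0].message.content)
```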
No results tracked yet
MuchoMusic
MuchoMusic is a benchmark for evaluating music understanding in multimodal audio-language models.
No results tracked yet
LibriSQA
LibriSQA is a dataset and framework for spoken question answering with large language models. The accompanying framework is lightweight and end-to-end, seamlessly integrating both speech and text into LLMs and eliminating the need for ASR modules.
No results tracked yet
LongAudioBench
LongAudioBench was proposed in the Audio Flamingo 2 paper. It is based on LongAudio, which consists of over 80K unique audios and approximately 263K AQA pairs, with audio drawn from open long-video datasets. LongAudio supports six tasks: Captioning, Plot QA, Temporal QA, Needle QA, Subscene QA, and General QA. LongAudioBench contains 2,429 expert human-annotated instances across these tasks.
No results tracked yet
Clotho-v2
Clotho is an audio captioning dataset that has now reached version 2. It consists of 6,974 audio samples, each with five captions (34,870 captions in total). Clotho v2 adds around 2,000 audio files (a ~40% increase over v1), introducing more training data and a new validation split, while the evaluation and testing splits are kept the same as in v1. Each audio file is 15-30 seconds long and has five captions of eight to 20 words.
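Clotho's captions are commonly distributed as CSV files with one row per audio file and five caption columns. The sketch below checks the stated eight-to-20-word caption length range; the file name and caption_1 … caption_5 column names are assumptions to verify against the release.

```python
# Sketch: check Clotho caption lengths against the stated 8-20 word range.
# File name and caption_1..caption_5 column names are assumptions about the release.
import csv

lengths = []
with open("clotho_captions_development.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        for i in range(1, 6):
            lengths.append(len(row[f"caption_{i}"].split()))

print(min(lengths), max(lengths))  # expected to fall roughly within 8-20 words
```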
No results tracked yet
IEMOCAP
The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database is an acted, multimodal, multispeaker database collected at the SAIL lab at USC. It contains approximately 12 hours of audiovisual data, including video, speech, facial motion capture, and text transcriptions. It consists of dyadic sessions in which actors perform improvisations or scripted scenarios specifically selected to elicit emotional expressions. The database is annotated by multiple annotators with categorical labels, such as anger, happiness, sadness, and neutrality, as well as dimensional labels such as valence, activation, and dominance.
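IEMOCAP is commonly evaluated as four-class speech emotion recognition (anger, happiness, sadness, neutral), reporting weighted and unweighted accuracy. The sketch below shows those two metrics; the label strings are illustrative rather than the database's exact annotation values.

```python
# Sketch: weighted vs. unweighted accuracy for 4-class emotion recognition.
# Label strings are illustrative; IEMOCAP's own annotation labels may differ.
from collections import defaultdict

def weighted_accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Overall fraction of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def unweighted_accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Mean of per-class recalls, so rare classes weigh as much as common ones."""
    per_class = defaultdict(lambda: [0, 0])  # label -> [correct, total]
    for t, p in zip(y_true, y_pred):
        per_class[t][1] += 1
        per_class[t][0] += int(t == p)
    return sum(correct / total for correct, total in per_class.values()) / len(per_class)
```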
No results tracked yet
CochlScene
CochlScene is a crowd-sourced acoustic scene dataset consisting of 76k samples collected across 13 distinct acoustic scenes. Each audio file is single-channel and 10 seconds long. It includes a manual data split for training, validation, and test sets to increase the reliability of evaluation results.
No results tracked yet
NonSpeech7k
The NonSpeech7k dataset consists of 7,014 files (6,289 for training, 725 for testing) delivered as 32kHz, mono audio files in .wav format. The files are approximately 4 seconds long, and the dataset has a total duration of 405 minutes. It contains 7 event classes: breath, cough, crying, laugh, screaming, sneeze, and yawn.
No results tracked yet
OpenAudioBench - LlamaQuestions
OpenAudioBench is an audio understanding evaluation dataset published on Hugging Face by baichuan-inc. It is designed to benchmark multimodal and audio-focused language models across audio-based tasks including logical reasoning, general knowledge, and open-ended question answering. The public Hugging Face repo contains evaluation data directories (e.g., eval_datas/web_questions and eval_datas/reasoning_qa) with audio files and accompanying CSV metadata. The default/test split has roughly 2.9k rows, with audio durations in the release ranging from about 1 s to 50 s. The "LlamaQuestions" subset tracked here corresponds to the audio-driven question-answering evaluation data included in this release. Hugging Face dataset page: https://huggingface.co/datasets/baichuan-inc/OpenAudioBench.
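Because the release is a Hugging Face dataset repo of audio files and CSV metadata rather than a scripted datasets builder, one straightforward way to obtain it is a full repo snapshot. The sketch below uses huggingface_hub; the eval_datas directory layout is taken from the description above and may differ in the current release.

```python
# Sketch: download the OpenAudioBench dataset repo and list its evaluation CSVs.
# Assumes the eval_datas/... layout described above; verify against the current repo.
from pathlib import Path
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="baichuan-inc/OpenAudioBench",
    repo_type="dataset",
)

for csv_path in sorted(Path(local_dir).glob("eval_datas/**/*.csv")):
    print(csv_path.relative_to(local_dir))
```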
No results tracked yet
Related Tasks
Voice cloning
Voice cloning is a type of audio deepfake technology that uses machine learning to create a digital replica of a specific person's voice, synthesizing spoken audio that mimics their vocal characteristics like pitch and tone. While it has positive uses, such as generating audiobooks or helping people who have lost their voice, it is also used for malicious purposes, including creating convincing scams where fraudsters impersonate individuals.
Automatic Speech Recognition
Automatic Speech Recognition (ASR) is the technology that converts spoken language into written text. ASR systems process audio signals containing human speech and transcribe them into readable text format. These systems use acoustic models, language models, and often neural networks to recognize phonemes, words, and sentences from audio input. ASR is foundational for applications like voice assistants (Siri, Alexa), transcription services, voice-controlled systems, and accessibility tools for the hearing impaired.
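A minimal transcription example with the transformers ASR pipeline is sketched below; the Whisper checkpoint and the audio path are illustrative choices, not an endorsement of any particular system.

```python
# Minimal sketch: transcribe a local audio file with an off-the-shelf ASR pipeline.
# The checkpoint and file path are illustrative assumptions.
from transformers import pipeline

asr = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-small",  # assumed publicly available checkpoint
)

output = asr("speech.wav")  # placeholder path to a local recording
print(output["text"])       # the transcribed text
```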
Text-to-speech
Text-to-speech (TTS) is technology that converts written text into natural-sounding audio, also known as "read aloud" technology or speech synthesis. It works by analyzing text to understand words, punctuation, and sentence structure, then generating phonetic representations of those words before synthesizing them into a human-like voice output. TTS is a crucial form of assistive technology and a key component of natural language processing, making digital content accessible and improving user interaction in numerous applications.
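The sketch below runs a text-to-speech pipeline end to end and writes the result to a WAV file; the checkpoint is an assumed publicly available model, and the output handling follows the pipeline's dictionary output (audio array plus sampling rate).

```python
# Minimal sketch: synthesize speech from text and save it as a WAV file.
# The checkpoint is an assumed publicly available model; swap in whatever you use.
import soundfile as sf
from transformers import pipeline

tts = pipeline(task="text-to-speech", model="suno/bark-small")

output = tts("Audio-language models can also generate speech from text.")
audio = output["audio"].squeeze()  # waveform as a NumPy array
sf.write("speech_out.wav", audio, output["sampling_rate"])
```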
Audio Classification
Classification of audio signals into predefined categories such as music genres, environmental sounds, or speaker identification.