Understanding Audio Intelligence
From classifying environmental sounds to generating music, audio AI has evolved rapidly. Compare models on AudioSet and ESC-50, and explore the cutting edge of sound understanding.
How Audio AI Works
Most modern audio models don't process raw waveforms directly. They convert audio into visual representations called spectrograms, then apply computer vision techniques. Here's the pipeline:
Step 1: Raw Waveform Input
Audio starts as a 1D waveform signal, typically sampled at 16 kHz or 22.05 kHz. This raw representation captures amplitude over time but does not make frequency content explicit.
Step 2: Mel Spectrogram Conversion
The waveform is converted to a 2D mel spectrogram using STFT + mel filterbank. This creates an "image" where X is time, Y is frequency (mel scale), and color is intensity.
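A minimal sketch of this conversion with torchaudio (the parameter values here are common defaults, not the exact settings any particular model uses):

```python
import torchaudio
import torchaudio.transforms as T

# Load a mono audio file; the sample rate depends on the file.
waveform, sample_rate = torchaudio.load("sound.wav")

# STFT + mel filterbank: 1D waveform -> 2D mel spectrogram.
# n_fft / hop_length / n_mels are typical choices, not universal constants.
mel = T.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=1024,
    hop_length=320,
    n_mels=128,
)(waveform)

# Log-compress amplitudes (dB scale), as most models expect.
log_mel = T.AmplitudeToDB()(mel)
print(log_mel.shape)  # (channels, n_mels, time_frames)
```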
Step 3: Transformer / CNN Processing
The spectrogram is processed by a Vision Transformer (ViT) or CNN. Patch embeddings capture local patterns, while attention mechanisms capture long-range dependencies. The example below runs the full pipeline with a pretrained AST model:
```python
import torch
import torchaudio
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

# Load audio file and resample to 16 kHz mono, which the AST checkpoint expects
waveform, sr = torchaudio.load("sound.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
    sr = 16000

# Load pretrained AST model fine-tuned on AudioSet
extractor = AutoFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
model = AutoModelForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

# Extract mel spectrogram features
inputs = extractor(waveform.squeeze().numpy(), sampling_rate=sr, return_tensors="pt")

# Classify
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(-1).item()
print(f"Predicted: {model.config.id2label[predicted_class]}")
```

Audio Classification
Mean Average Precision (mAP)
AudioSet is a multi-label classification problem. A 10-second clip might contain "Speech", "Dog barking", and "Music" simultaneously. We use mAP to measure how well the model ranks positive labels above negative ones.
- Per-Class AP: compute Average Precision for each of the 527 evaluation classes from the ranking of its predicted scores.
- Mean Across Classes: average the per-class APs to get mAP. Higher is better (0.0 to 1.0 scale); a minimal computation sketch follows below.
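A minimal sketch of the metric using scikit-learn (array shapes and values are illustrative placeholders, not from any specific evaluation harness):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# y_true: binary multi-label targets, y_score: model scores,
# both shaped (n_clips, n_classes). Random placeholders stand in for real data.
rng = np.random.default_rng(0)
y_true = (rng.random((1000, 527)) < 0.01).astype(int)
y_score = rng.random((1000, 527))

# Per-class AP, skipping classes with no positive examples.
aps = [
    average_precision_score(y_true[:, c], y_score[:, c])
    for c in range(y_true.shape[1])
    if y_true[:, c].any()
]

# mAP = mean of the per-class APs.
print(f"mAP: {np.mean(aps):.3f}")
```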
AudioSet Scale
- 2M+ Audio Clips: 10-second segments from YouTube videos
- 632 Sound Classes: organized in a hierarchical ontology
- 527 Evaluation Classes: filtered for quality and balance
- ~20K Eval Clips: the evaluation set used for benchmark comparison
AudioSet Leaderboard
Mean Average Precision on AudioSet evaluation set. Higher is better.
| Rank | Model | Organization | mAP | Architecture | Type | Year |
|---|---|---|---|---|---|---|
| #1 | BEATs | Microsoft | 0.498 | Audio Tokenizer + Transformer | Open Source | 2023 |
| #2 | Audio Spectrogram Transformer (AST) | MIT/IBM | 0.485 | Vision Transformer | Open Source | 2021 |
| #3 | HTS-AT | ByteDance | 0.471 | Hierarchical Token-Semantic Audio Transformer | Open Source | 2022 |
| #4 | CLAP | LAION/Microsoft | 0.463 | Contrastive Learning | Open Source | 2023 |
| #5 | PANNs (CNN14) | ByteDance | 0.431 | CNN | Open Source | 2020 |
| #6 | wav2vec 2.0 | Meta | 0.392 | Self-supervised | Open Source | 2020 |
ESC-50 Leaderboard
Accuracy on Environmental Sound Classification (50 classes, 5-fold cross-validation). Higher is better.
| Rank | Model | Organization | Accuracy (%) | Type | Year |
|---|---|---|---|---|---|
| #1 | BEATs | Microsoft | 98.1 | Open Source | 2023 |
| #2 | CLAP | LAION/Microsoft | 96.7 | Open Source | 2023 |
| #3 | AST | MIT/IBM | 95.6 | Open Source | 2021 |
| #4 | PANNs | ByteDance | 94.7 | Open Source | 2020 |
| #5 | wav2vec 2.0 + Linear | Meta | 92.3 | Open Source | 2020 |
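The 5-fold protocol mentioned above, as a sketch: ESC-50 ships a metadata file (meta/esc50.csv) with a fold column; models are trained on four folds, tested on the held-out fold, and the five accuracies are averaged. The majority-class "model" below is a deliberately trivial placeholder to keep the sketch runnable; swap it for real training and inference on the audio files.

```python
import pandas as pd

# ESC-50 metadata: one row per clip, with 'filename', 'fold' (1-5), 'target', 'category'.
# The path assumes a local clone of the ESC-50 repository.
meta = pd.read_csv("ESC-50/meta/esc50.csv")

accuracies = []
for fold in sorted(meta["fold"].unique()):
    train_df = meta[meta["fold"] != fold]   # train on the other four folds
    test_df = meta[meta["fold"] == fold]    # hold out one fold for testing

    # Placeholder "model": predict the most frequent training class.
    majority_class = train_df["target"].mode()[0]
    predictions = [majority_class] * len(test_df)

    accuracy = (test_df["target"].values == predictions).mean()
    accuracies.append(accuracy)

# Reported ESC-50 accuracy is the mean over the 5 held-out folds.
print(f"5-fold accuracy: {sum(accuracies) / len(accuracies):.1%}")
```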
Music Generation
The New Era of AI Music
2024 marked a breakthrough in music generation. Models like Suno and Udio can now generate full songs with vocals, lyrics, and production quality approaching that of professional studios.
Unlike TTS, which synthesizes speech from text, music generation creates complex multi-track compositions, handling melody, harmony, rhythm, lyrics, and vocal performance simultaneously.
Key Capabilities
- Text-to-Music: Describe a song, get audio
- Lyrics + Melody: Generate vocals with coherent lyrics
- Style Transfer: Convert between genres
- Continuation: Extend existing audio clips
Evaluation Challenge
Unlike classification where we have ground truth labels, music generation quality is subjective. Current evaluation methods include:
- FAD (Fréchet Audio Distance): statistical distance between embeddings of generated and real music (sketched below)
- MOS (Mean Opinion Score): human ratings of quality, coherence, and musicality on a 1-5 scale
- KLD (KL Divergence): distribution similarity for genre/instrument classification
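A minimal sketch of the FAD computation, assuming you already have embedding matrices for real and generated audio (e.g., from a VGGish or CLAP encoder; the arrays here are random placeholders):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(real_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two sets of embeddings."""
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)

    # ||mu_r - mu_g||^2 + Tr(cov_r + cov_g - 2 * (cov_r cov_g)^(1/2))
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can produce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2 * covmean))

# Placeholder embeddings: 200 clips x 128-dim features each.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 128))
fake = rng.normal(loc=0.1, size=(200, 128))
print(f"FAD: {frechet_audio_distance(real, fake):.3f}")
```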
Music Generation Models
Comparison of text-to-music and audio generation models. Quality assessed via community consensus and published evaluations.
| Model | Organization | Quality | Key Features | Type | Year |
|---|---|---|---|---|---|
| Suno v3.5 | Suno | Excellent | Full songs with vocals, lyrics generation | Cloud API | 2024 |
| Udio | Udio | Excellent | High-quality vocals, genre diversity | Cloud API | 2024 |
| MusicGen | Meta | Good | Text-to-music, melody conditioning | Open Source | 2023 |
| Stable Audio 2.0 | Stability AI | Good | Long-form generation, audio-to-audio | Open Source | 2024 |
| AudioCraft | Meta | Good | MusicGen + AudioGen combined | Open Source | 2023 |
| Riffusion | Community | Fair | Spectrogram diffusion | Open Source | 2022 |
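For the open-source route, here is a minimal text-to-music sketch with MusicGen via Hugging Face transformers (the facebook/musicgen-small checkpoint is assumed; larger variants follow the same API):

```python
import scipy.io.wavfile
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

# Text prompt describing the desired music.
inputs = processor(
    text=["lo-fi hip hop beat with mellow piano and vinyl crackle"],
    padding=True,
    return_tensors="pt",
)

# ~256 new tokens corresponds to roughly 5 seconds of audio.
audio_values = model.generate(**inputs, do_sample=True, max_new_tokens=256)

# Write the generated waveform to disk.
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].numpy())
```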
Audio Captioning & Understanding
Audio Captioning
Generate natural language descriptions of audio content. The task goes beyond classification to provide detailed, contextual descriptions: "A dog barks twice, followed by a car horn in the distance."
Key datasets: AudioCaps, Clotho, WavCaps
Audio-Language Models
The latest frontier: multimodal LLMs that can understand and reason about audio. These models combine audio encoders with large language models for open-ended audio understanding.
Examples: Qwen2-Audio, SALMONN, LTU, Pengi
Audio Understanding Models
Models for audio captioning, audio question answering, and general audio understanding.
| Model | Organization | Performance | Key Features | Type | Year |
|---|---|---|---|---|---|
| Qwen2-Audio | Alibaba | SOTA | Multimodal LLM with audio understanding | Open Source | 2024 |
| SALMONN | Tencent | Excellent | Speech + Audio LLM | Open Source | 2024 |
| Whisper-AT | OpenAI/Community | Good | Audio tagging with Whisper encoder | Open Source | 2023 |
| CLAP + GPT | Various | Good | Embeddings + LLM generation | Hybrid | 2023 |
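As a concrete entry point to language-grounded audio understanding, here is a sketch of zero-shot audio tagging with CLAP through the transformers zero-shot-audio-classification pipeline (the checkpoint name, file path, and candidate labels are illustrative):

```python
from transformers import pipeline

# CLAP scores audio embeddings against text embeddings of candidate labels,
# so no task-specific fine-tuning is needed.
classifier = pipeline(
    "zero-shot-audio-classification",
    model="laion/clap-htsat-unfused",
)

results = classifier(
    "sound.wav",
    candidate_labels=["dog barking", "car horn", "rain falling", "acoustic guitar"],
)

for r in results:
    print(f"{r['label']}: {r['score']:.3f}")
```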
Why Transformers Dominate Audio
Audio signals have long-range dependencies. A musical phrase might span several seconds; a spoken sentence requires understanding context from start to finish.
Transformers with self-attention naturally capture these dependencies. The Audio Spectrogram Transformer (AST) treats spectrograms as images and applies the Vision Transformer architecture, achieving state-of-the-art results at its release by leveraging pretrained ImageNet weights and fine-tuning on audio.
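A minimal sketch of the patch-embedding idea AST borrows from ViT: the 2D spectrogram is split into fixed-size patches, each projected to an embedding vector that the Transformer then attends over (the 16x16 patch size and 768-dim embedding are typical ViT values, not a claim about any specific checkpoint):

```python
import torch
import torch.nn as nn

# Fake log-mel spectrogram: batch of 1, single channel, 128 mel bins x 1024 frames.
spectrogram = torch.randn(1, 1, 128, 1024)

# A strided convolution is the standard trick for non-overlapping patch embedding:
# each 16x16 spectrogram patch becomes one 768-dim token.
patch_embed = nn.Conv2d(in_channels=1, out_channels=768, kernel_size=16, stride=16)

tokens = patch_embed(spectrogram)            # (1, 768, 8, 64)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 512, 768): a sequence of patch tokens

# Self-attention relates every patch to every other patch, which is how
# long-range time-frequency dependencies are captured.
attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
out, _ = attn(tokens, tokens, tokens)
print(out.shape)  # torch.Size([1, 512, 768])
```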
AudioSet Challenges
Despite its size, AudioSet has known issues that affect benchmarking:
- Label Noise: Human annotations are imperfect; ~30% of labels may have some error
- Class Imbalance: "Speech" appears in millions of clips; rare sounds have only hundreds
- Missing Videos: ~20% of original YouTube videos are now unavailable
- Multi-label Complexity: Average of 2.7 labels per clip makes evaluation nuanced
Key Datasets
AudioSet
(2017) 2M+ human-labeled 10-second YouTube video clips covering 632 audio event classes.
ESC-50
(2015) 2,000 environmental audio recordings organized into 50 classes (animals, natural soundscapes, etc.).
Summary: Which Model Should You Use?
- Audio Classification: BEATs is the strongest open-source option on both AudioSet (0.498 mAP) and ESC-50 (98.1%); AST is a well-supported alternative with readily available pretrained checkpoints.
- Music Generation: Suno and Udio lead on overall quality via cloud APIs; MusicGen (or the broader AudioCraft toolkit) is the leading open-source choice.
- Audio Understanding: Qwen2-Audio currently leads among open-source audio-language models; CLAP paired with an LLM is a lighter-weight hybrid alternative.
Contribute to Audio AI
Have you achieved better results on AudioSet or ESC-50? Working on novel audio generation models? Help the community by sharing your verified results.