Audio AI Benchmark

Understanding Audio Intelligence

From classifying environmental sounds to generating music, audio AI has evolved rapidly. Compare models on AudioSet and ESC-50, and explore the cutting edge of sound understanding.

Benchmark Stats

Best mAP (AudioSet): 0.498
Best Accuracy (ESC-50): 98.1%
AudioSet Classes: 632

Audio Classification

AudioSet Leaderboard

Mean average precision (mAP) on the AudioSet evaluation set. Higher is better.

| Rank | Model | Organization | mAP | Architecture | Type | Year |
|------|-------|--------------|-----|--------------|------|------|
| #1 | BEATs | Microsoft | 0.498 | Audio Tokenizer + Transformer | Open Source | 2023 |
| #2 | Audio Spectrogram Transformer (AST) | MIT/IBM | 0.485 | Vision Transformer | Open Source | 2021 |
| #3 | HTS-AT | ByteDance | 0.471 | Hierarchical Token-Semantic Audio Transformer | Open Source | 2022 |
| #4 | CLAP | LAION/Microsoft | 0.463 | Contrastive Learning | Open Source | 2023 |
| #5 | PANNs (CNN14) | ByteDance | 0.431 | CNN | Open Source | 2020 |
| #6 | wav2vec 2.0 | Meta | 0.392 | Self-supervised | Open Source | 2020 |
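As a rough illustration of the metric behind this leaderboard, the sketch below computes mean average precision for a multi-label tagging task: average precision is calculated per class from the ranked clip scores, then averaged over classes. It is a minimal, dependency-free sketch on made-up toy data, not the evaluation code used by any model listed here. The function names and the toy arrays are assumptions for illustration, and tie handling may differ from library implementations.

```python
from typing import List, Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AP for one class: mean of precision@k over every rank k that holds a positive clip."""
    # Rank clips from highest to lowest score (ties keep their input order).
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("class has no positive clips")
    return precision_sum / hits


def mean_average_precision(y_true: List[List[int]], y_score: List[List[float]]) -> float:
    """y_true and y_score are clips x classes; classes without positives are skipped."""
    n_classes = len(y_true[0])
    per_class = []
    for c in range(n_classes):
        labels = [row[c] for row in y_true]
        if sum(labels) == 0:
            continue  # AP is undefined for a class with no positives
        scores = [row[c] for row in y_score]
        per_class.append(average_precision(labels, scores))
    return sum(per_class) / len(per_class)


# Toy data (hypothetical): 4 clips, 2 classes, multi-label targets.
toy_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_score = [[0.9, 0.2], [0.8, 0.7], [0.6, 0.9], [0.1, 0.3]]
toy_map = mean_average_precision(toy_true, toy_score)
print(f"mAP = {toy_map:.4f}")
```

Because each class contributes equally to the mean, rare classes weigh as much as common ones, which is one reason mAP values on AudioSet sit well below typical accuracy figures.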

ESC-50 Leaderboard

Accuracy on Environmental Sound Classification (50 classes, 5-fold cross-validation). Higher is better.

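| Rank | Model | Organization | Accuracy (%) | Type | Year |
|------|-------|--------------|--------------|------|------|
| #1 | BEATs | Microsoft | 98.1 | Open Source | 2023 |
| #2 | CLAP | LAION/Microsoft | 96.7 | Open Source | 2023 |
| #3 | AST | MIT/IBM | 95.6 | Open Source | 2021 |
| #4 | PANNs | ByteDance | 94.7 | Open Source | 2020 |
| #5 | wav2vec 2.0 + Linear | Meta | 92.3 | Open Source | 2020 |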
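The 5-fold protocol mentioned above can be sketched as follows: each clip belongs to one of five folds, a model is trained on four folds and tested on the held-out one, and the reported number is the mean of the five per-fold accuracies. The snippet assumes predictions have already been collected for every held-out fold. The function name, the record layout, and the toy labels are hypothetical, chosen only to show the averaging step.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[int, str, str]  # (fold, true_label, predicted_label)


def cross_validated_accuracy(records: Iterable[Record]) -> Tuple[Dict[int, float], float]:
    """Return per-fold accuracy and its unweighted mean across folds.

    Each record's prediction is assumed to come from a model trained on
    the other folds, so no clip is scored by a model that saw it.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for fold, true_label, predicted_label in records:
        total[fold] += 1
        correct[fold] += int(true_label == predicted_label)
    per_fold = {fold: correct[fold] / total[fold] for fold in sorted(total)}
    mean_accuracy = sum(per_fold.values()) / len(per_fold)
    return per_fold, mean_accuracy


# Toy held-out predictions (hypothetical): two clips per fold.
toy_records = [
    (1, "dog", "dog"), (1, "rain", "rain"),
    (2, "dog", "cat"), (2, "rain", "rain"),
    (3, "dog", "dog"), (3, "rain", "rain"),
    (4, "dog", "dog"), (4, "rain", "rain"),
    (5, "dog", "dog"), (5, "rain", "rain"),
]
per_fold_acc, mean_acc = cross_validated_accuracy(toy_records)
print(per_fold_acc, f"mean = {mean_acc:.3f}")
```

Averaging per fold rather than pooling all clips makes no difference when folds are equal in size, but it keeps the result comparable if a fold is ever subsampled.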

Music Generation

Music Generation Models

Comparison of text-to-music and audio generation models. Quality ratings are qualitative, based on community consensus and published evaluations.

| Model | Organization | Quality | Key Features | Type | Year |
|-------|--------------|---------|--------------|------|------|
| Suno v3.5 | Suno | Excellent | Full songs with vocals, lyrics generation | Cloud API | 2024 |
| Udio | Udio | Excellent | High-quality vocals, genre diversity | Cloud API | 2024 |
| MusicGen | Meta | Good | Text-to-music, melody conditioning | Open Source | 2023 |
| Stable Audio 2.0 | Stability AI | Good | Long-form generation, audio-to-audio | Open Source | 2024 |
| AudioCraft | Meta | Good | MusicGen + AudioGen combined | Open Source | 2023 |
| Riffusion | Community | Fair | Spectrogram diffusion | Open Source | 2023 |

Audio Captioning & Understanding

Audio Understanding Models

Models for audio captioning, audio question answering, and general audio understanding.

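| Model | Organization | Performance | Key Features | Type | Year |
|-------|--------------|-------------|--------------|------|------|
| Qwen2-Audio | Alibaba | SOTA | Multimodal LLM with audio understanding | Open Source | 2024 |
| SALMONN | Tsinghua/ByteDance | Excellent | Speech + Audio LLM | Open Source | 2024 |
| Whisper-AT | OpenAI/Community | Good | Audio tagging with Whisper encoder | Open Source | 2023 |
| CLAP + GPT | Various | Good | Embeddings + LLM generation | Hybrid | 2023 |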

Contribute to Audio AI

Have you achieved better results on AudioSet or ESC-50? Working on novel audio generation models? Help the community by sharing your verified results.