Audio AI Benchmark
Understanding Audio Intelligence
From classifying environmental sounds to generating music, audio AI has evolved rapidly. Compare models on AudioSet, ESC-50, and explore the cutting edge of sound understanding.
Benchmark Stats
Best mAP (AudioSet): 0.498
Best Accuracy (ESC-50): 98.1%
AudioSet Ontology Classes: 632
Audio Classification
AudioSet Leaderboard
Mean average precision (mAP) on the AudioSet evaluation set. Higher is better.
| Rank | Model | Organization | mAP | Architecture | Type | Year |
|---|---|---|---|---|---|---|
| #1 | BEATs | Microsoft | 0.498 | Audio Tokenizer + Transformer | Open Source | 2023 |
| #2 | Audio Spectrogram Transformer (AST) | MIT/IBM | 0.485 | Vision Transformer | Open Source | 2021 |
| #3 | HTS-AT | ByteDance | 0.471 | Hierarchical Token-Semantic Audio Transformer | Open Source | 2022 |
| #4 | CLAP | LAION/Microsoft | 0.463 | Contrastive Learning | Open Source | 2023 |
| #5 | PANNs (CNN14) | ByteDance | 0.431 | CNN | Open Source | 2020 |
| #6 | wav2vec 2.0 | Meta | 0.392 | Self-supervised | Open Source | 2020 |
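For readers unfamiliar with the metric, the following is a minimal sketch of how mean average precision can be computed for multi-label audio tagging: average precision is calculated per class from the ranked clip scores, then averaged over classes. The function names and the toy labels and scores are illustrative assumptions, not values from any model on the leaderboard, and published AudioSet numbers are normally produced with library routines that may handle tied scores differently.

```python
from typing import List, Sequence


def average_precision(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Average precision for one class.

    Clips are ranked by descending score; AP is the mean of precision@k
    taken at every rank k where a positive clip appears.
    """
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if y_true[idx] == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("class has no positive clips, AP is undefined")
    return precision_sum / hits


def mean_average_precision(
    labels: List[Sequence[int]], scores: List[Sequence[float]]
) -> float:
    """Unweighted mean of per-class AP over a (clips x classes) matrix."""
    n_classes = len(labels[0])
    per_class = []
    for c in range(n_classes):
        class_true = [row[c] for row in labels]
        class_score = [row[c] for row in scores]
        per_class.append(average_precision(class_true, class_score))
    return sum(per_class) / len(per_class)


# Toy data (hypothetical): 4 clips, 2 classes, multi-hot labels.
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_scores = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.3], [0.6, 0.7]]

toy_map = mean_average_precision(toy_labels, toy_scores)
print(f"toy mAP = {toy_map:.3f}")
```

Because every class contributes equally to the mean, rare classes weigh as much as frequent ones, which is one reason mAP values on AudioSet sit well below typical single-label accuracy figures.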
ESC-50 Leaderboard
Accuracy on ESC-50 (Environmental Sound Classification, 50 classes), averaged over 5-fold cross-validation. Higher is better.
| Rank | Model | Organization | Accuracy (%) | Type | Year |
|---|---|---|---|---|---|
| #1 | BEATs | Microsoft | 98.1 | Open Source | 2023 |
| #2 | CLAP | LAION/Microsoft | 96.7 | Open Source | 2023 |
| #3 | AST | MIT/IBM | 95.6 | Open Source | 2021 |
| #4 | PANNs | ByteDance | 94.7 | Open Source | 2020 |
| #5 | wav2vec 2.0 + Linear | Meta | 92.3 | Open Source | 2020 |
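The 5-fold protocol behind these numbers can be sketched as follows: each fold is held out once, a model is fit on the remaining four, and the five test accuracies are averaged. This is a minimal illustration under stated assumptions. The `Clip` tuple, the 1-D toy feature, and the `nearest_centroid` classifier are hypothetical stand-ins for real audio features and a real model; only the fold-averaging logic is the point.

```python
from collections import Counter
from typing import Callable, List, Tuple

# (fold id, toy 1-D feature, class label): a hypothetical stand-in for a clip
Clip = Tuple[int, float, int]


def cross_validated_accuracy(
    clips: List[Clip],
    fit_predict: Callable[[List[Clip], List[float]], List[int]],
) -> float:
    """Hold out each fold once and return the mean test accuracy in percent."""
    folds = sorted({fold for fold, _, _ in clips})
    fold_accuracies = []
    for held_out in folds:
        train = [c for c in clips if c[0] != held_out]
        test = [c for c in clips if c[0] == held_out]
        predictions = fit_predict(train, [feature for _, feature, _ in test])
        correct = sum(p == label for p, (_, _, label) in zip(predictions, test))
        fold_accuracies.append(correct / len(test))
    return 100.0 * sum(fold_accuracies) / len(fold_accuracies)


def nearest_centroid(train: List[Clip], test_features: List[float]) -> List[int]:
    """Toy classifier: assign each test feature to the closest class mean."""
    sums: dict = {}
    counts: Counter = Counter()
    for _, feature, label in train:
        sums[label] = sums.get(label, 0.0) + feature
        counts[label] += 1
    centroids = {label: sums[label] / counts[label] for label in sums}
    return [
        min(centroids, key=lambda label: abs(centroids[label] - f))
        for f in test_features
    ]


# Toy data (hypothetical): 5 folds, 2 well-separated classes, 1 clip each per fold.
toy_clips: List[Clip] = []
for fold in range(1, 6):
    toy_clips.append((fold, 0.1 * fold, 0))
    toy_clips.append((fold, 10.0 - 0.1 * fold, 1))

toy_accuracy = cross_validated_accuracy(toy_clips, nearest_centroid)
print(f"toy 5-fold accuracy = {toy_accuracy:.1f}%")
```

Keeping the fold assignment fixed across models is what makes the reported accuracies comparable; reshuffling clips into new folds would change which recordings the model sees during training.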
Music Generation
Music Generation Models
Comparison of text-to-music and audio generation models. Quality ratings are qualitative, based on community consensus and published evaluations rather than a single benchmark score.
| Model | Organization | Quality | Key Features | Type | Year |
|---|---|---|---|---|---|
| Suno v3.5 | Suno | Excellent | Full songs with vocals, lyrics generation | Cloud API | 2024 |
| Udio | Udio | Excellent | High-quality vocals, genre diversity | Cloud API | 2024 |
| MusicGen | Meta | Good | Text-to-music, melody conditioning | Open Source | 2023 |
| Stable Audio 2.0 | Stability AI | Good | Long-form generation, audio-to-audio | Open Source | 2024 |
| AudioCraft | Meta | Good | MusicGen + AudioGen combined | Open Source | 2023 |
| Riffusion | Community | Fair | Spectrogram diffusion | Open Source | 2023 |
Audio Captioning & Understanding
Audio Understanding Models
Models for audio captioning, audio question answering, and general audio understanding.
| Model | Organization | Performance | Key Features | Type | Year |
|---|---|---|---|---|---|
| Qwen2-Audio | Alibaba | SOTA | Multimodal LLM with audio understanding | Open Source | 2024 |
| SALMONN | Tsinghua/ByteDance | Excellent | Speech + Audio LLM | Open Source | 2024 |
| Whisper-AT | OpenAI/Community | Good | Audio tagging with Whisper encoder | Open Source | 2023 |
| CLAP + GPT | Various | Good | Embeddings + LLM generation | Hybrid | 2023 |
Contribute to Audio AI
Have you achieved better results on AudioSet or ESC-50? Working on novel audio generation models? Help the community by sharing your verified results.