Audio Classification Benchmark
The Sound Classification Challenge
AudioSet is the ImageNet of audio: 2M+ ten-second clips labeled against an ontology of 632 sound classes. Understanding how models learn to hear is the foundation of audio AI.
AudioSet Stats
| Stat | Value |
|---|---|
| Total Clips (10s each) | 2,084,320 |
| Current SOTA (mAP) | 0.498 |
| Sound Classes | 632 |
AudioSet Leaderboard
Detailed comparison on the AudioSet evaluation set. All models take 10-second input clips represented as mel spectrogram features.
| Rank | Model | mAP | mAUC | Params | Pretraining | Year |
|---|---|---|---|---|---|---|
| #1 | BEATs iter3+ AS2M (Microsoft) | 0.498 | 0.975 | 90M | AudioSet-2M self-supervised | 2023 |
| #2 | AST (AudioSet + ImageNet) (MIT/IBM) | 0.485 | 0.972 | 87M | ImageNet-21k + AudioSet | 2021 |
| #3 | EfficientAT-M2 (TU Munich) | 0.476 | 0.971 | 30M | ImageNet + AudioSet | 2023 |
| #4 | HTS-AT (ByteDance) | 0.471 | 0.970 | 31M | AudioSet | 2022 |
| #5 | CLAP (HTSAT-base) (LAION/Microsoft) | 0.463 | 0.968 | 86M | LAION-Audio-630K | 2023 |
| #6 | PANNs CNN14 (ByteDance) | 0.431 | 0.963 | 81M | AudioSet from scratch | 2020 |
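The mAP and mAUC columns are macro averages over classes: a per-class score is computed from the model's ranking of clips, then averaged. The sketch below shows one common way to compute them in plain Python. The function names (`average_precision`, `roc_auc`, `macro_average`) and the toy scores are hypothetical, ties between scores get no special handling in the AP calculation, and the exact evaluation scripts behind the leaderboard numbers may differ.

```python
import math

def average_precision(scores, labels):
    """AP for one class: mean of precision measured at each positive clip's rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else float("nan")

def roc_auc(scores, labels):
    """AUC for one class: probability that a positive clip outscores a negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_average(metric, score_matrix, label_matrix):
    """Apply a per-class metric to every column and average the defined values."""
    n_classes = len(score_matrix[0])
    values = []
    for c in range(n_classes):
        col_scores = [row[c] for row in score_matrix]
        col_labels = [row[c] for row in label_matrix]
        value = metric(col_scores, col_labels)
        if not math.isnan(value):  # skip classes with no positives (or no negatives)
            values.append(value)
    return sum(values) / len(values)

# Toy multi-label example: 4 clips, 2 classes (rows are clips, columns are classes).
toy_scores = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.6], [0.3, 0.4]]
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]

toy_map = macro_average(average_precision, toy_scores, toy_labels)
toy_mauc = macro_average(roc_auc, toy_scores, toy_labels)
```

Because AudioSet is multi-label, each clip contributes to every class column independently, which is why both metrics are computed per class before averaging.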
BEATs iter3+ AS2M
Microsoft (2023)
Iterative audio pre-training with discrete tokenization
Audio Tokenizer + Transformer, 90M params
AST (AudioSet + ImageNet)
MIT/IBM (2021)
First pure attention model for audio, no convolutions
Vision Transformer (ViT-B/16), 87M params
EfficientAT-M2
TU Munich (2023)
Best efficiency, real-time capable on edge devices
EfficientNet + Mel-adapted, 30M params
HTS-AT
ByteDance (2022)
Efficient hierarchical structure, good compute/accuracy tradeoff
Swin Transformer + Token-Semantic, 31M params
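All of the models above consume mel spectrogram features from 10-second clips. As a rough illustration of what that input looks like, the sketch below computes the mel filter center frequencies and the number of time frames for one clip. The settings are assumptions chosen for illustration (16 kHz sample rate, 25 ms window, 10 ms hop, 128 mel bins, HTK-style mel formula), not values confirmed for any particular model listed here, and the helper names are hypothetical.

```python
import math

def hz_to_mel(hz: float) -> float:
    """HTK-style mel scale: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(n_mels: int, f_min: float, f_max: float) -> list:
    """Center frequencies (Hz) of n_mels filters spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]

def num_frames(clip_seconds: float, sample_rate: int, win_ms: float, hop_ms: float) -> int:
    """Number of full analysis windows that fit in the clip (no padding)."""
    n_samples = int(clip_seconds * sample_rate)
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    return 1 + (n_samples - win) // hop

# Assumed settings, for illustration only.
centers = mel_center_frequencies(n_mels=128, f_min=0.0, f_max=8000.0)
frames = num_frames(clip_seconds=10.0, sample_rate=16000, win_ms=25.0, hop_ms=10.0)
input_shape = (frames, len(centers))  # time frames x mel bins
```

Under these assumptions a single clip becomes a matrix of about a thousand frames by 128 mel bins, which transformer models then split into patches and CNN models treat as a one-channel image.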
ESC-50 Benchmark
| Rank | Model | Accuracy (%) | Params | Pretraining | Year |
|---|---|---|---|---|---|
| #1 | BEATs (Microsoft) | 98.1 | 90M | AudioSet-2M | 2023 |
| #2 | SSAST (MIT/IBM) | 96.8 | 89M | AudioSet + LibriSpeech | 2022 |
| #3 | CLAP (LAION) | 96.7 | 86M | LAION-Audio-630K | 2023 |
| #4 | AST (MIT/IBM) | 95.6 | 87M | ImageNet + AudioSet | 2021 |
| #5 | PANNs CNN14 (ByteDance) | 94.7 | 81M | AudioSet | 2020 |
| #6 | Wav2Vec 2.0 + Linear (Meta) | 92.3 | 317M | LibriLight 60k hours | 2020 |
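ESC-50 ships with five predefined folds, and accuracy is commonly reported as the mean over those folds, each held out in turn. The sketch below shows that averaging step under the assumption that the table follows this protocol, which the page does not state. The function names and the toy predictions are hypothetical.

```python
def fold_accuracy(predictions, targets):
    """Percentage of clips in one held-out fold whose predicted class is correct."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and equal length")
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    return 100.0 * correct / len(targets)

def cross_validated_accuracy(folds):
    """Mean of per-fold accuracies; `folds` is a list of (predictions, targets) pairs."""
    per_fold = [fold_accuracy(preds, tgts) for preds, tgts in folds]
    return sum(per_fold) / len(per_fold)

# Toy example with two tiny folds (real ESC-50 folds hold 400 clips each).
toy_folds = [
    ([1, 2, 3, 4], [1, 2, 3, 0]),  # 3 of 4 correct
    ([0, 0], [0, 0]),              # 2 of 2 correct
]
toy_accuracy = cross_validated_accuracy(toy_folds)
```

Averaging per-fold accuracies rather than pooling all predictions gives each fold equal weight, which matters little for ESC-50 because its folds are the same size.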
Contribute to Audio Classification
Have you achieved better results on AudioSet or ESC-50? Benchmarked a new architecture? Help the community by sharing your verified results.