Audio Classification Benchmark

The Sound Classification Challenge

AudioSet is the ImageNet of audio: more than 2 million 10-second clips labeled against an ontology of 632 sound classes. Understanding how models learn to hear is the foundation of audio AI.

AudioSet Stats

Total clips (10 s each): 2,084,320
Current SOTA (mAP): 0.498
Sound classes: 632
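
To give a sense of scale, the clip count can be converted into total listening time. This is a minimal sketch that assumes every clip runs the nominal 10 seconds stated above; the constant names are illustrative.

```python
# Figures copied from the stats above; the 10 s duration is nominal.
TOTAL_CLIPS = 2_084_320
CLIP_SECONDS = 10

total_seconds = TOTAL_CLIPS * CLIP_SECONDS
total_hours = total_seconds / 3600
total_days = total_hours / 24

print(f"{total_hours:,.0f} hours (~{total_days:.0f} days) of audio")
# prints "5,790 hours (~241 days) of audio"
```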

AudioSet Leaderboard

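Detailed comparison on the AudioSet evaluation set. All models take 10-second input clips with mel spectrogram features. mAP is mean average precision and mAUC is mean area under the ROC curve, both averaged over classes.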
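
As a rough illustration of what "10-second clips with mel spectrogram features" means in terms of input size, the sketch below counts spectrogram frames and places mel filter centers. The 16 kHz sample rate, 25 ms window, 10 ms hop, 128 mel bins and the HTK-style mel formula are all assumptions chosen for the example; this page does not state the front-end settings of any listed model.

```python
import math

SAMPLE_RATE = 16_000  # assumed; not stated on this page
CLIP_SECONDS = 10
WIN = 400             # 25 ms window at 16 kHz (assumed)
HOP = 160             # 10 ms hop at 16 kHz (assumed)
N_MELS = 128          # assumed number of mel bins


def hz_to_mel(freq_hz):
    """Convert Hz to mels using the HTK-style formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def num_frames(num_samples, win=WIN, hop=HOP):
    """Number of full analysis windows that fit, with no padding."""
    return 1 + (num_samples - win) // hop


def mel_center_frequencies(n_mels=N_MELS, f_min=0.0, f_max=SAMPLE_RATE / 2):
    """Center frequencies (Hz) of filters spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * i) for i in range(1, n_mels + 1)]


frames = num_frames(SAMPLE_RATE * CLIP_SECONDS)
centers = mel_center_frequencies()
print(frames, "frames x", len(centers), "mel bins")
# prints "998 frames x 128 mel bins"
```

Under these assumptions each clip becomes a roughly 998 x 128 time-frequency image, which is why image architectures transfer to audio so readily.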

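| Rank | Model | Organization | mAP | mAUC | Params | Pretraining | Year |
|------|-------|--------------|-----|------|--------|-------------|------|
| #1 | BEATs iter3+ AS2M | Microsoft | 0.498 | 0.975 | 90M | AudioSet-2M self-supervised | 2023 |
| #2 | AST (AudioSet + ImageNet) | MIT/IBM | 0.485 | 0.972 | 87M | ImageNet-21k + AudioSet | 2021 |
| #3 | EfficientAT-M2 | TU Munich | 0.476 | 0.971 | 30M | ImageNet + AudioSet | 2023 |
| #4 | HTS-AT | ByteDance | 0.471 | 0.970 | 31M | AudioSet | 2022 |
| #5 | CLAP (HTSAT-base) | LAION/Microsoft | 0.463 | 0.968 | 86M | LAION-Audio-630K | 2023 |
| #6 | PANNs CNN14 | ByteDance | 0.431 | 0.963 | 81M | AudioSet from scratch | 2020 |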
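
The mAP and mAUC figures in the leaderboard are per-class metrics averaged over all classes, since each AudioSet clip can carry several labels at once. The sketch below shows one common way to compute them in plain Python on a toy two-class example. The function names and the toy data are illustrative, and published results may differ in details such as tie handling or interpolation.

```python
def average_precision(labels, scores):
    """AP for one class: mean precision at the rank of each positive clip."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("class has no positive clips")
    return sum(precisions) / len(precisions)


def roc_auc(labels, scores):
    """AUC for one class: share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        raise ValueError("class needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_average(metric, label_matrix, score_matrix):
    """Apply a per-class metric to each column and average the results."""
    num_classes = len(label_matrix[0])
    per_class = [
        metric([row[c] for row in label_matrix], [row[c] for row in score_matrix])
        for c in range(num_classes)
    ]
    return sum(per_class) / num_classes


# Toy data: 4 clips x 2 classes (rows are clips, columns are classes).
labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
scores = [[0.9, 0.2], [0.8, 0.7], [0.7, 0.9], [0.1, 0.3]]

m_ap = macro_average(average_precision, labels, scores)
m_auc = macro_average(roc_auc, labels, scores)
print(f"mAP={m_ap:.3f} mAUC={m_auc:.3f}")
# prints "mAP=0.917 mAUC=0.875"
```

Because mAP rewards ranking positives at the very top while mAUC counts every correctly ordered pair, mAUC values cluster near 1.0 and separate models far less than mAP does, which matches the narrow mAUC spread in the leaderboard.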

BEATs iter3+ AS2M

Microsoft (2023)
mAP: 0.498

Iterative audio pre-training with discrete tokenization

Architecture: Audio Tokenizer + Transformer
Parameters: 90M

AST (AudioSet + ImageNet)

MIT/IBM (2021)
mAP: 0.485

First pure attention model for audio, no convolutions

Architecture: Vision Transformer (ViT-B/16)
Parameters: 87M

EfficientAT-M2

TU Munich (2023)
mAP: 0.476

Best efficiency, real-time capable on edge devices

Architecture: EfficientNet + Mel-adapted
Parameters: 30M

HTS-AT

ByteDance (2022)
mAP: 0.471

Efficient hierarchical structure, good compute/accuracy tradeoff

Architecture: Swin Transformer + Token-Semantic
Parameters: 31M
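
One rough way to check the efficiency claims for the four models described above is to divide mAP by parameter count. The sketch below does this with the figures from this page. Parameter count is only a crude proxy for compute, and latency or FLOPs are not reported here, so the ranking is indicative only; the helper name is illustrative.

```python
# (name, mAP, parameters in millions), copied from the model summaries above.
MODELS = [
    ("BEATs iter3+ AS2M", 0.498, 90),
    ("AST (AudioSet + ImageNet)", 0.485, 87),
    ("EfficientAT-M2", 0.476, 30),
    ("HTS-AT", 0.471, 31),
]


def map_per_million_params(entry):
    """Crude efficiency score: mAP divided by parameter count in millions."""
    _, m_ap, params_m = entry
    return m_ap / params_m


ranked = sorted(MODELS, key=map_per_million_params, reverse=True)
for name, m_ap, params_m in ranked:
    score = map_per_million_params((name, m_ap, params_m))
    print(f"{name:<28} mAP={m_ap:.3f} params={params_m}M mAP/M={score:.4f}")
```

By this measure EfficientAT-M2 and HTS-AT come out well ahead: each gives up less than 0.03 mAP against BEATs while using about a third of the parameters.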

ESC-50 Benchmark

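| Rank | Model | Organization | Accuracy (%) | Params | Pretraining | Year |
|------|-------|--------------|--------------|--------|-------------|------|
| #1 | BEATs | Microsoft | 98.1 | 90M | AudioSet-2M | 2023 |
| #2 | SSAST | MIT/IBM | 96.8 | 89M | AudioSet + LibriSpeech | 2022 |
| #3 | CLAP | LAION | 96.7 | 86M | LAION-Audio-630K | 2023 |
| #4 | AST | MIT/IBM | 95.6 | 87M | ImageNet + AudioSet | 2021 |
| #5 | PANNs CNN14 | ByteDance | 94.7 | 81M | AudioSet | 2020 |
| #6 | Wav2Vec 2.0 + Linear | Meta | 92.3 | 317M | LibriLight 60k hours | 2020 |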
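
ESC-50 accuracy is conventionally reported as the mean over the dataset's five predefined cross-validation folds of 400 clips each. The sketch below shows that averaging step. Whether every entry above follows this protocol is an assumption, and the per-fold counts are hypothetical values made up for the example, not results from any listed model.

```python
FOLD_SIZE = 400  # ESC-50: 2,000 clips split into 5 folds of 400

# Hypothetical numbers of correct predictions on each held-out fold.
correct_per_fold = [392, 390, 395, 388, 393]


def fold_accuracies(correct_counts, fold_size=FOLD_SIZE):
    """Accuracy in percent for each held-out fold."""
    return [100.0 * correct / fold_size for correct in correct_counts]


def mean(values):
    return sum(values) / len(values)


accuracies = fold_accuracies(correct_per_fold)
print("per-fold accuracy (%):", accuracies)
print(f"mean accuracy: {mean(accuracies):.1f}%")
# prints "mean accuracy: 97.9%"
```

With only 400 clips per fold, a single misclassified clip moves fold accuracy by 0.25 points, so small gaps between top entries should be read with care.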

Contribute to Audio Classification

Have you achieved better results on AudioSet or ESC-50? Benchmarked a new architecture? Help the community by sharing your verified results.