ESC-50


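2,000 five-second environmental audio recordings organized into 50 classes (40 clips per class) across five major categories: animals, natural soundscapes and water sounds, human non-speech sounds, interior/domestic sounds, and exterior/urban noises.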

Benchmark Stats

Models: 4
Papers: 4
Metrics: 1

SOTA History

Only 4 models on this benchmark


Metric: accuracy (%), higher is better

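Rank  Model   Source     Score  Year  Paper
1     BEATs   Community  98.1   2023  Chen et al., ICML 2023
2     HTS-AT  Community  97.0   2022  Chen et al., ICASSP 2022
3     AST     Community  95.6   2021  Gong et al., INTERSPEECH 2021
4     CLAP    Community  93.7   2023  Wu et al., ICASSP 2023

Notes

1. BEATs: iterative self-labeling (Chen et al., Microsoft, ICML 2023). 98.1% accuracy under 5-fold cross-validation on ESC-50, as stated in the abstract.
2. HTS-AT: Chen et al., ICASSP 2022. 97.0 ± 0.2% accuracy under 5-fold cross-validation on ESC-50, verified from the GitHub repo and search results.
3. AST: Audio Spectrogram Transformer (Gong et al., MIT, INTERSPEECH 2021). 95.6% accuracy under 5-fold cross-validation on ESC-50, as stated in the abstract.
4. CLAP: Contrastive Language-Audio Pretraining (Wu et al., ICASSP 2023). Zero-shot classification accuracy on ESC-50. Unlike the three entries above, this is not a 5-fold cross-validation result, so the score is not directly comparable.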
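Most scores above are reported as accuracy under 5-fold cross-validation. As a rough illustration of how such a number can be computed, here is a minimal sketch. It assumes each prediction is tagged with the fold its clip belongs to, and that the prediction came from a model trained on the other folds. The function name `five_fold_accuracy`, the record layout, and the use of the sample standard deviation for the "±" spread are assumptions made for this example, not something taken from the papers listed.

```python
from collections import defaultdict
from statistics import mean, stdev


def five_fold_accuracy(records):
    """Summarize cross-validation accuracy.

    records: iterable of (fold, true_label, predicted_label) tuples.
    Each prediction is assumed to come from a model that was trained
    on all folds except the one the clip belongs to (hypothetical setup).
    Returns (per-fold accuracy in %, mean accuracy in %, spread in %).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for fold, y_true, y_pred in records:
        total[fold] += 1
        correct[fold] += int(y_true == y_pred)

    # Accuracy of each held-out fold, as a percentage.
    per_fold = {f: 100.0 * correct[f] / total[f] for f in sorted(total)}
    scores = list(per_fold.values())

    # Sample standard deviation across folds (assumption: one common
    # way to produce a "mean ± std" figure; papers may differ).
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return per_fold, mean(scores), spread


# Toy data with made-up labels and only two folds, for illustration.
toy = [
    (1, "dog", "dog"), (1, "rain", "rain"),
    (2, "dog", "dog"), (2, "rain", "dog"),
]
per_fold, avg, spread = five_fold_accuracy(toy)
print(per_fold, avg, round(spread, 2))
```

With real ESC-50 predictions, the records would span five folds of 400 clips each (2,000 clips in total), and the reported score would be the mean of the five per-fold accuracies.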
