2M+ human-labeled 10-second YouTube video clips covering 632 audio event classes.
Metric: mAP (mean average precision). Higher is better.
| Rank | Model | Source | Score | Year | Paper |
|---|---|---|---|---|---|
| 1 | BEATs: iterative self-labeling (Chen et al., Microsoft, ICML 2023). mAP 50.6% (0.506) on the AudioSet eval set. From the abstract: "new SOTA mAP 50.6% on AudioSet-2M". | Community | 0.51 | 2023 | Source |
| 2 | AST: Audio Spectrogram Transformer (Gong et al., MIT, INTERSPEECH 2021). mAP 0.485 on the AudioSet eval set, as stated in the abstract. | Community | 0.48 | 2021 | Source |
| 3 | HTS-AT (Chen et al., ICASSP 2022). mAP 0.471 on the AudioSet eval set, reported as outperforming the prior SOTA. The AST paper reports figures ranging from 0.459 to 0.485, so 0.471 exceeds AST's lower figure but not its 0.485. | Community | 0.47 | 2022 | Source |
| 4 | CLAP (Wu et al., ICASSP 2023). mAP 0.428 on the AudioSet eval set. | Community | 0.43 | 2023 | Source |
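
For readers unfamiliar with the metric, here is a minimal sketch of how mAP is commonly computed for multi-label audio tagging: average precision (AP) is computed per class from the ranked clip scores, then averaged over classes. The function names `average_precision` and `mean_average_precision` are illustrative, and skipping classes with no positive clips is an assumption of this sketch. The evaluation scripts behind the scores above may differ in tie handling and class filtering.

```python
from typing import List, Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AP for one class: mean of precision@k over the ranks k that hold a positive clip."""
    # Rank clips by descending score (ties keep their original order).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precision_sum += hits / rank
    # AP is undefined when the class has no positive clips.
    return precision_sum / hits if hits else float("nan")


def mean_average_precision(
    label_matrix: List[List[int]], score_matrix: List[List[float]]
) -> float:
    """Macro-average of per-class AP; rows are clips, columns are classes."""
    num_classes = len(label_matrix[0])
    aps = []
    for c in range(num_classes):
        labels = [row[c] for row in label_matrix]
        scores = [row[c] for row in score_matrix]
        # Assumption of this sketch: classes without positives are skipped.
        if any(labels):
            aps.append(average_precision(labels, scores))
    return sum(aps) / len(aps) if aps else float("nan")


# Toy example: 4 clips, 2 classes (made-up numbers, not AudioSet data).
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]]
toy_map = mean_average_precision(toy_labels, toy_scores)
```

In the toy example, class 0 has positives at ranks 1 and 3 (AP = (1 + 2/3) / 2) and class 1 has positives at ranks 1 and 2 (AP = 1), so `toy_map` is their mean. A score of 0.485 in the table corresponds to this quantity computed over the AudioSet eval set.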