AudioSet.

Name: AudioSet Benchmark Results
Creator: Unknown
License: https://creativecommons.org/licenses/by/4.0/

More than 2M human-labeled 10-second clips drawn from YouTube videos, organized under an ontology of 632 audio event classes (527 of which are used as labels in the released data).

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

Only 4 models are listed for this benchmark.

mAP (mean average precision)

Higher is better

Rank	Model	Notes	Source	Score	Year	Paper
1	BEATs	Iterative self-labeling (Chen et al., Microsoft, ICML 2023). mAP 50.6% on the AudioSet eval set. From the abstract: "new SOTA mAP 50.6% on AudioSet-2M".	Community	0.506	2023	Source
2	AST	Audio Spectrogram Transformer (Gong et al., MIT, INTERSPEECH 2021). mAP 0.485 on the AudioSet eval set, as stated in the abstract.	Community	0.485	2021	Source
3	HTS-AT	Chen et al., ICASSP 2022. mAP 0.471 on the AudioSet eval set, reported as outperforming the prior SOTA. The AST paper reports both 0.459 and 0.485, so HTS-AT is ahead of the former but behind the latter.	Community	0.471	2022	Source
4	CLAP	Wu et al., ICASSP 2023. mAP 0.428 on the AudioSet eval set.	Community	0.428	2023	Source
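
For readers unfamiliar with the metric, the mAP scores in the table can be read as the mean, over classes, of each class's average precision. The block below is a minimal sketch of that calculation using NumPy, run on a made-up two-class toy example. The function names `average_precision` and `mean_average_precision` are illustrative, not taken from any of the papers listed, and the official evaluation scripts may differ in details such as tie handling or how classes without positives are treated.

```python
import numpy as np


def average_precision(y_true, y_score):
    """Average precision for one class.

    Ranks clips by descending score and averages the precision measured
    at the rank of each positive clip. Returns NaN if the class has no
    positive clips (an assumption of this sketch).
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    order = np.argsort(-y_score, kind="stable")
    hits = y_true[order].astype(float)
    n_pos = hits.sum()
    if n_pos == 0:
        return float("nan")
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / n_pos)


def mean_average_precision(Y_true, Y_score):
    """Mean of per-class AP over the columns of (clips x classes) arrays."""
    Y_true = np.asarray(Y_true)
    Y_score = np.asarray(Y_score)
    aps = [
        average_precision(Y_true[:, c], Y_score[:, c])
        for c in range(Y_true.shape[1])
    ]
    # Classes with no positives (NaN) are skipped in the mean.
    return float(np.nanmean(aps))


# Toy data: 4 clips, 2 classes (values invented for illustration only).
Y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
Y_score = np.array([[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.1]])
print(round(mean_average_precision(Y_true, Y_score), 4))  # 0.9167
```

In the toy example, class 0 has positives at ranks 1 and 3 (AP = (1 + 2/3) / 2) and class 1 has positives at ranks 1 and 2 (AP = 1), which averages to about 0.917. Because AudioSet is multi-label, each clip can count as a positive for several classes at once.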

§ 03 · Lineage

AudioSet in context.

Predecessors (1)

saturated · 2015-01

ESC-50

AudioSet replaced ESC-50 as the primary audio classification benchmark: 527 classes versus 50, about 2M clips versus 2K, and a hierarchical ontology. Its scale and coverage made it the ImageNet analogue for audio, and ESC-50 became a probe task for pretrained representations.

This benchmark (1)

saturating · 2017-03

AudioSet

Successors (1)

active · 2020-01

Clotho

Clotho shifted the evaluation task from classification (what sounds are here?) to captioning (describe these sounds in a sentence). This scope shift was enabled by the growing capability of audio encoders trained on AudioSet.

§ 04 · Submit a result

Add to the leaderboard.
