Codesota · Benchmark · AudioSet

AudioSet.

Over 2 million human-labeled 10-second clips drawn from YouTube videos, covering an ontology of 632 audio event classes.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

Only 4 models on this benchmark.
Help build the community leaderboard: submit your model results.

mAP (mean average precision) · Higher is better
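AudioSet tagging is multi-label, so the headline metric is mAP: for each class, clips are ranked by score and average precision is computed over that ranking; the per-class APs are then averaged. A minimal sketch of the computation (toy scores and labels are hypothetical, not taken from any model on this leaderboard):

```python
def average_precision(scores, labels):
    """AP for one class: mean of precision@k at each rank k where a positive occurs."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for k, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            total += hits / k  # precision at this rank
    return total / hits if hits else 0.0

def mean_average_precision(score_matrix, label_matrix):
    """mAP: average the per-class AP over all classes (columns)."""
    n_classes = len(score_matrix[0])
    aps = [average_precision([row[c] for row in score_matrix],
                             [row[c] for row in label_matrix])
           for c in range(n_classes)]
    return sum(aps) / len(aps)

# Toy example: 4 clips, 2 classes (say "Speech" and "Music"), hypothetical scores.
scores = [[0.9, 0.1], [0.6, 0.8], [0.3, 0.7], [0.2, 0.4]]
labels = [[1, 0],     [1, 1],     [0, 1],     [0, 0]]
print(mean_average_precision(scores, labels))  # → 1.0 (every positive ranked above every negative)
```

Note that mAP weights each class equally, so rare AudioSet classes count as much as "Speech" or "Music"; this is why it is the standard metric for such a long-tailed label set.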

| Rank | Model  | Source    | Score (mAP) | Year | Paper  |
|------|--------|-----------|-------------|------|--------|
| 1    | BEATs  | Community | 0.51        | 2023 | Source |
| 2    | AST    | Community | 0.48        | 2021 | Source |
| 3    | HTS-AT | Community | 0.47        | 2022 | Source |
| 4    | CLAP   | Community | 0.43        | 2023 | Source |

Notes:
- BEATs: iterative self-labeling (Chen et al., Microsoft, ICML 2023). The abstract reports a new SOTA mAP of 50.6% on AudioSet-2M.
- AST: Audio Spectrogram Transformer (Gong et al., MIT, INTERSPEECH 2021). mAP 0.485 on the AudioSet eval set, as reported in the abstract.
- HTS-AT: Chen et al., ICASSP 2022. mAP 0.471 on the AudioSet eval set, surpassing AST's single-model result of 0.459 (AST's 0.485 used an ensemble).
- CLAP: Wu et al., ICASSP 2023. mAP 0.428 on the AudioSet eval set.
§ 03 · Lineage

AudioSet in context.

See full audio understanding benchmarks lineage →
This benchmark (1)
- AudioSet · saturating · 2017-03

Successors (1)
- Clotho · active · 2020-01
  Clotho shifted the evaluation task from classification (what sounds are here?) to captioning (describe these sounds in a sentence), a scope shift enabled by the growing capability of audio encoders trained on AudioSet.
§ 04 · Submit a result

Add to the leaderboard.
