
Audio Classification

Audio classification identifies what is happening in a sound (music genre, environmental sounds, speaker emotion, language identification) and underpins everything from content moderation to smart home devices. The Audio Spectrogram Transformer (AST) and BEATs brought ImageNet-style transfer learning to audio by treating spectrograms as images, pushing the state of the art to roughly 0.5 mAP on AudioSet's 527-class evaluation set. The paradigm has since shifted toward audio foundation models such as CLAP (contrastive language-audio pretraining) and Whisper's encoder, which provide general-purpose audio representations that transfer to downstream tasks with minimal fine-tuning. The hard problems that remain are fine-grained classification in noisy real-world conditions, rare sound event detection from few examples, and efficient on-device inference for always-listening applications.
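The spectrogram-as-image idea can be sketched in a few lines: compute a log-mel spectrogram, then tile it into square patches the way AST feeds a ViT-style encoder. This is an illustrative NumPy sketch, not AST's actual preprocessing; the parameter values and the crude triangular mel filterbank are assumptions for demonstration.

```python
import numpy as np

def log_mel_spectrogram(wave, sr=16000, n_fft=400, hop=160, n_mels=64):
    # Frame the waveform, window it, and take the magnitude STFT.
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
    # Crude triangular mel filterbank (illustrative, not librosa's).
    mel_pts = np.linspace(0, 2595 * np.log10(1 + sr / 2 / 700), n_mels + 2)
    hz = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        if c > lo:
            fb[m - 1, lo:c] = (np.arange(lo, c) - lo) / (c - lo)
        if hi > c:
            fb[m - 1, c:hi] = (hi - np.arange(c, hi)) / (hi - c)
    return np.log(spec @ fb.T + 1e-10)  # shape: (time, n_mels)

def to_patches(spec, patch=16):
    # AST-style: crop to a multiple of the patch size, then tile into
    # flat patch vectors that a transformer can embed as tokens.
    t, f = (d - d % patch for d in spec.shape)
    s = spec[:t, :f]
    return (s.reshape(t // patch, patch, f // patch, patch)
             .swapaxes(1, 2)
             .reshape(-1, patch * patch))
```

One second of 16 kHz audio yields a (98, 64) spectrogram here, which tiles into 24 patches of 256 values each; AST then linearly projects each patch and classifies from the transformer's output, exactly as a ViT does with image patches.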

2 datasets tracked · 8 results · Canonical metric: mAP

Canonical benchmark

AudioSet

2M+ human-labeled 10-second clips drawn from YouTube videos, covering an ontology of 632 audio event classes.

Primary metric: mAP
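AudioSet is a multi-label task, so results are reported as mAP: for each class, the precision of the ranked predictions is averaged over the ranks of the true positives, and these per-class average precisions are then macro-averaged. A minimal NumPy sketch of the metric (function names are my own):

```python
import numpy as np

def average_precision(y_true, scores):
    # Rank examples by score; AP is the mean precision at each
    # rank where a true positive appears.
    order = np.argsort(-scores)
    hits = y_true[order]
    ranks = np.arange(1, len(hits) + 1)
    prec_at_hits = np.cumsum(hits)[hits == 1] / ranks[hits == 1]
    return prec_at_hits.mean() if hits.any() else 0.0

def mean_average_precision(Y, S):
    # Macro-average AP over classes (columns), as on AudioSet.
    return float(np.mean([average_precision(Y[:, c], S[:, c])
                          for c in range(Y.shape[1])]))
```

Because every class contributes equally regardless of frequency, mAP rewards models that rank rare sound events well, which is why AudioSet scores around 0.5 still represent strong systems on a 527-class problem.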

Top 10

Leading models on AudioSet.

| Rank | Model  | mAP   | Year | Source |
|------|--------|-------|------|--------|
| 1    | BEATs  | 0.506 | 2023 | paper  |
| 2    | AST    | 0.485 | 2021 | paper  |
| 3    | HTS-AT | 0.471 | 2022 | paper  |
| 4    | CLAP   | 0.428 | 2023 | paper  |


All datasets

2 datasets tracked for this task.


