
Audio Classification

Audio classification identifies what is happening in a sound (music genre, environmental sounds, speaker emotion, language identification) and underpins everything from content moderation to smart home devices. The Audio Spectrogram Transformer (AST) and BEATs brought ImageNet-style transfer learning to audio by treating spectrograms as images, pushing the state of the art to roughly 0.5 mAP on AudioSet's 527-class evaluation set. The paradigm has since shifted toward audio foundation models such as CLAP (contrastive language-audio pretraining) and Whisper's encoder, which provide general-purpose audio representations that transfer to downstream tasks with minimal fine-tuning. The hard problems that remain are fine-grained classification in noisy real-world conditions, rare sound event detection from few examples, and efficient on-device inference for always-listening applications.
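The spectrogram-as-image idea can be sketched in a few lines: compute a log-mel spectrogram, then tile it into square patches the way AST feeds a ViT-style encoder. This is an illustrative NumPy sketch, not AST's actual preprocessing; the parameter values and the crude triangular mel filterbank are assumptions for demonstration.

```python
import numpy as np

def log_mel_spectrogram(wave, sr=16000, n_fft=400, hop=160, n_mels=64):
    # Frame the waveform, window it, and take the magnitude STFT.
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
    # Crude triangular mel filterbank (illustrative, not librosa's).
    mel_pts = np.linspace(0, 2595 * np.log10(1 + sr / 2 / 700), n_mels + 2)
    hz = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        if c > lo:
            fb[m - 1, lo:c] = (np.arange(lo, c) - lo) / (c - lo)
        if hi > c:
            fb[m - 1, c:hi] = (hi - np.arange(c, hi)) / (hi - c)
    return np.log(spec @ fb.T + 1e-10)  # shape: (time, n_mels)

def to_patches(spec, patch=16):
    # AST-style: crop to a multiple of the patch size, then tile into
    # flat patch vectors that a transformer can embed as tokens.
    t, f = (d - d % patch for d in spec.shape)
    s = spec[:t, :f]
    return (s.reshape(t // patch, patch, f // patch, patch)
             .swapaxes(1, 2)
             .reshape(-1, patch * patch))
```

One second of 16 kHz audio yields a (98, 64) spectrogram here, which tiles into 24 patches of 256 values each; AST then linearly projects each patch and classifies from the transformer's output, exactly as a ViT does with image patches.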

2 datasets tracked · 8 results · Canonical metric: mAP

Canonical benchmark

AudioSet

2M+ human-labeled 10-second clips drawn from YouTube videos, covering an ontology of 632 audio event classes.

Primary metric: mAP
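AudioSet is a multi-label task, so results are reported as mAP: for each class, the precision of the ranked predictions is averaged over the ranks of the true positives, and these per-class average precisions are then macro-averaged. A minimal NumPy sketch of the metric (function names are my own):

```python
import numpy as np

def average_precision(y_true, scores):
    # Rank examples by score; AP is the mean precision at each
    # rank where a true positive appears.
    order = np.argsort(-scores)
    hits = y_true[order]
    ranks = np.arange(1, len(hits) + 1)
    prec_at_hits = np.cumsum(hits)[hits == 1] / ranks[hits == 1]
    return prec_at_hits.mean() if hits.any() else 0.0

def mean_average_precision(Y, S):
    # Macro-average AP over classes (columns), as on AudioSet.
    return float(np.mean([average_precision(Y[:, c], S[:, c])
                          for c in range(Y.shape[1])]))
```

Because every class contributes equally regardless of frequency, mAP rewards models that rank rare sound events well, which is why AudioSet scores around 0.5 still represent strong systems on a 527-class problem.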

Top 10

Leading models on AudioSet.

| Rank | Model  | mAP   | Year | Source |
|------|--------|-------|------|--------|
| 1    | BEATs  | 0.506 | 2023 | paper  |
| 2    | AST    | 0.485 | 2021 | paper  |
| 3    | HTS-AT | 0.471 | 2022 | paper  |
| 4    | CLAP   | 0.428 | 2023 | paper  |


All datasets

2 datasets tracked for this task.


