Audio Classification
Audio classification identifies what's happening in a sound — music genre, environmental sounds, speaker emotion, language identification — and underpins everything from content moderation to smart home devices. Audio Spectrogram Transformer (AST) and BEATs brought ImageNet-style transfer learning to audio by treating spectrograms as images, reaching state-of-the-art mean average precision (mAP ≈ 0.48) on AudioSet's 527-class evaluation and over 95% accuracy on smaller benchmarks like ESC-50. The paradigm shifted with audio foundation models like CLAP (contrastive language-audio pretraining) and Whisper's encoder, which provide general-purpose audio representations that transfer to downstream tasks with minimal fine-tuning. The hard problems remain fine-grained classification in noisy real-world conditions, rare sound event detection with few examples, and efficient on-device inference for always-listening applications.
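The CLAP-style zero-shot step reduces to a cosine-similarity comparison between one audio embedding and an embedding per candidate label. A minimal sketch of that scoring step, with random vectors standing in for real encoder outputs — the function names and the temperature value are illustrative, not CLAP's actual API:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize embeddings so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_scores(audio_emb, text_embs, temperature=0.07):
    # Cosine similarity between one audio embedding and N label embeddings,
    # softmax-scaled into a probability distribution over the labels.
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_embs)
    logits = t @ a / temperature
    logits -= logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Stand-in embeddings: a real CLAP model would produce these from its
# audio encoder (spectrogram in) and text encoder (label prompt in).
rng = np.random.default_rng(0)
dim = 512
labels = ["dog bark", "siren", "acoustic guitar"]
text_embs = rng.normal(size=(len(labels), dim))

# Fake an audio clip whose embedding lies close to the "siren" prompt.
audio_emb = text_embs[1] + 0.1 * rng.normal(size=dim)

probs = zero_shot_scores(audio_emb, text_embs)
print(labels[int(np.argmax(probs))])
```

Because the label set is just a list of text prompts, swapping in new classes requires no retraining — only new text embeddings, which is what makes this formulation attractive for rare sound events.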
AudioSet
2M+ human-labeled 10-second clips drawn from YouTube videos, annotated against an ontology of 632 audio event classes (527 of which appear in the released labels).
Top 10
Leading models on AudioSet.
What were you looking for on Audio Classification?
Didn't find the model, metric, or dataset you needed? Tell us in one line. We read every message and reply within 48 hours.
All datasets
2 datasets tracked for this task.
Related tasks
Other tasks in Audio.