Audio Classification
Classification of audio signals into predefined categories such as music genres, environmental sounds, or speaker identification.
Audio classification assigns labels to audio clips, identifying environmental sounds, music genres, speaker emotions, or acoustic events. The Audio Spectrogram Transformer (AST) and BEATs brought large-scale transformer pretraining to audio, pushing AudioSet tagging performance to roughly 50 mAP. The task powers content moderation, environmental monitoring, and smart device triggers.
History
ESC-50 dataset (Piczak) standardizes environmental sound classification with 50 categories
AudioSet (Google) provides 2M human-labeled 10-second clips spanning 632 sound event categories
PANNs (Kong et al.) achieve strong AudioSet performance with CNN14, establishing a practical baseline
Audio Spectrogram Transformer (AST, Gong et al.) adapts Vision Transformer to audio spectrograms, reaching 45.9 mAP on AudioSet
HTS-AT (Chen et al.) combines hierarchical token-semantic transformers, pushing AudioSet mAP to 47.1
BEATs (Microsoft) introduces iterative self-supervised pretraining with a learned acoustic tokenizer, achieving 50.6 mAP on AudioSet
CLAP (LAION) aligns audio and text in a shared embedding space, enabling zero-shot audio classification
EAT (Efficient Audio Transformer) and M2D push self-supervised audio pretraining with masked spectrogram modeling
Audio foundation models handle classification as one of many downstream tasks alongside captioning, QA, and retrieval
How Audio Classification Works
Audio preprocessing
Raw audio is converted to log-mel spectrograms (128 mel bins, 25ms windows) — treating audio as a 2D image
Patch embedding
The spectrogram is divided into fixed-size patches (e.g., 16x16) and linearly projected to token embeddings
Transformer encoding
Self-attention layers process patch tokens, capturing both local frequency patterns and long-range temporal structure
Classification
A [CLS] token or mean-pooled representation is fed to a classification head: independent sigmoid outputs for multi-label tagging, softmax for single-label classification
Aggregation
For clips longer than the model context, predictions from multiple windows are aggregated (max-pooling or attention-weighted)
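The steps above can be sketched end to end in plain NumPy. The helpers below are illustrative stand-ins, not any model's actual API: a real front end would apply a proper mel filterbank (e.g., torchaudio's MelSpectrogram), and the transformer encoder and classification head are omitted.

```python
import numpy as np

def log_mel_like_spectrogram(audio, win=400, hop=160, n_bins=128):
    # Illustrative stand-in for a log-mel front end (25 ms windows, 10 ms hop
    # at 16 kHz): magnitude STFT plus log compression. A real pipeline would
    # apply a mel filterbank (e.g. torchaudio.transforms.MelSpectrogram).
    window = np.hanning(win)
    frames = [np.log(np.abs(np.fft.rfft(audio[s:s + win] * window))[:n_bins] + 1e-6)
              for s in range(0, len(audio) - win + 1, hop)]
    return np.stack(frames)  # (time, freq) array treated as a 2D image

def patchify(spec, patch=16):
    # Split the spectrogram into non-overlapping 16x16 patches; AST then
    # linearly projects each flattened patch to a token embedding.
    t = (spec.shape[0] // patch) * patch
    f = (spec.shape[1] // patch) * patch
    p = spec[:t, :f].reshape(t // patch, patch, f // patch, patch)
    return p.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def aggregate(window_probs):
    # Clip-level prediction from per-window probabilities via max-pooling,
    # one simple aggregation strategy for clips longer than the model context.
    return np.max(window_probs, axis=0)

rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)   # 1 s of noise at 16 kHz
tokens = patchify(log_mel_like_spectrogram(audio))
print(tokens.shape)                  # 48 patch tokens, 256 values each
```

With one second of 16 kHz audio this yields a 98x128 spectrogram and 48 flattened 16x16 patches, each of which a real model would project to an embedding before self-attention.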
Current Landscape
Audio classification in 2025 has been transformed by the same self-supervised pretraining revolution that reshaped NLP and vision. Vision Transformer-based architectures (AST, BEATs, EAT) treat spectrograms as images and leverage ImageNet pretraining or masked audio modeling. AudioSet remains the central benchmark, but its noisy labels make progress hard to measure precisely. CLAP has opened up zero-shot classification, analogous to CLIP's impact on vision. Production deployments use lighter CNN models (YAMNet, PANNs) for latency-sensitive applications while transformer models handle quality-critical offline classification.
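CLAP-style zero-shot classification reduces to nearest-neighbor search in the shared embedding space: embed the audio, embed one text prompt per candidate label, and pick the label whose text embedding has the highest cosine similarity. A minimal sketch with synthetic embeddings, where a real system would obtain them from CLAP's audio and text encoders:

```python
import numpy as np

def zero_shot_classify(audio_emb, text_embs, labels):
    # Cosine similarity between one audio embedding and a set of text-prompt
    # embeddings in the shared space; the best-matching prompt wins.
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = t @ a
    return labels[int(np.argmax(scores))], scores

labels = ["dog bark", "siren", "rain"]
rng = np.random.default_rng(1)
text_embs = rng.standard_normal((3, 512))             # stand-in text embeddings
audio_emb = text_embs[1] + 0.1 * rng.standard_normal(512)  # near "siren"
pred, _ = zero_shot_classify(audio_emb, text_embs, labels)
print(pred)  # → siren
```

Because the label set is just a list of text prompts, new classes can be added at inference time without retraining.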
Key Challenges
Class imbalance in AudioSet: common sounds (speech, music) have 100x more examples than rare events (gunshots, glass breaking)
Noisy labels: AudioSet annotations are crowd-sourced and contain ~15-20% label noise, capping effective model accuracy
Real-world audio contains overlapping events — a single clip may have speech, music, and traffic simultaneously
Domain shift between AudioSet (YouTube clips) and deployment environments (surveillance, IoT, medical)
Temporal resolution: classifying when events occur within a clip (sound event detection) is harder than clip-level classification
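The overlapping-events challenge above is why clip tagging is usually framed as multi-label classification: each class gets an independent sigmoid, so speech, music, and traffic can all be active at once. A minimal sketch, with illustrative label names and threshold:

```python
import numpy as np

def multilabel_predict(logits, labels, threshold=0.5):
    # Independent per-class sigmoids: each label is accepted or rejected on
    # its own, unlike softmax, which forces a single winner.
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return [lab for lab, p in zip(labels, probs) if p >= threshold]

labels = ["speech", "music", "traffic", "dog bark"]
print(multilabel_predict([2.1, 0.8, 1.5, -3.0], labels))
# → ['speech', 'music', 'traffic']
```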
Quick Recommendations
Best accuracy (AudioSet)
BEATs or EAT
50+ mAP on AudioSet; self-supervised pretraining captures rich audio representations
Zero-shot audio classification
CLAP (LAION) or Whisper-AT
Classify audio with arbitrary text descriptions without task-specific training
Production (lightweight)
PANNs CNN14 or YAMNet
Efficient CNN-based classifiers that run in real-time on edge devices
Environmental sound monitoring
AST fine-tuned on ESC-50 or UrbanSound8K
95%+ accuracy on environmental sound classification benchmarks such as ESC-50
Music classification
MERT or MusicNN
Specialized for music genre, mood, and instrument recognition tasks
What's Next
Expect audio classification to merge into general audio understanding — a single model that classifies, captions, answers questions about, and retrieves audio. Fine-grained temporal event detection (not just 'this clip contains a dog bark' but 'a dog barks at 2.3 seconds for 0.5 seconds') will improve through frame-level models. On-device classification for smart home, wearable, and IoT applications will drive efficient architectures under 5M parameters.
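Fine-grained temporal detection of the kind described above amounts to thresholding per-frame probabilities for a class and grouping contiguous active frames into (onset, duration) spans. A minimal post-processing sketch, with illustrative hop size and threshold:

```python
import numpy as np

def frames_to_events(frame_probs, hop_s=0.1, threshold=0.5):
    # Turn per-frame probabilities for one class into (onset_s, duration_s)
    # event spans: the frame-level view of sound event detection.
    active = np.asarray(frame_probs) >= threshold
    events, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            events.append((start * hop_s, (i - start) * hop_s))
            start = None
    if start is not None:  # event still active at clip end
        events.append((start * hop_s, (len(active) - start) * hop_s))
    return events

probs = [0.1, 0.2, 0.9, 0.8, 0.7, 0.3, 0.1, 0.6, 0.9, 0.2]
print(frames_to_events(probs))  # two events: near 0.2 s and 0.7 s
```

Real systems add median filtering and minimum-duration constraints before reporting events, but the thresholding-and-grouping core looks like this.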
Benchmarks & SOTA
VocalSound
The VocalSound dataset is a collection of over 21,000 crowdsourced audio recordings of non-speech human vocalizations, including laughter, coughs, sneezes, sighs, throat clearing, and sniffs. It was created to improve vocal sound recognition models and contains metadata about the speakers, such as age, gender, native language, and country. The dataset is designed to help researchers develop more robust and accurate systems for tasks like automatic transcription and health monitoring.
No results tracked yet
ESC-50
The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification.
No results tracked yet
GTZAN Genre
The GTZAN Genre dataset is a benchmark collection of 1,000 audio tracks used for music genre classification tasks. It contains 100 tracks for each of 10 genres, with each track being a 30-second .wav file recorded at 22,050 Hz, mono, 16-bit format. This widely used dataset has been instrumental in the development of music information retrieval (MIR) systems, although it has known issues like mislabeled tracks.
No results tracked yet
Speech Commands V2
The Speech Commands V2 dataset is an audio collection of one-second clips covering a core vocabulary for keyword spotting systems. It includes 20 core command words like "yes," "no," and "go," as well as 10 auxiliary words such as "marvin" and "wow," plus background-noise files, recorded by thousands of contributors through crowdsourcing and released under a Creative Commons BY 4.0 license. The dataset has two configurations: v0.01 (51,093 train / 6,799 validation / 3,081 test clips) and v0.02, which adds more words (84,848 train / 9,982 validation / 4,890 test clips).
No results tracked yet
AudioSet
2M+ human-labeled 10-second YouTube video clips covering 632 audio event classes.
No results tracked yet
Related Tasks
Audio-Language Models
Audio-Language Models (ALMs) are a form of artificial intelligence that extend natural language processing (NLP) to the domain of audio, enabling computers to understand, generate, and reason about sounds and speech by integrating audio data with language understanding. Trained on audio-text data, ALMs bridge the gap between acoustic signals and linguistic meaning, allowing for tasks like zero-shot audio recognition, audio captioning, and the creation of generative audio, such as text-to-audio synthesis.
Voice cloning
Voice cloning is a type of audio deepfake technology that uses machine learning to create a digital replica of a specific person's voice, synthesizing spoken audio that mimics their vocal characteristics like pitch and tone. While it has positive uses, such as generating audiobooks or helping people who have lost their voice, it is also used for malicious purposes, including creating convincing scams where fraudsters impersonate individuals.
Automatic Speech Recognition
Automatic Speech Recognition (ASR) is the technology that converts spoken language into written text. ASR systems process audio signals containing human speech and transcribe them into readable text format. These systems use acoustic models, language models, and often neural networks to recognize phonemes, words, and sentences from audio input. ASR is foundational for applications like voice assistants (Siri, Alexa), transcription services, voice-controlled systems, and accessibility tools for the hearing impaired.
Text-to-speech
Text-to-speech (TTS) is technology that converts written text into natural-sounding audio, also known as "read aloud" technology or speech synthesis. It works by analyzing text to understand words, punctuation, and sentence structure, then generating phonetic representations of those words before synthesizing them into a human-like voice output. TTS is a crucial form of assistive technology and a key component of natural language processing, making digital content accessible and improving user interaction in numerous applications.