
Audio Classification

Classification of audio signals into predefined categories such as music genres, environmental sounds, or speaker identification.


Audio classification assigns labels to audio clips, identifying environmental sounds, music genres, speaker emotions, or acoustic events. The Audio Spectrogram Transformer (AST) and BEATs brought large-scale transformer pretraining to audio, pushing AudioSet performance past 50 mAP and ESC-50 accuracy above 95%. The task powers content moderation, environmental monitoring, and smart device triggers.

History

2014

ESC-50 dataset (Piczak) standardizes environmental sound classification with 50 categories

2017

AudioSet (Google) provides 2M human-labeled 10-second clips spanning 632 sound event categories

2019

PANNs (Kong et al.) achieve strong AudioSet performance with CNN14, establishing a practical baseline

2021

Audio Spectrogram Transformer (AST, Gong et al.) adapts Vision Transformer to audio spectrograms, reaching 45.9 mAP on AudioSet

2022

HTS-AT (Chen et al.) introduces a hierarchical token-semantic audio transformer, pushing AudioSet mAP to 47.1

2023

BEATs (Microsoft) introduces self-supervised audio pretraining with acoustic tokenizers, achieving 50.6 mAP on AudioSet

2023

CLAP (LAION) aligns audio and text in a shared embedding space, enabling zero-shot audio classification

2024

EAT (Efficient Audio Transformer) and M2D push self-supervised audio pretraining with masked spectrogram modeling

2025

Audio foundation models handle classification as one of many downstream tasks alongside captioning, QA, and retrieval

How Audio Classification Works

1. Audio preprocessing

Raw audio is converted to log-mel spectrograms (128 mel bins, 25 ms windows), treating audio as a 2D image

2. Patch embedding

The spectrogram is divided into fixed-size patches (e.g., 16x16) and linearly projected to token embeddings

3. Transformer encoding

Self-attention layers process patch tokens, capturing both local frequency patterns and long-range temporal structure

4. Classification

A [CLS] token or mean-pooled representation is fed to a classification head (sigmoid over each class for multi-label tagging, softmax for single-label tasks)

5. Aggregation

For clips longer than the model context, predictions from multiple windows are aggregated (max-pooling or attention-weighted); a minimal end-to-end sketch of these five steps follows the list
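
The five steps above map fairly directly onto code. The following is a minimal, illustrative PyTorch/torchaudio sketch, not the published AST implementation: the class name SpectrogramClassifier, the 256-dim/4-layer transformer, the 10-second window, and the omission of positional embeddings are all simplifying assumptions.

```python
# Illustrative AST-style pipeline sketch; hyperparameters are assumptions, not the AST config.
import torch
import torch.nn as nn
import torchaudio

SAMPLE_RATE, N_MELS = 16000, 128

# Step 1: waveform -> log-mel spectrogram (25 ms window, 10 ms hop at 16 kHz).
melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=400, hop_length=160, n_mels=N_MELS
)

def preprocess(waveform: torch.Tensor) -> torch.Tensor:
    """Mono waveform (samples,) -> log-mel spectrogram (n_mels, frames)."""
    return torch.log(melspec(waveform) + 1e-6)

class SpectrogramClassifier(nn.Module):
    def __init__(self, num_classes: int, embed_dim: int = 256, depth: int = 4):
        super().__init__()
        # Step 2: carve the spectrogram into 16x16 patches and project each to a token.
        self.patch_embed = nn.Conv2d(1, embed_dim, kernel_size=16, stride=16)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Step 3: self-attention over the patch tokens (positional embeddings omitted for brevity).
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Step 4: classification head on the [CLS] token.
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        x = self.patch_embed(spec.unsqueeze(1))            # (B, D, n_mels/16, frames/16)
        x = x.flatten(2).transpose(1, 2)                   # (B, num_patches, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = self.encoder(torch.cat([cls, x], dim=1))
        return self.head(x[:, 0])                          # per-class logits

# Step 5: for clips longer than one window, classify overlapping windows and max-pool.
def classify_long_clip(model: nn.Module, waveform: torch.Tensor,
                       window_s: float = 10.0, hop_s: float = 5.0) -> torch.Tensor:
    win, hop = int(window_s * SAMPLE_RATE), int(hop_s * SAMPLE_RATE)
    if len(waveform) < win:                                # pad clips shorter than one window
        waveform = nn.functional.pad(waveform, (0, win - len(waveform)))
    windows = waveform.unfold(0, win, hop)                 # (num_windows, win); trailing remainder dropped
    specs = torch.stack([preprocess(w) for w in windows])  # (num_windows, n_mels, frames)
    return model(specs).max(dim=0).values                  # clip-level logits via max-pooling

# Usage with dummy audio: 30 s of noise, 50 classes, multi-label probabilities via sigmoid.
model = SpectrogramClassifier(num_classes=50).eval()
with torch.no_grad():
    probs = torch.sigmoid(classify_long_clip(model, torch.randn(30 * SAMPLE_RATE)))
print(probs.shape)  # torch.Size([50])
```

In practice you would fine-tune from a pretrained checkpoint rather than train such a model from scratch; ImageNet or masked-spectrogram pretraining is what makes transformer classifiers competitive.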

Current Landscape

Audio classification in 2025 has been transformed by the same self-supervised pretraining revolution that reshaped NLP and vision. Vision Transformer-based architectures (AST, BEATs, EAT) treat spectrograms as images and leverage ImageNet pretraining or masked audio modeling. AudioSet remains the central benchmark, but its noisy labels make progress hard to measure precisely. CLAP has opened up zero-shot classification, analogous to CLIP's impact on vision. Production deployments use lighter CNN models (YAMNet, PANNs) for latency-sensitive applications while transformer models handle quality-critical offline classification.
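
As a concrete illustration of the zero-shot setup, the sketch below assumes the Hugging Face transformers zero-shot-audio-classification pipeline and the laion/clap-htsat-unfused checkpoint; the audio file name and candidate labels are placeholders.

```python
# Zero-shot audio classification with a CLAP checkpoint (checkpoint id assumed; decoding
# a file path requires ffmpeg, or pass a 48 kHz mono numpy waveform instead).
from transformers import pipeline

classifier = pipeline(
    task="zero-shot-audio-classification",
    model="laion/clap-htsat-unfused",
)

result = classifier(
    "street_recording.wav",  # hypothetical local file
    candidate_labels=["dog barking", "car horn", "children playing", "jackhammer"],
)
print(result)  # [{"score": ..., "label": ...}, ...] sorted by score
```

Because the labels are free-form text, the same model can be pointed at a new taxonomy just by editing the candidate_labels list, with no retraining.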

Key Challenges

Class imbalance in AudioSet: common sounds (speech, music) have 100x more examples than rare events (gunshots, glass breaking)

Noisy labels: AudioSet annotations are crowd-sourced and contain ~15-20% label noise, capping effective model accuracy

Real-world audio contains overlapping events — a single clip may have speech, music, and traffic simultaneously

Domain shift between AudioSet (YouTube clips) and deployment environments (surveillance, IoT, medical)

Temporal resolution: classifying when events occur within a clip (sound event detection) is harder than clip-level classification

Quick Recommendations

Best accuracy (AudioSet)

BEATs or EAT

50+ mAP on AudioSet; self-supervised pretraining captures rich audio representations

Zero-shot audio classification

CLAP (LAION) or Whisper-AT

Classify audio with arbitrary text descriptions without task-specific training

Production (lightweight)

PANNs CNN14 or YAMNet

Efficient CNN-based classifiers that run in real time on edge devices (a minimal YAMNet sketch follows this list)

Environmental sound monitoring

AST fine-tuned on ESC-50 or UrbanSound8K

95%+ accuracy on standard environmental sound classification benchmarks

Music classification

MERT or MusicNN

Specialized for music genre, mood, and instrument recognition tasks
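
For the lightweight production route, the sketch below assumes the YAMNet model published on TensorFlow Hub; the Hub URL is believed current but should be checked, and the silent dummy waveform is a stand-in for real audio.

```python
# Minimal YAMNet inference sketch. YAMNet expects 16 kHz mono float32 audio in [-1, 1]
# and scores 521 AudioSet classes per ~1 s frame.
import csv
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

waveform = np.zeros(16000, dtype=np.float32)      # 1 s of silence as a placeholder input
scores, embeddings, log_mel = yamnet(waveform)    # scores: (frames, 521)

# The class map CSV bundled with the model maps score indices to display names.
with tf.io.gfile.GFile(yamnet.class_map_path().numpy().decode("utf-8")) as f:
    class_names = [row["display_name"] for row in csv.DictReader(f)]

clip_scores = scores.numpy().mean(axis=0)         # average frame scores to clip level
print(class_names[int(clip_scores.argmax())])     # top predicted class for the clip
```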

What's Next

Expect audio classification to merge into general audio understanding — a single model that classifies, captions, answers questions about, and retrieves audio. Fine-grained temporal event detection (not just 'this clip contains a dog bark' but 'a dog barks at 2.3 seconds for 0.5 seconds') will improve through frame-level models. On-device classification for smart home, wearable, and IoT applications will drive efficient architectures under 5M parameters.

Benchmarks & SOTA

VocalSound

0 results

The VocalSound dataset is a collection of over 21,000 crowdsourced audio recordings of non-speech human vocalizations, including laughter, coughs, sneezes, sighs, throat clearing, and sniffs. It was created to improve vocal sound recognition models and contains metadata about the speakers, such as age, gender, native language, and country. The dataset is designed to help researchers develop more robust and accurate systems for tasks like automatic transcription and health monitoring.

No results tracked yet

ESC-50

0 results

The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification.

No results tracked yet

GTZAN Genre

0 results

The GTZAN Genre dataset is a benchmark collection of 1,000 audio tracks used for music genre classification tasks. It contains 100 tracks for each of 10 genres, with each track being a 30-second .wav file recorded at 22,050 Hz, mono, 16-bit format. This widely used dataset has been instrumental in the development of music information retrieval (MIR) systems, although it has known issues like mislabeled tracks.

No results tracked yet

Speech Command V2

0 results

The Speech Command V2 dataset is an audio collection of 65,000 one-second clips covering a core set of wake words for keyword spotting systems. It includes 20 core command words like "yes," "no," and "go," as well as 10 auxiliary words such as "marvin" and "wow," plus background noise files recorded by thousands of contributors through crowdsourcing and released under a Creative Commons BY 4.0 license. The dataset has two versions (configurations), "v0.01" and "v0.02"; "v0.02" contains more words (see the Source Data section for details).

Split sizes:
v0.01: 51,093 train / 6,799 validation / 3,081 test
v0.02: 84,848 train / 9,982 validation / 4,890 test

No results tracked yet

AudioSet

2017

0 results

2M+ human-labeled 10-second YouTube video clips covering 632 audio event classes.

No results tracked yet

ESC-50

Environmental Sound Classification 50

2015

0 results

2,000 environmental audio recordings organized into 50 classes (animals, natural soundscapes, etc.).

No results tracked yet

Related Tasks

Audio-Language Models

Audio-Language Models (ALMs) are a form of artificial intelligence that extends natural language processing (NLP) to the domain of audio, enabling computers to understand, generate, and reason about sounds and speech by integrating audio data with language understanding. Trained on audio-text data, ALMs bridge the gap between acoustic signals and linguistic meaning, allowing for tasks like zero-shot audio recognition, audio captioning, and generative audio, such as text-to-audio synthesis.

Voice cloning

Voice cloning is a type of audio deepfake technology that uses machine learning to create a digital replica of a specific person's voice, synthesizing spoken audio that mimics their vocal characteristics like pitch and tone. While it has positive uses, such as generating audiobooks or helping people who have lost their voice, it is also used for malicious purposes, including creating convincing scams where fraudsters impersonate individuals.

Automatic Speech Recognition

Automatic Speech Recognition (ASR) is the technology that converts spoken language into written text. ASR systems process audio signals containing human speech and transcribe them into readable text format. These systems use acoustic models, language models, and often neural networks to recognize phonemes, words, and sentences from audio input. ASR is foundational for applications like voice assistants (Siri, Alexa), transcription services, voice-controlled systems, and accessibility tools for the hearing impaired.

Text-to-speech

Text-to-speech (TTS) is technology that converts written text into natural-sounding audio, also known as "read aloud" technology or speech synthesis. It works by analyzing text to understand words, punctuation, and sentence structure, then generating phonetic representations of those words before synthesizing them into a human-like voice output. TTS is a crucial form of assistive technology and a key component of natural language processing, making digital content accessible and improving user interaction in numerous applications.
